ICRA2026

Abstract:
Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle bounding boxes to decompose the static and dynamic elements for effective reconstruction, limiting their applications for in-the-wild scenarios. To facilitate efficient 3D scene reconstruction without costly annotations, we propose a self-supervised street Gaussian (S3Gaussian) method to decompose dynamic and static elements from 4D consistency. We represent each scene with 3D Gaussians to preserve the explicitness and further accompany them with a spatial-temporal field network to compactly model the 4D dynamics. We conduct extensive experiments on the challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our S3Gaussian demonstrates the ability to decompose static and dynamic scenes and achieves the best performance without using 3D annotations.

Abstract:
This paper addresses the challenge of registering two rigid semantic scene graphs, an essential capability for autonomous agents to align with remote agents or prior maps. Traditional methods rely on hand-crafted descriptors or ground-truth annotations, limiting their applicability in real-world scenarios. To address these issues, we propose a scene graph network that encodes multiple semantic node modalities: open-set semantic features, local topology with spatial awareness, and shape features. These modalities are fused to form compact semantic node representations for matching layers to perform coarse-to-fine correspondence search. A robust pose estimator in the back-end determines the transformation based on these correspondences. Our approach preserves a sparse, hierarchical scene representation, requiring fewer GPU resources and less communication bandwidth in multi-agent tasks. Additionally, we introduce a novel data generation method using vision foundation models and a semantic mapping module, avoiding the need for ground-truth annotations. We validate our method on a two-agent SLAM benchmark, demonstrating superior registration success and lower communication bandwidth.

Abstract:
Diffusion and flow matching models have emerged as powerful robot policies, enabling Vision-Language-Action (VLA) models to generalize across diverse scenes and instructions. Yet, when trained via imitation learning, their high generative capacity makes them sensitive to noise in human demonstrations: jerks, pauses, and jitter, which reduce action coherence. Reduced action coherence causes instability and trajectory drift during deployment, failures that are catastrophic in fine-grained manipulation where precision is crucial. In this paper, we present Action Coherence Guidance (ACG) for VLA models, a training-free test-time guidance algorithm that improves action coherence and thereby yields performance gains. Evaluated on RoboCasa, DexMimicGen, and real-world SO-101 tasks, ACG consistently improves action coherence and boosts success rates across diverse manipulation tasks.

Abstract:
Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noise for the action-based video object segmentation task. Second, we build the first benchmark for action-based video object segmentation under label noise, ActiSeg-NL, adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off where some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. These results establish a clear sensitivity profile of action-based video object segmentation to imperfect annotations and set a benchmark for studying noise-robust learning in embodied perception.

Abstract:
Recent Vision-Language-Action (VLA) models show potential to generalize across embodiments but struggle to quickly align with a new robot's action space when high-quality demonstrations are scarce, especially for bipedal humanoids. We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. TrajBooster (i) extracts 6D dual-arm end-effector trajectories from real-world wheeled humanoids, (ii) retargets them in simulation to Unitree G1 with a whole-body controller trained via a heuristic-enhanced harmonized online DAgger to lift low-dimensional trajectory references into feasible high-dimensional whole-body actions, and (iii) forms heterogeneous triplets that couple source vision/language with target humanoid-compatible actions to post-pre-train a VLA, followed by only 10 minutes of teleoperation data collection on the target humanoid domain. Deployed on Unitree G1, our policy achieves beyond-tabletop household tasks, enabling squatting, cross-height manipulation, and coordinated whole-body motion with markedly improved robustness and generalization. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance, reducing reliance on costly same-embodiment data while enhancing action space understanding and zero-shot skill transfer capabilities.

Abstract:
Underwater stereo depth estimation provides accurate 3D geometry for robotics tasks such as navigation, inspection, and mapping, offering metric depth from low-cost passive cameras while avoiding the scale ambiguity of monocular methods. However, existing approaches face two critical challenges: (i) parameter-efficiently adapting large vision foundation encoders to the underwater domain without extensive labeled data, and (ii) tightly fusing globally coherent but scale-ambiguous monocular priors with locally metric yet photometrically fragile stereo correspondences. To address these challenges, we propose StereoAdapter, a parameter-efficient self-supervised framework that integrates a LoRA-adapted monocular foundation encoder with a recurrent stereo refinement module. We further introduce dynamic LoRA adaptation for efficient rank selection and pre-training on the synthetic UW-StereoDepth-40K dataset to enhance robustness under diverse underwater conditions. Comprehensive evaluations on both simulated and real-world benchmarks show improvements of 6.11% on TartanAir and 5.12% on SQUID compared to state-of-the-art methods, while real-world deployment with the BlueROV2 robot further demonstrates the consistent robustness of our approach.
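
The abstract above relies on LoRA adaptation of a frozen encoder. As a rough illustration only, the sketch below shows the generic LoRA idea on a single linear layer: a frozen weight `W` plus a trainable low-rank update scaled by `alpha / r`. The function name `lora_linear`, the shapes, and the scaling are assumptions chosen for illustration; they are not taken from StereoAdapter, and the dynamic rank selection mentioned in the abstract is not modeled.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha):
    """Frozen weight W plus a low-rank update (alpha / r) * B @ A (generic LoRA form)."""
    r = A.shape[0]                                   # adapter rank
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2                             # hypothetical sizes
W = rng.normal(size=(d_out, d_in))                   # frozen pretrained weight
A = rng.normal(size=(r, d_in))                       # trainable down-projection
B = np.zeros((d_out, r))                             # trainable up-projection, zero-initialised
x = rng.normal(size=(3, d_in))                       # batch of 3 feature vectors

y = lora_linear(x, W, A, B, alpha=4.0)
trainable = A.size + B.size                          # parameters that would be updated
frozen = W.size                                      # parameters left untouched
```

With `B` initialised to zero, the adapted layer starts out identical to the frozen layer, and only `A` and `B` would be trained.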

Abstract:
Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of multi-turn dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves real-time dialogues through KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks show state-of-the-art performance with low latency, ensuring robustness and efficiency in real-world deployment.
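
To make the bounded-context idea above concrete, here is a minimal sketch of a sliding window of recent turns combined with a size-capped memory of older tokens. The class `SlowFastContext`, its parameters, and the "keep every other token" rule are hypothetical stand-ins; in particular, the stride-based rule is only a placeholder for the 3D-aware token pruning described in the abstract, and KV cache reuse is not modeled.

```python
from collections import deque

class SlowFastContext:
    """Sliding window of recent turns (fast) plus a capped memory of older tokens (slow)."""

    def __init__(self, window_turns, memory_budget):
        self.window = deque()
        self.window_turns = window_turns
        self.memory = []
        self.memory_budget = memory_budget

    def add_turn(self, tokens):
        self.window.append(list(tokens))
        if len(self.window) > self.window_turns:
            evicted = self.window.popleft()
            # Placeholder pruning (assumption): keep every other token of the evicted turn.
            self.memory.extend(evicted[::2])
            # Drop the oldest memory tokens so the memory never exceeds its budget.
            overflow = max(0, len(self.memory) - self.memory_budget)
            del self.memory[:overflow]

    def context(self):
        # Context passed to the model: compressed memory followed by the recent turns.
        return self.memory + [t for turn in self.window for t in turn]

ctx = SlowFastContext(window_turns=2, memory_budget=3)
for i in range(5):
    ctx.add_turn([10 * i + j for j in range(4)])     # 4 dummy tokens per turn
```

However many turns are added, the context length stays at most `memory_budget` plus the tokens of `window_turns` turns.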

Abstract:
Humanoid motion tracking policies are central to building teleoperation pipelines and hierarchical controllers, yet they face a fundamental challenge: the embodiment gap between humans and humanoid robots. Current approaches address this gap by retargeting human motion data to humanoid embodiments and then training reinforcement learning (RL) policies to imitate these reference trajectories. However, artifacts introduced during retargeting, such as foot sliding, self-penetration, and physically infeasible motion, are often left in the reference trajectories for the RL policy to correct. While prior work has demonstrated motion tracking abilities, it often requires extensive reward engineering and domain randomization to succeed. In this paper, we systematically evaluate how retargeting quality affects policy performance when excessive reward tuning is suppressed. To address issues that we identify with existing retargeting methods, we propose a new retargeting method, General Motion Retargeting (GMR). We evaluate GMR alongside two open-source retargeters, PHC and ProtoMotions2, as well as with a high-quality closed-source dataset from Unitree. Using BeyondMimic for policy training, we isolate retargeting effects without reward tuning. Our experiments on a diverse subset of the LAFAN1 dataset reveal that while most motions can be tracked, artifacts in retargeted data significantly reduce policy robustness, particularly for dynamic or long sequences. GMR consistently outperforms existing open-source methods in both tracking performance and faithfulness to the source motion, achieving perceptual fidelity and policy success rates close to the closed-source baseline.

Abstract:
In this paper, we propose FastViDAR, a novel framework that takes four fisheye camera inputs and produces a full 360° depth map along with per-camera depth, fusion depth, and confidence estimates. Our main contributions are: (1) We introduce an Alternative Hierarchical Attention (AHA) mechanism that efficiently fuses features across views through separate intra-frame and inter-frame windowed self-attention, achieving cross-view feature mixing with reduced overhead. (2) We propose a novel equirectangular projection (ERP) fusion approach that projects multi-view depth estimates to a shared equirectangular coordinate system to obtain the final fusion depth. (3) We generate ERP image-depth pairs using HM3D and 2D-3D-S datasets for comprehensive evaluation, demonstrating competitive zero-shot performance on real datasets while achieving up to 20 FPS on NVIDIA Orin NX embedded hardware.
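
For readers unfamiliar with equirectangular projection, the sketch below maps 3D viewing directions to pixel coordinates in a shared ERP image. The axis convention (x right, y down, z forward), the function name `direction_to_erp`, and the image size are assumptions made for this example; the abstract does not specify FastViDAR's conventions or how per-camera depths are merged once projected.

```python
import numpy as np

def direction_to_erp(d, width, height):
    """Map 3D direction(s) to equirectangular pixel coordinates (u, v).

    Assumed convention: x right, y down, z forward. Longitude spans the
    image width and latitude spans the image height.
    """
    d = np.asarray(d, dtype=float)
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)    # unit directions
    x, y, z = d[..., 0], d[..., 1], d[..., 2]
    lon = np.arctan2(x, z)                               # in [-pi, pi], 0 = forward
    lat = np.arcsin(np.clip(y, -1.0, 1.0))               # in [-pi/2, pi/2], positive = down
    u = (lon / (2.0 * np.pi) + 0.5) * width
    v = (lat / np.pi + 0.5) * height
    return np.stack([u, v], axis=-1)

forward = direction_to_erp([0.0, 0.0, 1.0], width=1024, height=512)
right = direction_to_erp([1.0, 0.0, 0.0], width=1024, height=512)
```

Under this convention the forward direction lands at the image centre and a direction 90° to the right lands three quarters of the way across the image.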

Abstract:
Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. Traditional modular pipelines suffer from latency and cascading errors across perception, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, promising greater performance and efficiency in open-world settings. However, most prior end-to-end approaches still rely on separate localization modules that depend on accurate sensor extrinsic calibration for self-state estimation, thereby limiting generalization across embodiments and environments. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry backbone to ground predictions with absolute metric scale, thereby providing implicit state estimation for accurate localization; (2) reconstructing surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by the aforementioned auxiliary tasks, thereby reducing error propagation. We evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error while metric-aware geometry memory enhances planning consistency and obstacle avoidance, leading to more than a 27.3% improvement over oracle-localization baselines and strong generalization across embodiments and environments. The code and models will be made publicly available upon publication.

Abstract:
Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize. This happens because demonstrations show robots how to do a task but not what matters for that task, causing the model to focus on irrelevant state details. Natural language can more directly specify what the robot should focus on, and, in principle, disambiguate between many reward functions consistent with the demonstrations. However, existing language-conditioned reward learning methods typically treat instructions as simple conditioning signals, without fully exploiting their potential to resolve ambiguity. Moreover, real instructions are often ambiguous themselves, so naive conditioning is unreliable. Our key insight is that these two input types carry complementary information: demonstrations show how to act, while language specifies what is important. We propose Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to combine the strengths of both input types. Masked IRL infers state-relevance masks from language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning to clarify them in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample-efficiency, generalization, and robustness to ambiguous language.
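
One way to picture "invariance to irrelevant state components" is a penalty that measures how much a reward changes when only the masked-out components are perturbed. The sketch below is an assumption-laden illustration: the function `invariance_penalty`, the Gaussian perturbation, and the linear toy rewards are hypothetical and are not Masked IRL's actual loss. In the paper the mask is inferred from language by an LLM; here it is hard-coded.

```python
import numpy as np

def invariance_penalty(reward_fn, state, mask, rng, n_samples=16, noise_scale=1.0):
    """Mean squared reward change when only masked-out (irrelevant) components are perturbed."""
    base = reward_fn(state)
    # Noise is zeroed on relevant components (mask == 1), so only irrelevant ones move.
    noise = rng.normal(scale=noise_scale, size=(n_samples, state.size)) * (1.0 - mask)
    perturbed = np.array([reward_fn(state + n) for n in noise])
    return float(np.mean((perturbed - base) ** 2))

rng = np.random.default_rng(0)
state = np.array([0.3, -1.2, 0.8, 2.0])
mask = np.array([1.0, 1.0, 0.0, 0.0])                # hypothetical: first two components matter

w_invariant = np.array([1.0, -2.0, 0.0, 0.0])        # ignores irrelevant components
w_spurious = np.array([1.0, -2.0, 0.5, 0.0])         # leaks an irrelevant component

pen_invariant = invariance_penalty(lambda s: float(w_invariant @ s), state, mask, rng)
pen_spurious = invariance_penalty(lambda s: float(w_spurious @ s), state, mask, rng)
```

A reward that depends on a masked-out component incurs a positive penalty, while a reward that ignores those components incurs none.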

Abstract:
Learning to navigate in dynamic and complex open-world environments is a critical yet challenging capability for autonomous robots. Existing approaches often rely on cascaded modular frameworks, which require extensive hyperparameter tuning or learning from limited real-world demonstration data. In this paper, we propose Navigation Diffusion Policy (NavDP), an end-to-end network trained solely in simulation that enables zero-shot sim-to-real transfer across diverse environments and robot embodiments. The core of NavDP is a unified transformer-based architecture that jointly learns trajectory generation and trajectory evaluation, both conditioned solely on local RGB-D observations. By learning to predict critic values for contrastive trajectory samples, our proposed approach effectively leverages supervision from privileged information available in simulation, thereby fostering accurate spatial understanding and enabling the distinction between safe and dangerous behaviors. To support this, we develop an efficient data generation pipeline in simulation and construct a large-scale dataset encompassing over one million meters of navigation experience across 3,000 scenes. Empirical experiments in both simulated and real-world environments demonstrate that NavDP significantly outperforms prior state-of-the-art methods. Furthermore, we identify key factors influencing the generalization performance of NavDP. The dataset and code are publicly available at https://wzcai99.github.io/navigation-diffusion-policy.github.io.

Abstract:
Lightweight aerial swarms have potential applications in scenarios where larger drones fail to operate efficiently. The primary foundation for lightweight aerial swarms is efficient relative localization, which enables cooperation and collision avoidance. Computing the real-time position is challenging due to extreme resource constraints. This paper presents an autonomous relative localization technique for lightweight aerial swarms without infrastructure by fusing ultra-wideband wireless distance measurements and the shared state information (e.g., velocity, yaw rate, height) from neighbors. This is the first fully autonomous, tiny, fast, and accurate relative localization scheme implemented on a team of 13 lightweight (33 grams) and resource-constrained (168 MHz MCU with 192 KB memory) aerial vehicles. The proposed resource-constrained swarm ranging protocol is scalable, and a surprising theoretical result is discovered: the unobservability poses no issues because the state drift leads to control actions that make the state observable again. In experiments, a position error below 0.2 m is achieved at a frequency of 16 Hz for as many as 13 drones. The code is open-sourced, and the proposed technique is relevant not only for tiny drones but can be readily applied to many other resource-restricted robots. Video and code can be found at https://shushuai3.github.io/autonomous-swarm/.

Abstract:
Current autonomous driving vehicles rely mainly on their individual sensors to understand surrounding scenes and plan for future trajectories, which can be unreliable when the sensors are malfunctioning or occluded. To address this problem, cooperative perception methods via vehicle-to-vehicle (V2V) communication have been proposed, but they have tended to focus on perception tasks like detection or tracking. How those approaches contribute to overall cooperative planning performance is still under-explored. Inspired by recent progress using Large Language Models (LLMs) to build autonomous driving systems, we propose a novel problem setting that integrates a Multimodal LLM into cooperative autonomous driving, with the proposed Vehicle-to-Vehicle Question-Answering (V2V-QA) dataset and benchmark. We also propose our baseline method Vehicle-to-Vehicle Multimodal Large Language Model (V2V-LLM), which uses an LLM to fuse perception information from multiple connected autonomous vehicles (CAVs) and answer various types of driving-related questions: grounding, notable object identification, and planning. Experimental results show that our proposed V2V-LLM can be a promising unified model architecture for performing various tasks in cooperative autonomous driving, and outperforms other baseline methods that use different fusion approaches. Our work also creates a new research direction that can improve the safety of future autonomous driving systems. Our code and dataset are released to facilitate open-source research at https://eddyhkchiu.github.io/v2vllm.github.io/.

Abstract:
Collaborative perception aims to extend sensing coverage and improve perception accuracy by sharing information among multiple agents. However, due to differences in viewpoints and spatial positions, agents often acquire heterogeneous observations. Existing intermediate fusion methods primarily focus on aligning similar features, often overlooking the perceptual diversity among agents. To address this limitation, we propose CoBEVMoE, a novel collaborative perception framework that operates in the Bird's Eye View (BEV) space and incorporates a Dynamic Mixture-of-Experts (DMoE) architecture. In DMoE, each expert is dynamically generated based on the input features of a specific agent, enabling it to extract distinctive and reliable cues while attending to shared semantics. This design allows the fusion process to explicitly model both feature similarity and heterogeneity across agents. Furthermore, we introduce a Dynamic Expert Metric Loss (DEML) to enhance inter-expert diversity and improve the discriminability of the fused representation. Extensive experiments on the OPV2V and DAIR-V2X-C datasets demonstrate that CoBEVMoE achieves state-of-the-art performance. Specifically, it improves the IoU for camera-based BEV segmentation by 1.5% on OPV2V and the AP@0.5 for LiDAR-based 3D object detection by 3.0% on DAIR-V2X-C, verifying the effectiveness of expert-based heterogeneous feature modeling in multi-agent collaborative perception. The source code will be made available.

Abstract:
As the embodiment gap between a robot and a human narrows, new opportunities arise to leverage datasets of humans interacting with their surroundings for robot learning. We propose a novel technique for training sensorimotor policies with reinforcement learning by imitating predictive models of human motions. Our key insight is that the motion of keypoints on human-inspired robot end-effectors closely mirrors the motion of corresponding human body keypoints. This enables us to use a model trained to predict future motion on human data zero-shot on robot data. We train sensorimotor policies to track the predictions of such a model, conditioned on a history of past robot states, while optimizing a relatively sparse task reward. This approach entirely bypasses gradient-based kinematic retargeting and adversarial losses, which limit existing methods from fully leveraging the scale and diversity of modern human-scene interaction datasets. Empirically, we find that our approach can work across robots and tasks, outperforming existing baselines by a large margin. In addition, we find that tracking a human motion model can substitute for carefully designed dense rewards and curricula in manipulation tasks. Code, data and qualitative results are available at https://dynamicsprediction.space

Abstract:
Diffusion policies (DPs) achieve state-of-the-art performance on complex manipulation tasks by learning from large-scale demonstration datasets, often spanning multiple embodiments and environments. However, they cannot guarantee safe behavior and therefore require external safety mechanisms. These mechanisms alter actions in ways unseen during training, causing unpredictable behavior and performance degradation. To address these problems, we propose path-consistent safety filtering (PACS) for DPs. Our approach performs path-consistent braking on a trajectory computed from the sequence of generated actions. In this way, we keep the execution consistent with the training distribution of the policy, maintaining the learned, task-completing behavior. To enable real-time deployment and handle uncertainties, we verify safety using set-based reachability analysis. Our experimental evaluation in simulation and on three challenging real-world human-robot interaction tasks shows that PACS (a) provides formal safety guarantees in dynamic environments, (b) preserves task success rates, and (c) outperforms reactive safety approaches, such as control barrier functions, by up to 68% in terms of task success. Videos are available at our project website: tum-lsy.github.io/pacs.
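
The idea of braking along a planned path, rather than deviating from it, can be illustrated by re-timing the progress along a fixed sequence of waypoints. The sketch below is a simplified assumption: the functions `braking_progress` and `follow_path`, the linear speed ramp, and the waypoint-index parametrisation are hypothetical, and the set-based reachability verification described in the abstract is not modeled.

```python
import numpy as np

def braking_progress(n_steps, brake_start, brake_steps):
    """Path progress (in waypoint units) per time step.

    Nominal speed is one waypoint per step; from brake_start the speed
    ramps linearly to zero over brake_steps steps and then stays at zero.
    """
    speed = np.ones(n_steps)
    ramp = np.linspace(1.0, 0.0, brake_steps + 1)[1:]
    end = min(n_steps, brake_start + brake_steps)
    speed[brake_start:end] = ramp[: end - brake_start]
    speed[end:] = 0.0
    return np.concatenate([[0.0], np.cumsum(speed)])

def follow_path(waypoints, progress):
    """Interpolate positions on the original polyline at the given progress values."""
    idx = np.arange(len(waypoints))
    progress = np.clip(progress, 0, len(waypoints) - 1)
    return np.stack(
        [np.interp(progress, idx, waypoints[:, k]) for k in range(waypoints.shape[1])],
        axis=-1,
    )

t = np.linspace(0.0, 1.0, 11)
waypoints = np.stack([t, t ** 2], axis=-1)           # hypothetical planned 2D path
progress = braking_progress(n_steps=10, brake_start=3, brake_steps=4)
braked = follow_path(waypoints, progress)            # positions stay on the planned path
```

Only the timing changes: every braked position lies on the original polyline, and the motion comes to rest partway along it.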

Abstract:
Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have demonstrated strong data regeneration capabilities, they can only learn from the fixed data distribution of existing datasets and cannot achieve high-quality, controllable panoramic generation. In this paper, we propose the first panoramic generation method, Percep360, for autonomous driving. Percep360 enables coherent generation of panoramic data with control signals based on the stitched panoramic data. Percep360 focuses on two key aspects: coherence and controllability. Specifically, to overcome the inherent information loss caused by the pinhole sampling process, we propose the Local Scenes Diffusion Method (LSDM). LSDM reformulates the panorama generation as a spatially continuous diffusion process, bridging the gaps between different data distributions. Additionally, to achieve the controllable generation of panoramic images, we propose a Probabilistic Prompting Method (PPM). PPM dynamically selects the most relevant control cues, enabling controllable panoramic image generation. We evaluate the effectiveness of the generated images from three perspectives: image quality assessment (i.e., no-reference and with-reference), controllability, and their utility in real-world Bird's Eye View (BEV) segmentation. Notably, the generated data consistently outperforms the original stitched images in no-reference quality metrics and enhances downstream perception models, leading to an improvement of 2.5% in mIoU for panoramic BEV segmentation. The source code will be publicly available.

Abstract:
Teleoperation presents a promising paradigm for remote control and robot proprioceptive data collection. Despite recent progress, current teleoperation systems still suffer from limitations in efficiency and ergonomics, particularly in challenging scenarios. In this paper, we propose CaFe-TeleVision, a coarse-to-fine teleoperation system with immersive situated visualization for enhanced ergonomics. At its core, a coarse-to-fine control mechanism is proposed in the retargeting module to bridge workspace disparities, jointly optimizing efficiency and physical ergonomics. To stream immersive feedback with adequate visual cues for human vision systems, an on-demand situated visualization technique is integrated in the perception module, which reduces the cognitive load for multi-view processing. The system is built on a humanoid collaborative robot and validated with six challenging bimanual manipulation tasks. A user study with 24 participants confirms that CaFe-TeleVision enhances ergonomics with statistical significance, indicating a lower task load and a higher user acceptance during teleoperation. Quantitative results also validate the superior performance of our system across six tasks, surpassing comparative methods by up to 28.89% in success rate and reducing completion time by 26.81%. The system will be open-sourced later.

Abstract:
Current state-of-the-art autonomous vehicles could face safety-critical situations when their local sensors are occluded by large objects on the road nearby. Vehicle-to-vehicle (V2V) cooperative autonomous driving is proposed to address this problem. More recent work further adopts a new approach that applies Multimodal Large Language Models (MLLMs) for cooperative autonomous driving due to their potential multimodal understanding and reasoning abilities. However, graph-of-thoughts reasoning frameworks have not been considered in prior research on V2V cooperative autonomous driving. In this paper, we propose a novel graph-of-thoughts framework specifically designed for MLLM-based cooperative autonomous driving. Our graph-of-thoughts includes our proposed novel ideas of occlusion-aware perception and planning-aware prediction. We curate the V2V-GoT-QA dataset and develop the V2V-GoT model for training and testing the cooperative driving graph-of-thoughts. Our experimental results show that our proposed method outperforms other baselines in cooperative perception, prediction, and planning tasks. Our code and dataset are released to facilitate open-source research at https://eddyhkchiu.github.io/v2vgot.github.io/.

Abstract:
Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences (1.32M frames) rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines (e.g., Depth-Anything-v2, DepthCrafter), and a normal variant (DKT-Normal) sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame (832×480). Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: Diffusion knows transparency. Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.

Abstract:
Dexterous grasp datasets are vital for embodied intelligence, but mostly emphasize grasp stability, ignoring functional grasps needed for tasks like opening bottle caps or holding cup handles. Most rely on bulky, costly, and hard-to-control high-DOF Shadow Hands. Inspired by the human hand's underactuated mechanism, we establish UniFucGrasp, a universal functional grasp annotation strategy and dataset for multiple dexterous hand types. Based on biomimicry, it maps natural human motions to diverse hand structures and uses geometry-based force closure to ensure functional, stable, human-like grasps. This method supports low-cost, efficient collection of diverse, high-quality functional grasps. Finally, we establish the first multi-hand functional grasp dataset and provide a synthesis model to validate its effectiveness. Experiments on the UFG dataset, IsaacSim, and complex robotic tasks show that our method improves functional manipulation accuracy and grasp stability and demonstrates improved adaptability across multiple robotic hands, helping to alleviate annotation cost and generalization challenges in dexterous grasping.

Abstract:
Vision-based tactile sensors (VBTS) face an inherent trade-off in tactile skin design. Opaque ink markers enable accurate force and tangential displacement estimation but occlude geometric features essential for object and texture classification. Conversely, markerless skins preserve surface details yet provide limited capability for tangential motion estimation. Existing approaches, including UV illumination and learning-based virtual marker transfer, increase hardware complexity or computational cost. We present a novel tactile skin with translucent, tinted markers that balances the marker-based and markerless modes for VBTS. This design enables concurrent tangential displacement tracking, force estimation, and preservation of surface geometry. It integrates directly with the GelSight sensor family, requiring no additional hardware and minimal software modification. Experimental evaluation demonstrates that translucent skin improves overall sensing performance relative to both opaque-marker and markerless configurations. It achieves 99.17% accuracy in object classification, 93.51% in texture classification, 97% point retention in tangential displacement tracking, and a 66% reduction in total force error. These results indicate that translucent skin substantially mitigates the traditional trade-off between marker-based and markerless tactile sensing, thereby expanding the applicability of multi-modal VBTS in tactile robotics.

Abstract:
Diffusion-based visuomotor policies excel at modeling action distributions but are inference-inefficient, since recursively denoising from noise to policy requires many steps and heavy UNet backbones, which hinders deployment on resource-constrained robots. Flow matching alleviates the sampling burden by learning a one-step vector field, yet prior implementations still inherit large UNet-style architectures. In this work, we present KAN-We-Flow, a flow-matching policy that draws on recent advances in Receptance Weighted Key Value (RWKV) and Kolmogorov-Arnold Networks (KAN) from vision to build a lightweight and highly expressive backbone for 3D manipulation. Concretely, we introduce an RWKV-KAN block: an RWKV first performs efficient sequence/spatial mixing to propagate task context, and a subsequent GroupKAN layer applies learnable spline-based, groupwise functional mappings to perform feature-wise nonlinear calibration of the action mapping on RWKV outputs. Moreover, we introduce an Action Consistency Regularization (ACR), a lightweight auxiliary loss that enforces alignment between predicted action trajectories and expert demonstrations via Euler extrapolation, providing additional supervision to stabilize training and improve policy precision. Without resorting to large UNets, our design reduces parameters by 86.8%, maintains fast runtime, and achieves state-of-the-art success rates on Adroit, Meta-World, and DexArt benchmarks.

Abstract:
Ergodic control synthesizes optimal coverage behaviors over spatial distributions for nonlinear systems. However, existing formulations model the robot as a non-volumetric point, whereas in practice a robot interacts with the environment through its body and sensors with physical volume. In this work, we introduce a new ergodic control formulation that optimizes spatial coverage using a volumetric state representation. Our method preserves the asymptotic coverage guarantees of ergodic control, adds minimal computational overhead for real-time control, and supports arbitrary sample-based volumetric models. We evaluate our method across search and manipulation tasks (with multiple robot dynamics and end-effector geometries or sensor models) and show that it improves coverage efficiency by more than a factor of two while maintaining a 100% task completion rate across all experiments, outperforming the standard ergodic control method. Finally, we demonstrate the effectiveness of our method on a robot arm performing mechanical erasing tasks. Project website: https://murpheylab.github.io/vec/.

Abstract:
Conventional passivity-based torque controllers for manipulators are typically unconstrained, which can lead to safety violations under external perturbations. In this paper, we employ viability theory to pre-compute safe sets in the state-space of joint positions and velocities. These viable sets, constructed via data-driven and analytical methods for self-collision avoidance, external object collision avoidance and joint-position and joint-velocity limits, provide constraints on joint accelerations and thus joint torques via the robot dynamics. A quadratic programming-based control framework enforces these constraints on a passive controller tracking a dynamical system, ensuring the robot states remain within the safe set over an infinite time horizon. We validate the proposed approach through simulations and hardware experiments on a 7-DoF Franka Emika manipulator. In comparison to a baseline constrained passive controller, our method operates at higher control-loop rates and yields smoother trajectories.

Abstract:
Vision-language-action policies learn manipulation skills across tasks, environments and embodiments through large-scale pre-training. However, their ability to generalize to novel robot configurations remains limited. Most approaches emphasize model size, dataset scale and diversity while paying less attention to the design of action spaces. This leads to the configuration generalization problem, which requires costly adaptation. We address this challenge by formulating cross-embodiment pre-training as designing policies equivariant to embodiment configuration transformations. Building on this principle, we propose a framework that (i) establishes an embodiment equivariance theory for action space and policy design, (ii) introduces an action decoder that enforces configuration equivariance, and (iii) incorporates a geometry-aware network architecture to enhance embodiment-agnostic spatial reasoning. Extensive experiments in both simulation and real-world settings demonstrate that our approach improves pre-training effectiveness and enables efficient fine-tuning on novel robot embodiments. Our code is available at the anonymous repository: https://github.com/hhcaz/e2vla

Abstract:
Establishing reliable image correspondences is essential for many robotic vision problems. However, existing methods often struggle in challenging scenarios with large viewpoint changes or textureless regions, where incorrect correspondences may still receive high similarity scores. This is mainly because conventional models rely solely on feature similarity, lacking an explicit mechanism to estimate the reliability of predicted matches, leading to overconfident errors. To address this issue, we propose SURE, a Semi-dense Uncertainty-REfined matching framework that jointly predicts correspondences and their confidence by modeling both aleatoric and epistemic uncertainties. Our approach introduces a novel evidential head for trustworthy coordinate regression, along with a lightweight spatial fusion module that enhances local feature precision with minimal overhead. We evaluated our method on multiple standard benchmarks, where it consistently outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency. The code will be released upon publication to facilitate future research.

Abstract:
Industrial workflows demand adaptive and trustworthy assistance that can operate under limited computing, connectivity, and strict privacy constraints. In this work, we present MICA (Multi-Agent Industrial Coordination Assistant), a perception-grounded and speech-interactive system that delivers real-time guidance for assembly, troubleshooting, part queries, and maintenance. MICA coordinates five role-specialized language agents, audited by a safety checker, to ensure accurate and compliant support. To achieve robust step understanding, we introduce Adaptive Step Fusion (ASF), which dynamically blends expert reasoning with online adaptation from natural speech feedback. Furthermore, we establish a new multi-agent coordination benchmark across representative task categories and propose evaluation metrics tailored to industrial assistance, enabling systematic comparison of different coordination topologies. Our experiments demonstrate that MICA consistently improves task success, reliability, and responsiveness over baseline structures, while remaining deployable on practical offline hardware. Together, these contributions highlight MICA as a step toward deployable, privacy-preserving multi-agent assistants for dynamic factory environments. The source code will be made publicly available at https://github.com/Kratos-Wen/MICA.

Abstract:
Adapting trajectories to dynamic situations and user preferences is crucial for robot operation in unstructured environments with non-expert users. Natural language enables users to express these adjustments in an interactive manner. We introduce OVITA, an interpretable, open-vocabulary, language-driven framework designed for adapting robot trajectories in dynamic and novel situations based on human instructions. OVITA leverages multiple pre-trained Large Language Models (LLMs) to integrate user commands into trajectories generated by motion planners or those learned through demonstrations. OVITA employs code as an adaptation policy generated by an LLM, enabling users to adjust individual waypoints, thus providing flexible control. Another LLM, which acts as a code explainer, removes the need for expert users, enabling intuitive interactions. The efficacy and significance of the proposed OVITA framework are demonstrated through extensive simulations and real-world environments with diverse tasks involving spatiotemporal variations on heterogeneous robotic platforms such as a KUKA IIWA robot manipulator, Clearpath Jackal ground robot, and CrazyFlie drone.

Abstract:
Many manipulation tasks require careful force modulation. With insufficient force the task may fail, while excessive force could cause damage. The high cost, bulky size and fragility of commercial force/torque (F/T) sensors have limited large-scale, force-aware policy learning. We introduce UMI-FT, a handheld data-collection platform that mounts compact, six-axis force/torque sensors on each finger, enabling finger-level wrench measurements alongside RGB, depth, and pose. Using the multimodal data collected from this device, we train an adaptive compliance policy that predicts position targets, grasp force, and stiffness for execution on standard compliance controllers. In evaluations on three contact-rich, force-sensitive tasks (whiteboard wiping, skewering zucchini, and lightbulb insertion), UMI-FT enables policies that reliably regulate external contact forces and internal grasp forces, outperforming baselines that lack compliance or force sensing. UMI-FT offers a scalable path to learning compliant manipulation from in-the-wild demonstrations. We open-source the hardware and software to facilitate broader adoption at: https://umi-ft.github.io/

Abstract:
End-to-end learning is emerging as a powerful paradigm for robotic manipulation, but its effectiveness is limited by data scarcity and the heterogeneity of action spaces across robot embodiments. In particular, diverse action spaces across different end-effectors create barriers for cross-embodiment learning and skill transfer. We address this challenge through diffusion policies learned in a latent action space that unifies diverse end-effector actions. We first show that we can learn a semantically aligned latent action space for anthropomorphic robotic hands, a human hand, and a parallel jaw gripper using encoders trained with a contrastive loss. Second, we show that by using our proposed latent action space for co-training on manipulation data from different end-effectors, we can utilize a single policy for multi-robot control and obtain up to 25.3% improved manipulation success rates, indicating successful skill transfer despite a significant embodiment gap. Our approach using latent cross-embodiment policies presents a new method to unify different action spaces across embodiments, enabling efficient multi-robot control and data sharing across robot setups. This unified representation significantly reduces the need for extensive data collection for each new robot morphology, accelerates generalization across embodiments, and ultimately facilitates more scalable and efficient robotic learning.

Abstract:
In this paper, we address force-aware control and force distribution in robotic platforms with multi-fingered hands. Given a target goal and force estimates from tactile sensors, we design a controller that adapts the motion of the torso, arm, wrist, and fingers, redistributing forces to maintain stable contact with objects that have varying mass distributions or unstable contacts. To estimate forces, we collect a dataset of tactile signals and ground-truth measurements using five Xela magnetic sensors interacting with indenters, and train force estimators. We then introduce a model-based control scheme that minimizes the distance between the Center of Pressure (CoP) and the centroid of the fingertip contact polygon. Since our method relies on estimated forces rather than raw tactile signals, it has the potential to be applied to any sensor capable of force estimation. We validate our framework on a balancing task with five objects, achieving an 82.7% success rate, and further evaluate it in multi-object scenarios, achieving 80% accuracy.
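
The CoP-to-centroid distance mentioned above can be illustrated with a minimal Python sketch. It is not the authors' controller: it assumes planar fingertip contact points, takes the CoP as the force-weighted mean of those points, and uses the plain vertex mean as a simple stand-in for the contact-polygon centroid. The function names are hypothetical.

```python
from math import hypot

def center_of_pressure(contacts):
    """contacts: list of ((x, y), normal_force) pairs.
    Returns the force-weighted mean contact position (assumed CoP definition)."""
    total = sum(force for _, force in contacts)
    if total <= 0:
        raise ValueError("total normal force must be positive")
    x = sum(p[0] * force for p, force in contacts) / total
    y = sum(p[1] * force for p, force in contacts) / total
    return (x, y)

def vertex_centroid(points):
    """Mean of the contact points; a simple stand-in for the polygon centroid."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def cop_error(contacts):
    """Distance between the CoP and the centroid of the contact points."""
    cop = center_of_pressure(contacts)
    centroid = vertex_centroid([p for p, _ in contacts])
    return hypot(cop[0] - centroid[0], cop[1] - centroid[1])

# Equal forces on a square give zero error; uneven forces shift the CoP.
balanced = [((0, 0), 1.0), ((2, 0), 1.0), ((2, 2), 1.0), ((0, 2), 1.0)]
uneven = [((0, 0), 3.0), ((2, 0), 1.0)]
```

A controller in this spirit would redistribute fingertip forces so that `cop_error` shrinks toward zero; how the paper does so is not specified in the abstract.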

Abstract:
Seams are information-rich components of garments. The presence of different types of seams and their combinations helps to select grasping points for garment handling. In this paper, we propose a new Seam-Informed Strategy (SIS) for finding actions for handling a garment, such as grasping and unfolding a T-shirt. Candidates for a pair of grasping points for a dual-arm manipulator system are extracted using the proposed Seam Feature Extraction Method (SFEM). A pair of grasping points for the robot system is selected by the proposed Decision Matrix Iteration Method (DMIM). The decision matrix is first computed by multiple human demonstrations and updated by the robot execution results to improve the grasping and unfolding performance of the robot. Note that the proposed scheme is trained on real data without relying on simulation. Experimental results demonstrate the effectiveness and generalization ability of the proposed strategy.

Abstract:
Given a dataset of expert trajectories, standard imitation learning approaches typically learn a direct mapping from observations (e.g., RGB images) to actions. However, such methods often overlook the rich interplay between different modalities, i.e., sensory inputs, actions, and rewards, which is crucial for modeling robot behavior and understanding task outcomes. In this work, we propose Multimodal Diffusion Forcing (MDF), a unified framework for learning from multimodal robot trajectories that extends beyond action generation. Rather than modeling a fixed distribution, MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory. This training objective encourages the model to learn temporal and cross-modal dependencies, such as predicting the effects of actions on force signals or inferring states from partial observations. We evaluate MDF on contact-rich, forceful manipulation tasks in simulated and real-world environments. Our results show that MDF not only delivers versatile functionalities, but also achieves strong performance and robustness under noisy observations.

Abstract:
In this paper, we explore adapter tuning and introduce a novel dual-adapter architecture for spatio-temporal multimodal tracking, dubbed DMTrack. The key to DMTrack lies in two simple yet effective modules: a spatio-temporal modality adapter (STMA) and a progressive modality complementary adapter (PMCA). The former, applied to each modality alone, aims to adjust spatio-temporal features extracted from a frozen backbone by self-prompting, which to some extent can bridge the gap between different modalities and thus allows better cross-modality fusion. The latter seeks to facilitate cross-modality prompting progressively with two specially designed pixel-wise shallow and deep adapters. The shallow adapter employs shared parameters between the two modalities, aiming to bridge the information flow between the two modality branches, thereby laying the foundation for the subsequent modality fusion, while the deep adapter modulates the preliminarily fused information flow with pixel-wise inner-modal attention and further generates modality-aware prompts through pixel-wise inter-modal attention. With such designs, DMTrack achieves promising spatio-temporal multimodal tracking performance with merely 0.93M trainable parameters. Extensive experiments on five benchmarks demonstrate that DMTrack achieves state-of-the-art results. Our code and models will be available at https://github.com/Nightwatch-Fox11/DMTrack

Abstract:
Robots operating in shared human environments must not only navigate, interact, and detect their surroundings but also interpret and respond to dynamic and often unpredictable human behaviours. Although recent advances have shown promise in enhancing robotic perception and instruction-following using Vision-Language Models (VLMs), they remain limited in addressing the complexities of multimodal human-robot interactions (HRI). Motivated by this challenge, we introduce a lightweight language-to-vision feedback module that closes the loop between an LLM and the vision encoder in VLMs. The module projects image-token hidden states through a gated Multi-Layer Perceptron (MLP) back into the encoder input, prompting a second pass that reinterprets the scene under text context. We evaluate this approach on three robotics-centred tasks: navigation in a simulated environment (Habitat), sequential scene description (Mementos-Robotics), and human-intention recognition (our HRI dataset). Results show that our method improves Qwen 2.5 (7B) by 3.3% (less distance), +0.057 description score, and +2.93% accuracy, with less than 3% extra parameters; Gemma 3 (4B) and LLaVA OV 1.5 (4B) show mixed navigation results but gain +0.111 and +0.055 in description score and +10.81% and +4.79% in accuracy, respectively, on the latter two tasks.

Abstract:
Semi-supervised 3D object detection (SS3D), aiming to explore unlabeled data for boosting 3D object detectors, has emerged as an active research area in recent years. Some previous methods have shown substantial improvements by either employing heterogeneous teacher models to provide high-quality pseudo labels or enforcing feature-perspective consistency between the teacher and student networks. However, these methods overlook the fact that the model usually tends to exhibit low sensitivity to object geometries with limited labeled data, making it difficult to capture geometric information, which is crucial for enhancing the student model's ability in object perception and localization. In this paper, we propose GeoTeacher to enhance the student model's ability to capture geometric relations of objects with limited training data, especially unlabeled data. We design a keypoint-based geometric relation supervision module that transfers the teacher model's knowledge of object geometry to the student, thereby improving the student's capability in understanding geometric relations. Furthermore, we introduce a voxel-wise data augmentation strategy that increases the diversity of object geometries, thereby further improving the student model's ability to comprehend geometric structures. To preserve the integrity of distant objects during augmentation, we incorporate a distance-decay mechanism into this strategy. Moreover, GeoTeacher can be combined with different SS3D methods to further improve their performance. Extensive experiments on the ONCE and Waymo datasets indicate the effectiveness and generalization of our method, and we achieve new state-of-the-art results. Code will be available at https://github.com/LogosRoboticsGroup/GeoTeacher

Abstract:
This paper presents a novel semantics-aware inspection path planning paradigm called "Semantics-aware Predictive Planning" (SPP). Industrial environments that require the inspection of specific objects or structures (called "semantics"), such as ballast water tanks inside ships, often present structured and repetitive spatial arrangements of the semantics of interest. Motivated by this, we first contribute an algorithm that identifies spatially repeating patterns of semantics (exact or inexact) in a semantic scene graph representation and makes predictions about the evolution of the graph in the unseen parts of the environment using these patterns. Furthermore, two inspection path planning strategies, tailored to ballast water tank inspection, that exploit these predictions are proposed. To assess the performance of the novel predictive planning paradigm, both simulation and experimental evaluations are performed. First, we conduct a simulation study comparing the method against relevant state-of-the-art techniques and further present tests showing its ability to handle imperfect patterns. Second, we deploy our method onboard a collision-tolerant aerial robot operating inside the ballast tanks of two real ships. The results, both in simulation and field experiments, demonstrate significant improvement over the state-of-the-art in terms of inspection time while maintaining equal or better semantic surface coverage.

Abstract:
This paper jointly addresses three key limitations in conventional pedestrian trajectory forecasting: pedestrian perception errors, real-world data collection costs, and person ID annotation costs. We propose a novel framework, RealTraj, that enhances the real-world applicability of trajectory forecasting. Our approach includes two training phases (self-supervised pretraining on synthetic data and weakly-supervised fine-tuning with limited real-world data) to minimize data collection efforts. To improve robustness to real-world errors, we focus on both model design and training objectives. Specifically, we present Det2TrajFormer, a trajectory forecasting model that remains invariant to tracking noise by using past detections as inputs. Additionally, we pretrain the model using multiple pretext tasks, which enhance robustness and improve forecasting performance based solely on detection data. Unlike previous trajectory forecasting methods, our approach fine-tunes the model using only ground-truth detections, reducing the need for costly person ID annotations. In the experiments, we comprehensively verify the effectiveness of the proposed method against the limitations, and the method outperforms state-of-the-art trajectory forecasting methods on multiple datasets.

Abstract:
Multi-modal collaborative perception has attracted great attention for its potential to enhance the safety of autonomous driving. However, current multi-modal approaches remain a "local fusion to communication" sequence, which fuses multi-modal data locally and needs high bandwidth to transmit an individual's feature data before collaborative fusion. EIMC innovatively proposes an early collaborative paradigm. It injects lightweight collaborative voxels, transmitted by neighbor agents, into the ego's local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment. Next, a heatmap-driven consensus protocol identifies exactly where cooperation is needed by computing per-pixel confidence heatmaps. Only the top-K instance vectors located in these low-confidence, high-discrepancy regions are queried from peers, then fused via cross-attention for completion. Afterwards, we apply a refinement fusion that involves collecting the top-K most confident instances from each agent and enhancing their features using self-attention. The above instance-centric messaging reduces redundancy while guaranteeing that critical occluded objects are recovered. Evaluated on OPV2V and DAIR-V2X, EIMC attains 73.01% AP@0.5 while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Code will be publicly released.
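
As a rough illustration of the heatmap-driven query step described above, the following Python sketch selects the K grid cells with the lowest confidence, which an ego agent could then request from its peers. It is a simplification under stated assumptions, not the EIMC protocol: the heatmap is a plain nested list, only confidence is ranked (the discrepancy term is ignored), and `top_k_query_cells` is a hypothetical name.

```python
import heapq

def top_k_query_cells(confidence, k):
    """Return the (row, col) indices of the k lowest-confidence cells,
    ordered from least to most confident."""
    cells = (
        (confidence[r][c], r, c)
        for r in range(len(confidence))
        for c in range(len(confidence[r]))
    )
    return [(r, c) for _, r, c in heapq.nsmallest(k, cells)]

# Hypothetical 2x3 per-pixel confidence heatmap from the ego agent.
heatmap = [
    [0.90, 0.20, 0.80],
    [0.10, 0.70, 0.95],
]
query = top_k_query_cells(heatmap, 2)
```

Restricting the request to these few cells, rather than sending a full feature map, is what keeps the message small in this simplified picture.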

Abstract:
Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly deployed in robotic environments but remain vulnerable to jailbreaking attacks that bypass safety mechanisms and drive unsafe or physically harmful behaviors in the real world. Data-driven defenses such as jailbreak classifiers show promise, yet they struggle to generalize in domains where specialized datasets are scarce, limiting their effectiveness in robotics and other safety-critical contexts. To address this gap, we introduce J-DAPT, a lightweight framework for multimodal jailbreak detection through attention-based fusion and domain adaptation. J-DAPT integrates textual and visual embeddings to capture both semantic intent and environmental grounding, while aligning general-purpose jailbreak datasets with domain-specific reference data. Evaluations across autonomous driving, maritime robotics, and quadruped navigation show that J-DAPT boosts detection accuracy to very high levels (up to 100% in certain scenarios) under our evaluation protocol. These results demonstrate that J-DAPT provides a practical defense for securing VLMs in robotic applications. Additional materials are made available at: https://j-dapt.github.io.

Abstract:
Aerial robots that transport suspended payloads, acting as a form of freely-floating manipulator, have attracted growing interest in recent years. However, the force/torque caused by the payload and residual dynamics introduces unmodeled perturbations to the aerial robot, which negatively affect the closed-loop performance. Different from estimation-like methods, this paper proposes Neural Predictor, a learning-based approach to model the force/torque induced by the payload and residual dynamics as a dynamical system. It yields a hybrid model that combines the first-principles dynamics with the learned dynamics. The hybrid model is then integrated into an MPC framework to improve closed-loop performance. The effectiveness of the proposed framework is verified extensively in both numerical simulations and real-world flight experiments. The results indicate that our approach can accurately capture the force/torque caused by the suspended payload and residual dynamics, respond quickly to their changes, and improve the closed-loop performance significantly. In particular, Neural Predictor outperforms a state-of-the-art learning-based estimator and reduces the force and torque estimation errors by up to 66.15% and 33.33%, respectively, while requiring fewer samples.

Abstract:
Multi-agent path finding (MAPF) involves planning efficient paths for multiple agents to move simultaneously while avoiding collisions. In typical warehouse environments, agents are often sparsely distributed along aisles; however, increasing the agent density can improve space efficiency. When the agent density is high, it becomes necessary to optimize the paths not only for goal-assigned agents but also for those obstructing them. This study proposes a novel MAPF framework for high-density environments (MAPF-HD). Several studies have explored MAPF in similar settings using integer linear programming (ILP). However, ILP-based methods require substantial computation time to optimize all agent paths simultaneously. Even in small grid-based environments with fewer than 100 cells, these computations can take tens to hundreds of seconds. Such high computational costs render these methods impractical for large-scale applications such as automated warehouses and valet parking. To address these limitations, we introduce the phased null-agent swapping (PHANS) method. PHANS employs a heuristic approach to incrementally swap positions between agents and empty vertices. This method solves the MAPF-HD problem within a few seconds, even in large environments containing more than 700 cells. The proposed method has the potential to improve efficiency in various real-world applications such as warehouse logistics, traffic management, and crowd control.

Abstract:
Accurate LiDAR-camera calibration is crucial for multi-sensor systems. However, traditional methods often rely on physical targets, which are impractical for real-world deployment. Moreover, even carefully calibrated extrinsics can degrade over time due to sensor drift or external disturbances, necessitating periodic recalibration. To address these challenges, we present a Targetless LiDAR-Camera Calibration (TLC-Calib) that jointly optimizes sensor poses with a neural Gaussian-based scene representation. Reliable LiDAR points are frozen as anchor Gaussians to preserve global structure, while auxiliary Gaussians prevent local overfitting under noisy initialization. Our fully differentiable pipeline with photometric and geometric regularization achieves robust and generalizable calibration, consistently outperforming existing targetless methods on the KITTI-360, Waymo, and Fast-LIVO2 datasets. In addition, it yields more consistent Novel View Synthesis results, reflecting improved extrinsic alignment. The project page is available at: https://www.haebeom.com/tlc-calib-site/.

Abstract:
We present a novel recursive Bayesian estimation framework using B-splines for continuous-time 6-DoF dynamic motion estimation. The state vector consists of a recurrent set of position control points and orientation control point increments, enabling efficient estimation via a modified iterated extended Kalman filter without involving error-state formulations. The resulting recursive spline estimator (RESPLE) is further leveraged to develop a versatile suite of direct LiDAR-based odometry solutions, supporting the integration of one or multiple LiDARs and an IMU. We conduct extensive real-world evaluations using public datasets and our own experiments, covering diverse sensor setups, platforms, and environments. Compared to existing systems, RESPLE achieves comparable or superior estimation accuracy and robustness, while attaining real-time efficiency. Our results and analysis demonstrate RESPLE's strength in handling highly dynamic motions and complex scenes within a lightweight and flexible design, showing strong potential as a universal framework for multi-sensor motion estimation. We release the source code and experimental datasets at https://github.com/ASIG-X/RESPLE.

Abstract:
Sequential robot manipulation tasks require finding collision-free trajectories that satisfy geometric constraints across multiple object interactions in potentially high-dimensional configuration spaces. Solving these problems in real-time and at large scales has remained out of reach due to computational requirements. Recently, GPU-based acceleration has shown promising results, but prior methods achieve limited performance due to CPU-GPU data transfer overhead and complex logic that prevents full hardware utilization. To this end, we present SPaSM (Sampling Particle optimization for Sequential Manipulation), a fully GPU-parallelized framework that compiles constraint evaluation, sampling, and gradient-based optimization into optimized CUDA kernels for end-to-end trajectory optimization without CPU coordination. The method consists of a two-stage particle optimization strategy: first solving placement constraints through massively parallel sampling, then lifting solutions to full trajectory optimization in joint space. Unlike hierarchical approaches, SPaSM jointly optimizes object placements and robot trajectories to handle scenarios where motion feasibility constrains placement options. Experimental evaluation on challenging benchmarks demonstrates solution times in the realm of milliseconds with a 100% success rate; a 4000x speedup compared to existing approaches. Code and examples are available at commalab.org/papers/spasm.

Abstract:
Recent advances in vision-language models have made zero-shot navigation feasible, enabling robots to interpret and follow natural language instructions without requiring labeling. However, existing methods that explicitly store language vectors in grid or node-based maps struggle to scale to large environments due to excessive memory requirements and limited resolution for fine-grained planning. We introduce LAMP (Language Map), a novel neural language field-based navigation framework that learns a continuous, language-driven map and directly leverages it for fine-grained path generation. Unlike prior approaches, our method encodes language features as an implicit neural field rather than storing them explicitly at every location. By combining this implicit representation with a sparse graph, LAMP supports efficient coarse path planning and then performs gradient-based optimization in the learned field to refine poses near the goal. Our two-stage pipeline of coarse graph search followed by language-driven, gradient-guided optimization is the first application of an implicit language map for precise path generation. This refinement is particularly effective at selecting goal regions not directly observed by leveraging semantic similarities in the learned feature space. To further enhance robustness, we adopt a Bayesian framework that models embedding uncertainty via the von Mises-Fisher distribution, thereby improving generalization to unobserved regions. To scale to large environments, LAMP employs a graph sampling strategy that prioritizes spatial coverage and embedding confidence, retaining only the most informative nodes and substantially reducing computational overhead. Our experimental results, both in NVIDIA Isaac Sim and on a real multi-floor building, demonstrate that LAMP outperforms existing explicit methods in both memory efficiency and fine-grained goal-reaching accuracy, opening new possibilities for scalable, language-driven robot navigation.

Abstract:
Sequentially solving similar optimization problems under strict runtime constraints is essential for many applications, such as robot control, autonomous driving, and portfolio management. The performance of local optimization methods in these settings is sensitive to the initial solution: poor initialization can lead to slow convergence or suboptimal solutions. To address this challenge, we propose learning to predict multiple diverse initial solutions given parameters that define the problem instance. We introduce two strategies for utilizing multiple initial solutions: (i) a single-optimizer approach, where the most promising initial solution is chosen using a selection function, and (ii) a multiple-optimizers approach, where several optimizers, potentially run in parallel, are each initialized with a different solution, with the best solution chosen afterward. Notably, by including a default initialization among the predicted ones, the cost of the final output is guaranteed to be equal to or lower than with the default initialization. We validate our method on three optimal control benchmark tasks: cart-pole, reacher, and autonomous driving, using different optimizers: DDP, MPPI, and iLQR. We find significant and consistent improvement with our method across all evaluation settings and demonstrate that it efficiently scales with the number of initial solutions required.
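
The multiple-optimizers strategy and its "never worse than the default" property can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the one-dimensional cost `toy_cost`, the plain gradient-descent `local_optimizer`, and the hand-picked "predicted" initial solutions stand in for the learned predictor and for optimizers such as DDP, MPPI, or iLQR, none of which are reproduced.

```python
from typing import Callable, Sequence, Tuple


def toy_cost(x: float) -> float:
    """Hypothetical non-convex cost with a local and a global minimum."""
    return x**4 - 3.0 * x**2 + x


def toy_grad(x: float) -> float:
    """Analytic derivative of toy_cost."""
    return 4.0 * x**3 - 6.0 * x + 1.0


def local_optimizer(x0: float, lr: float = 0.01, steps: int = 200) -> float:
    """Plain gradient descent; a stand-in for any local optimizer."""
    x = x0
    for _ in range(steps):
        x -= lr * toy_grad(x)
    return x


def best_of_multiple_inits(
    cost: Callable[[float], float],
    optimize: Callable[[float], float],
    default_init: float,
    predicted_inits: Sequence[float],
) -> Tuple[float, float]:
    """Run one optimizer per initial solution and keep the cheapest result.

    The default initialization is always a candidate, so the returned cost
    can never exceed the cost obtained from the default alone.
    """
    candidates = [default_init, *predicted_inits]
    solutions = [optimize(x0) for x0 in candidates]
    costs = [cost(x) for x in solutions]
    best = min(range(len(solutions)), key=costs.__getitem__)
    return solutions[best], costs[best]


if __name__ == "__main__":
    default_only = toy_cost(local_optimizer(2.0))
    x_best, c_best = best_of_multiple_inits(
        toy_cost, local_optimizer, default_init=2.0, predicted_inits=[-2.0, 0.5]
    )
    print(f"default only: {default_only:.3f}, best of three: {c_best:.3f}")
```

In this toy problem the default initialization converges to the local minimum, while one of the additional initial solutions reaches the global one; the runs are independent, so they could be executed in parallel.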

Abstract:
The recent trend in scaling models for robot learning has resulted in impressive policies that can perform various manipulation tasks and generalize to novel scenarios. However, these policies continue to struggle with following instructions, likely due to the limited linguistic and action sequence diversity in existing robotics datasets. This paper introduces Task Robustness via RE-Labelling Vision-Action Robot Data (TREAD), a scalable framework that leverages large Vision-Language Models (VLMs) to augment existing robotics datasets without additional data collection, harnessing the transferable knowledge embedded in these models. Our approach leverages a pretrained VLM through three stages: generating semantic sub-tasks from original instruction labels and initial scenes, segmenting demonstration videos conditioned on these sub-tasks, and producing diverse instructions that incorporate object properties, effectively decomposing longer demonstrations into grounded language-action pairs. We further enhance robustness by augmenting the data with linguistically diverse versions of the text goals. Evaluations on LIBERO demonstrate that policies trained on our augmented datasets exhibit improved performance on novel, unseen tasks and goals. Our results show that TREAD enhances both planning generalization through trajectory decomposition and language-conditioned policy generalization through increased linguistic diversity. Project website: https://akuramshin.github.io/tread.

Abstract:
Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed motion pushing the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, e.g., progress on the raceline, in particular when the complexity of the environment increases, e.g., in the presence of obstacles. Moreover, we find that multi-agent competition yields policies that transfer more reliably to the real world than policies trained with a single-agent progress-based reward, despite the two methods using the same simulation environment, randomization strategy, and hardware. In addition to improved sim-to-real transfer, the multi-agent policies also exhibit some degree of generalization to opponents unseen at training time. Overall, our work, following in the tradition of multi-agent competitive game-play in digital domains, shows that sparse task-level rewards are sufficient for training agents capable of advanced low-level control in the physical world.

Abstract:
We propose a framework for active mapping and exploration that leverages Gaussian splatting for constructing dense maps. Further, we develop a GPU-accelerated motion planning algorithm that can exploit the Gaussian map for real-time navigation. The Gaussian map constructed onboard the robot is optimized for both photometric and geometric quality while enabling real-time situational awareness for autonomy. We show through viewpoint selection experiments that our method yields comparable Peak Signal-to-Noise Ratio (PSNR) and similar reconstruction error to state-of-the-art approaches, while being orders of magnitude faster to compute. In closed-loop physics-based simulation and real-world experiments, our algorithm achieves better map quality (at least 0.8 dB higher PSNR and more than 16% higher geometric reconstruction accuracy) than maps constructed by a state-of-the-art method, enabling semantic segmentation using off-the-shelf open-set models.
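
For readers unfamiliar with the photometric metric quoted above, the following is a minimal sketch of the standard PSNR definition, 10 log10(MAX^2 / MSE). It assumes images flattened to lists of intensities in [0, 1]; the function name `psnr` and the `max_value` parameter are illustrative and are not taken from the paper's evaluation code.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two equally sized images."""
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value**2 / mse)


if __name__ == "__main__":
    print(psnr([0.0, 1.0], [0.1, 0.9]))
    # MSE ratio implied by a 0.8 dB PSNR gain at a fixed peak value.
    print(10.0 ** (-0.8 / 10.0))
```

Because PSNR is logarithmic in the mean squared error, a 0.8 dB gain at a fixed peak value corresponds to an MSE that is about 17% lower.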

Abstract:
Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity, and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC, and Fishyscapes. Code will be released upon acceptance.

Abstract:
Robots must satisfy safety-critical state and input constraints despite disturbances and model mismatch. We introduce a robust model predictive control (RMPC) formulation that is scalable and compatible with real-time implementation. Our formulation guarantees robust constraint satisfaction, input-to-state stability (ISS) and recursive feasibility. The key idea is to decompose the uncertain nonlinear system into (i) a nominal nonlinear dynamic model, (ii) disturbance-feedback controllers, and (iii) bounds on the model error. These components are optimized jointly using sequential convex programming. The resulting convex subproblems are solved using a recent disturbance-feedback MPC solver. The approach is validated across multiple dynamics, including a rocket-landing problem with steerable thrust. An open-source implementation is available at https://github.com/antoineleeman/robust-nonlinear-mpc.

Abstract:
Dexterous robotic hands enable robots to perform complex manipulations that require fine-grained control and adaptability. Achieving such manipulation is challenging because the high degrees of freedom tightly couple hand and arm motions, making learning and control difficult. Successful dexterous manipulation relies not only on precise hand motions, but also on accurate spatial positioning of the arm and coordinated arm-hand dynamics. However, most existing visuomotor policies represent arm and hand actions in a single combined space, which often causes high-dimensional hand actions to dominate the coupled action space and compromise arm control. To address this, we propose DQ-RISE, which quantizes hand states to simplify hand motion prediction while preserving essential patterns, and applies a continuous relaxation that allows arm actions to diffuse jointly with these compact hand states. This design enables the policy to learn arm-hand coordination from data while preventing hand actions from overwhelming the action space. Experiments show that DQ-RISE achieves more balanced and efficient learning, paving the way toward structured and generalizable dexterous manipulation. Project website: https://rise-policy.github.io/DQ-RISE/.

Abstract:
We introduce M3CAD, a comprehensive benchmark designed to advance research in generic cooperative autonomous driving. M3CAD comprises 204 sequences with 30,000 frames. Each sequence includes data from multiple vehicles and different types of sensors, e.g., LiDAR point clouds, RGB images, and GPS/IMU, supporting a variety of autonomous driving tasks, including object detection and tracking, mapping, motion forecasting, occupancy prediction, and path planning. This rich multimodal setup enables M3CAD to support both single-vehicle and multi-vehicle cooperative autonomous driving research. To the best of our knowledge, M3CAD is the most complete benchmark specifically designed for cooperative, multi-task autonomous driving research. To test its effectiveness, we use M3CAD to evaluate both state-of-the-art single-vehicle and cooperative driving solutions, setting baseline performance results. Since most existing cooperative perception methods focus on merging features but often ignore network bandwidth requirements, we propose a new multi-level fusion approach which adaptively balances communication efficiency and perception accuracy based on the current network conditions. We release M3CAD, along with the baseline models and evaluation results, to support the development of robust cooperative autonomous driving systems. All resources will be made publicly available on our project webpage.

Abstract:
Fisheye cameras are increasingly adopted in robotics for near-field manipulation, navigation, and immersive perception, yet indoor depth benchmarks with accurate ground truth are still missing. To address this, we introduce WideDepth, the first indoor dataset for fisheye depth estimation, featuring 101 scenes containing 5K high-resolution stereo pairs labeled with millimeter-level ground truth depth and disparity. Our dataset also includes paired pinhole and fisheye samples across varying fields of view and baselines in both horizontal and vertical stereo setups. We further propose a method to adapt pinhole-trained stereo models to fisheye images and introduce a novel stereo fisheye image generation pipeline based on high-resolution LiDAR scans. Leveraging these methods, we thoroughly evaluate state-of-the-art monocular depth, stereo matching, and depth completion models on our benchmark. Additionally, we provide 18K LiDAR-derived sparse depth training samples, achieving up to a 62% performance boost on fisheye data when fine-tuning pinhole-based stereo models. In summary, the high precision and versatility of our benchmark set a strong foundation for advancing research in fisheye depth estimation and robotics perception. Project page: ilyaind.github.io/WideDepth

Abstract:
We introduce SoftMimic, a framework for learning compliant whole-body control policies for humanoid robots from example motions. Imitating human motions with reinforcement learning allows humanoids to quickly learn new skills, but existing methods incentivize stiff control that aggressively corrects deviations from a reference motion, leading to brittle and unsafe behavior when the robot encounters unexpected contacts. In contrast, SoftMimic enables robots to respond compliantly to external forces while maintaining balance and posture. Our approach leverages an inverse kinematics solver to generate an augmented dataset of feasible compliant motions, which we use to train a reinforcement learning policy. By rewarding the policy for matching compliant responses rather than rigidly tracking the reference motion, SoftMimic learns to absorb disturbances and generalize to varied tasks from a single motion clip. We validate our method through simulations and real-world experiments, demonstrating safe and effective interaction with the environment.

Abstract:
Functional dexterous grasping requires precise hand-object interaction, going beyond simple gripping. Existing affordance-based methods primarily predict coarse interaction regions and cannot directly constrain the grasping posture, leading to a disconnection between visual perception and manipulation. To address this issue, we propose a multi-keypoint affordance representation for functional dexterous grasping, which directly encodes task-driven grasp configurations by localizing functional contact points. Our method introduces Contact-guided Multi-Keypoint Affordance (CMKA), leveraging human grasping experience images for weak supervision combined with Large Vision Models for fine affordance feature extraction, achieving generalization while avoiding manual keypoint annotations. Additionally, we present a Keypoint-based Grasp matrix Transformation (KGT) method, ensuring spatial consistency between hand keypoints and object contact points, thus providing a direct link between visual perception and dexterous grasping actions. Experiments on public real-world FAH datasets, IsaacGym simulation, and challenging robotic tasks demonstrate that our method significantly improves affordance localization accuracy, grasp consistency, and generalization to unseen tools and tasks, bridging the gap between visual affordance learning and dexterous robotic manipulation. The source code and demo videos are publicly available at https://github.com/PopeyePxx/MKA.

Abstract:
Vision-language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics and correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation.
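
One ingredient of the trace score, the Dynamic Time Warping distance between a predicted and an expert trace, can be sketched as follows. This is the textbook DTW recurrence with Euclidean point distances, written under the assumption that traces are lists of (x, y) pixel coordinates. The benchmark's actual normalization, and how it combines DTW with endpoint error and semantic penalties, is not specified in the abstract and is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def dtw_distance(trace_a: Sequence[Point], trace_b: Sequence[Point]) -> float:
    """Classic O(n*m) Dynamic Time Warping distance between two 2D traces."""
    n, m = len(trace_a), len(trace_b)
    if n == 0 or m == 0:
        raise ValueError("traces must be non-empty")
    # cost[i][j] = minimal accumulated cost aligning the first i points of
    # trace_a with the first j points of trace_b.
    cost: List[List[float]] = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(trace_a[i - 1], trace_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance only in trace_a
                cost[i][j - 1],      # advance only in trace_b
                cost[i - 1][j - 1],  # advance in both traces
            )
    return cost[n][m]


if __name__ == "__main__":
    expert = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    predicted = [(0.0, 0.0), (2.0, 0.0)]
    print(dtw_distance(expert, predicted))
```

DTW tolerates traces sampled with different numbers of points, which is presumably why it suits comparing model outputs against expert annotations of varying length.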

Abstract:
Accurate visual state estimation has been a central topic in robotics with a wide range of applications in robot navigation, autonomous driving, and autonomous flight. Recent advances in robot perception have led to significant improvements in the accuracy and robustness of state estimation, yet a fundamental challenge remains in how to quantify and calibrate its precision, i.e., how confident we are in an estimate and whether failures can be detected. This issue is particularly pronounced in visual-inertial odometry (VIO), where the heteroscedastic and multimodal nature of the problem makes uncertainty quantification especially difficult. This paper introduces MUSE (Multimodal Uncertainty Quantification of State Estimation), a novel real-time learning-based framework that leverages the strong and efficient sequential modeling capacity of Mamba to estimate localization uncertainty from multiple asynchronous sensor streams. Experiments on both public and in-house datasets demonstrate that MUSE achieves superior reliability and robustness compared to existing uncertainty quantification methods, and ablation studies justify the benefits of its key design choices. We release our source code and dataset at https://github.com/hungdche/MUSE.

Abstract:
We introduce PhysWorld, a framework that enables robot learning from video generation through physical world modeling. Recent video generation models can synthesize photorealistic visual demonstrations from language commands and images, offering a powerful yet underexplored source of training signals for robotics. However, directly retargeting pixel motions from generated videos to robots neglects physics, often resulting in inaccurate manipulations. PhysWorld addresses this limitation by coupling video generation with physical world reconstruction. Given a single image and a task command, our method generates task-conditioned videos and reconstructs the underlying physical world from the videos, and the generated video motions are grounded into physically accurate actions through object-centric residual reinforcement learning with the physical world model. This synergy transforms implicit visual guidance into physically executable robotic trajectories, eliminating the need for real robot data collection and enabling zero-shot generalizable robotic manipulation. Experiments on diverse real-world tasks demonstrate that PhysWorld substantially improves manipulation accuracy compared to previous approaches. Visit the project webpage at https://pointscoder.github.io/PhysWorld_Web/ for details.

Abstract:
This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates closing the loop of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) sim2real visual reinforcement learning. We plan to open-source both the simulation assets and code.

Abstract:
Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding.

Abstract:
Intelligent agents progress by continually refining their capabilities through actively exploring environments. Yet robot policies often lack sufficient exploration capability due to action mode collapse. Existing methods that encourage exploration typically rely on random perturbations, which are unsafe and induce unstable, erratic behaviors, thereby limiting their effectiveness. We propose Self-Improvement via On-Manifold Exploration (SOE), a framework that enhances policy exploration and improvement in robotic manipulation. SOE learns a compact latent representation of task-relevant factors and constrains exploration to the manifold of valid actions, ensuring safety, diversity, and effectiveness. It can be seamlessly integrated with arbitrary policy models as a plug-in module, augmenting exploration without degrading the base policy performance. Moreover, the structured latent space enables human-guided exploration, further improving efficiency and controllability. Extensive experiments in both simulation and real-world tasks demonstrate that SOE consistently outperforms prior methods, achieving higher task success rates, smoother and safer exploration, and superior sample efficiency. These results establish on-manifold exploration as a principled approach to sample-efficient policy self-improvement.

Abstract:
As the foundation of closed-loop training and evaluation in autonomous driving, traffic simulation still faces two fundamental challenges: covariate shift introduced by open-loop imitation learning and limited capacity to reflect the multimodal behaviors observed in real-world traffic. Although recent frameworks such as RIFT have partially addressed these issues through group-relative optimization, their forward simulation procedures remain largely non-reactive, leading to unrealistic agent interactions within the virtual domain and ultimately limiting simulation fidelity. To address these issues, we propose ForSim, a stepwise closed-loop forward simulation paradigm. At each virtual timestep, the traffic agent propagates the virtual candidate trajectory that best spatiotemporally matches the reference trajectory through physically grounded motion dynamics, thereby preserving multimodal behavioral diversity while ensuring intra-modality consistency. Other agents are updated with stepwise predictions, yielding coherent and interaction-aware evolution. When incorporated into the RIFT traffic simulation framework, ForSim operates in conjunction with group-relative optimization to fine-tune traffic policy. Extensive experiments confirm that this integration consistently improves safety while maintaining efficiency, realism, and comfort. These results underscore the importance of modeling closed-loop multimodal interactions within forward simulation and enhance the fidelity and reliability of traffic simulation for autonomous driving.

Abstract:
Accurately predicting pedestrian motion is crucial for safe and reliable autonomous driving in complex urban environments. In this work, we present a 3D vehicle-conditioned pedestrian pose forecasting framework that explicitly incorporates surrounding vehicle information. To support this, we enhance the Waymo-3DSkelMo dataset with aligned 3D vehicle bounding boxes, enabling realistic modeling of multi-agent pedestrian-vehicle interactions. We introduce a sampling scheme to categorize scenes by pedestrian and vehicle count, facilitating training across varying interaction complexities. Our proposed network adapts the TBIFormer architecture with a dedicated vehicle encoder and pedestrian-vehicle interaction cross-attention module to fuse pedestrian and vehicle features, allowing predictions to be conditioned on both historical pedestrian motion and surrounding vehicles. Extensive experiments demonstrate substantial improvements in forecasting accuracy and validate different approaches for modeling pedestrian-vehicle interactions, highlighting the importance of vehicle-aware 3D pose prediction for autonomous driving.

Abstract:
This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model to predict future frames based on an image and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.

Abstract:
We provide a sober look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving, challenging common assumptions about their ability to interpret dynamic driving scenarios. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored. Our experimental study assesses various MLLMs as world models using in-car camera perspectives and reveals that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding (i) ego vehicle dynamics, (ii) interactions with other road actors, (iii) trajectory planning, and (iv) open-set scene reasoning. We introduce the Eval-LLM-Drive dataset and DriveSim simulator to enhance our evaluation, highlighting gaps in current MLLM capabilities and the need for improved models in dynamic real-world environments.

Abstract:
Scooping items with tools such as spoons and ladles is common in daily life, ranging from assistive feeding to retrieving items from environmental disaster sites. However, developing a general and autonomous robotic scooping policy is challenging since it requires reasoning about complex tool-object interactions. Furthermore, scooping often involves manipulating deformable objects, such as granular media or liquids, which is challenging due to their infinite-dimensional configuration spaces and complex dynamics. We propose a method, SCOOP'D, which uses simulation from OmniGibson (built on NVIDIA Omniverse) to collect scooping demonstrations using algorithmic procedures that rely on privileged state information. Then, we use generative policies via diffusion to imitate demonstrations from observational input. We directly apply the learned policy in diverse real-world scenarios, testing its performance on various item quantities, item characteristics, and container types. In zero-shot deployment, our method demonstrates promising results across 465 trials in diverse scenarios, including objects of different difficulty levels that we categorize as "Level 1" and "Level 2." SCOOP'D outperforms all baselines and ablations, suggesting that this is a promising approach to acquiring robotic scooping skills. Project page: https://scoopdiff.github.io/

Abstract:
In this work, we present SpaRC, a novel sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches utilize dense Bird's Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through object-centric methodology. However, these query-based approaches exhibit limitations in false positive detections and localization precision due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods requiring computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors. Our method achieves state-of-the-art performance of 67.1 NDS and 63.1 AMOTA. The code is available at https://phi-wol.github.io/sparc/.

Abstract:
This paper introduces SLIM-VDB, a new lightweight semantic mapping system with probabilistic semantic fusion for closed-set or open-set dictionaries. Advances in data structures from the computer graphics community, such as OpenVDB, have demonstrated significantly improved computational and memory efficiency in volumetric scene representation. Although OpenVDB has been used for geometric mapping in robotics applications, semantic mapping for scene understanding with OpenVDB remains unexplored. In addition, existing semantic mapping systems lack support for integrating both fixed-category and open-language label predictions within a single framework. In this paper, we propose a novel 3D semantic mapping system that leverages the OpenVDB data structure and integrates a unified Bayesian update framework for both closed- and open-set semantic fusion. Our proposed framework, SLIM-VDB, achieves significant reduction in both memory and integration times compared to current state-of-the-art semantic mapping approaches, while maintaining comparable mapping accuracy. An open-source C++ codebase with a Python interface will accompany the paper release.

Abstract:
Monocular depth estimation (MDE) from thermal images is a crucial technology for robotic systems operating in challenging conditions such as fog, smoke, and low light. The limited availability of labeled thermal data constrains the generalization capabilities of thermal MDE models compared to foundational RGB MDE models, which benefit from datasets of millions of images across diverse scenarios. To address this challenge, we introduce a novel pipeline that enhances thermal MDE through knowledge distillation from a versatile RGB MDE model. Our approach features a confidence-aware distillation method that utilizes the predicted confidence of the RGB MDE to selectively strengthen the thermal MDE model, capitalizing on the strengths of the RGB model while mitigating its weaknesses. Our method significantly improves the accuracy of the thermal MDE, independent of the availability of labeled depth supervision, and greatly expands its applicability to new scenarios. In our experiments on new scenarios without labeled depth, the proposed confidence-aware distillation method reduces the absolute relative error of thermal MDE by 22.88% compared to the baseline without distillation.
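
The reported metric and the general idea of confidence-aware distillation can be made concrete with a short sketch. `abs_rel` follows the standard absolute relative error definition, mean(|d_pred - d_gt| / d_gt) over pixels with valid ground truth. `confidence_weighted_l1` is purely a hypothetical illustration of how a teacher's predicted confidence could weight a distillation loss; the abstract does not specify the paper's actual loss, and all names here are assumptions.

```python
from typing import Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Absolute relative depth error over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0.0]
    if not pairs:
        raise ValueError("no valid ground-truth depth values")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def confidence_weighted_l1(student: Sequence[float], teacher: Sequence[float],
                           confidence: Sequence[float]) -> float:
    """Hypothetical distillation loss: L1 to the teacher, weighted by its confidence.

    Pixels where the RGB teacher is unsure contribute little, so the thermal
    student is pulled mainly toward predictions the teacher trusts.
    """
    total_weight = sum(confidence)
    if total_weight == 0.0:
        return 0.0
    weighted = sum(c * abs(s - t) for s, t, c in zip(student, teacher, confidence))
    return weighted / total_weight


if __name__ == "__main__":
    print(abs_rel([2.2, 4.0], [2.0, 5.0]))
    print(confidence_weighted_l1([1.0, 2.0], [2.0, 2.0], [0.0, 1.0]))
```

Read this way, the quoted 22.88% is a relative reduction of the AbsRel value with respect to the baseline without distillation.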

Abstract:
Vision-language-action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (e.g., r ∈ {4, 8}), while spectral analyses indicate VLAs may require much larger ranks (e.g., r ≈ 128) or near-full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (SelectPrune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores E(k) ≥ η, providing a direct link to approximation error via our spectral analysis. During training, η concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones (π0 and SmolVLA), LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters, and improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.
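
The energy-based choice of the active set can be sketched as follows. The sketch assumes that the energy is the normalized cumulative sum of squared router scores and that the active set is the smallest group of highest-scoring vectors whose energy reaches the target η. Both the normalization and the function name `select_active_set` are assumptions for illustration, since the abstract does not give the exact definition.

```python
from typing import List, Sequence


def select_active_set(scores: Sequence[float], eta: float) -> List[int]:
    """Indices of the smallest top-scoring set whose squared-score energy reaches eta.

    scores: nonnegative router scores, playing the role of singular values.
    eta:    energy target in (0, 1], as a fraction of the total squared score.
    """
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")
    if any(s < 0.0 for s in scores):
        raise ValueError("scores must be nonnegative")
    total = sum(s * s for s in scores)
    if total == 0.0:
        return []
    # Visit vectors from the largest score to the smallest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    active: List[int] = []
    cumulative = 0.0
    for i in order:
        active.append(i)
        cumulative += scores[i] ** 2
        if cumulative / total >= eta:
            break
    return active


if __name__ == "__main__":
    router_scores = [0.5, 3.0, 0.1, 1.0]
    for eta in (0.85, 0.95, 0.99):
        print(eta, select_active_set(router_scores, eta))
```

With this rule the effective rank adapts to the input: when the scores concentrate on a few vectors the active set is small, and when they are spread out more vectors are kept.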

Abstract:
This paper investigates how the performance of visual navigation policies trained in simulation compares to policies trained with real-world data. Performance degradation of simulator-trained policies is often significant when they are evaluated in the real world. However, despite this well-known sim-to-real gap, we demonstrate that simulator-trained policies can match the performance of their real-world-trained counterparts. Central to our approach is a navigation policy architecture that bridges the sim-to-real appearance gap by leveraging pretrained visual representations and runs real-time on robot hardware. Evaluations on a wheeled mobile robot show that the proposed policy, when trained in simulation, outperforms its real-world-trained version by 31 and the prior state-of-the-art methods by 50 points in navigation success rate. Policy generalization is verified by deploying the same model onboard a drone. Our results highlight the importance of diverse image encoder pretraining for sim-to-real generalization, and identify on-policy learning as a key advantage of simulated training over training with real data. Code, model checkpoints and multimedia materials are available at https://lasuomela.github.io/faint/.

Abstract:
Falling is an inherent risk of humanoid mobility. Maintaining stability is therefore a primary safety focus in robot control and learning, yet no existing approach fully averts loss of balance. When instability does occur, prior work addresses only isolated aspects of falling: avoiding falls, choreographing a controlled descent, or standing up afterward. Consequently, humanoid robots lack integrated strategies for impact mitigation and prompt recovery when real falls defy these scripts. We aim to go beyond keeping balance to make the entire fall-and-recovery process safe and autonomous: Prevent falls when possible, reduce impact when unavoidable, and stand up when fallen. By fusing sparse human demonstrations with reinforcement learning and a diffusion-based memory of safe reactions, we learn whole-body behaviors that unify fall prevention, impact mitigation, and rapid recovery in a single policy. Experiments in simulation and on a Unitree G1 demonstrate robust sim-to-real transfer, lower impact forces, and consistently fast recovery across diverse disturbances, pointing toward safer, more resilient humanoids in real environments. Videos are available at https://firm2025.github.io.

Abstract:
Safety filters, particularly those based on control barrier functions, have gained increased interest as effective tools for safe control of dynamical systems. Existing correct-by-construction synthesis algorithms for such filters, however, suffer from the curse of dimensionality. Deep learning approaches have been proposed in recent years to address this challenge. In this paper, we add to this set of approaches an algorithm for training neural control barrier functions from offline datasets. Such functions can be used to design constraints for quadratic programs that are then used as safety filters. Our algorithm trains these functions so that the system is not only prevented from reaching unsafe states, but is also disincentivized from reaching out-of-distribution ones, where the learned functions would be less reliable. It is inspired by Conservative Q-learning, an offline reinforcement learning algorithm. We call its outputs Conservative Control Barrier Functions (CCBFs). Our empirical results demonstrate that CCBFs outperform existing methods in maintaining safety while minimally affecting task performance.
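
To make the quadratic-program safety filter concrete, the sketch below solves the simplest case in closed form: a single barrier constraint, a control-affine system, and no input bounds. It minimizes ||u - u_nom||² subject to L_f h + L_g h · u + α h ≥ 0. The function name, the linear class-K term `alpha * h`, and the plain numeric inputs are assumptions for illustration; in the paper's setting, h and its derivatives would come from the learned barrier function.

```python
def cbf_safety_filter(u_nom, lg_h, lf_h, h, alpha=1.0):
    """Minimally modify `u_nom` so that lf_h + lg_h . u + alpha * h >= 0.

    Closed-form solution of the single-constraint QP: if the nominal input
    already satisfies the constraint it is returned unchanged, otherwise it
    is projected onto the constraint boundary.
    """
    slack = lf_h + sum(a * u for a, u in zip(lg_h, u_nom)) + alpha * h
    if slack >= 0.0:
        return list(u_nom)  # nominal control is already safe

    norm_sq = sum(a * a for a in lg_h)
    if norm_sq == 0.0:
        raise ValueError("constraint cannot be satisfied by any control input")

    # Shift along lg_h by just enough to make the constraint active.
    scale = -slack / norm_sq
    return [u + scale * a for u, a in zip(u_nom, lg_h)]
```

With several constraints or input limits there is no such closed form in general, and a numerical QP solver would be used instead.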

Abstract:
Point cloud completion is critical for autonomous driving and robotic perception, yet deep learning models often experience severe performance degradation under the domain gap between synthetic training and real-world data. While unsupervised domain adaptation (UDA) has been explored to mitigate this issue, its reliance on access to source datasets limits practical applicability, as source data are often proprietary or restricted. We pioneer source-free domain adaptation (SFDA) for point cloud completion, which adapts a pre-trained source model to an unlabeled target domain without requiring source data access. To this end, we propose PointSFDA, a framework that combines global knowledge transfer with target-specific local adaptation. Specifically, we design (i) a Coarse-to-Fine Point Cloud Distillation module to extract domain-invariant global geometric priors from the source model, and (ii) a Partial-Mask Consistency Training strategy to enforce prediction consistency across masking augmentations, enabling self-supervised learning of local target-domain geometry. Experiments on real-world datasets (KITTI, ScanNet) and synthetic benchmarks (ModelNet40, 3D-FUTURE) demonstrate that PointSFDA achieves significant improvements over state-of-the-art methods in cross-domain shape completion, establishing a practical and scalable solution for robotics applications. Our code is available at https://github.com/Starak-x/PointSFDA.

Abstract:
Monocular omnidirectional visual odometry (OVO) systems leverage 360-degree cameras to overcome field-of-view limitations of perspective VO systems. However, existing methods, reliant on handcrafted features or photometric objectives, often lack robustness in challenging scenarios, such as aggressive motion and varying illumination. To address this, we present 360DVO, the first deep learning-based OVO framework. Our approach introduces a distortion-aware spherical feature extractor (DAS-Feat) that adaptively learns distortion-resistant features from 360-degree images. These sparse feature patches are then used to establish constraints for effective pose estimation within a novel omnidirectional differentiable bundle adjustment (ODBA) module. To facilitate evaluation in realistic settings, we also contribute a new real-world OVO benchmark. Extensive experiments on this benchmark and public synthetic datasets (TartanAir V2 and 360VO) demonstrate that 360DVO surpasses state-of-the-art baselines (including 360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%. Homepage: https://360dvo.hkustvgd.com

Abstract:
This paper proposes a semidefinite relaxation for landmark-based localization with unknown data associations in planar environments. The proposed method simultaneously solves for the optimal robot states and data associations in a globally optimal fashion. Relative position measurements to a fixed set of known landmarks are used, but the data association is unknown in that the robot does not know which landmark each measurement is generated from. The relaxation is shown to be tight in a majority of cases for moderate noise levels. The proposed algorithm is compared to local Gauss-Newton baselines initialized at the dead-reckoned trajectory, and is shown to significantly improve convergence to the problem's global optimum in simulation and experiment. Accompanying software and supplementary material can be found at https://github.com/decargroup/certifiable_uda_loc.

Abstract:
Scaling mobile manipulation imitation learning is bottlenecked by expensive mobile robot teleoperation. We present Egocentric Mobile MAnipulation (EMMA), an end-to-end framework training mobile manipulation policies from human mobile manipulation data with static robot data, sidestepping mobile teleoperation. To accomplish this, we co-train the human full-body motion data with the static robot data. In our experiments across three real-world tasks, EMMA demonstrates comparable performance to baselines trained on teleoperated mobile robot data (Mobile ALOHA), achieving higher or equivalent task performance in full task success. We find that EMMA is able to generalize to new spatial configurations and scenes, and we observe positive performance scaling as we increase the hours of human data, opening new avenues for scalable robotic learning in real-world environments.

Abstract:
The evolution of Advanced Driver Assistance Systems (ADAS) has increased the need for robust and generalizable algorithms for multi-object tracking. Traditional statistical model-based tracking methods rely on predefined motion models and assumptions about system noise distributions. Although computationally efficient, they often lack adaptability to varying traffic scenarios and require extensive manual design and parameter tuning. To address these issues, we propose a novel 3D multi-object tracking approach for vehicles, HybridTrack, which integrates a data-driven Kalman Filter (KF) within a tracking-by-detection paradigm. In particular, it learns the transition residual and Kalman gain directly from data, which eliminates the need for manual motion and stochastic parameter modeling. Validated on the real-world KITTI dataset, HybridTrack achieves 82.72% HOTA accuracy, significantly outperforming state-of-the-art methods. We also evaluate our method under different configurations, achieving the fastest processing speed of 112 FPS. Consequently, HybridTrack eliminates the dependency on scene-specific designs while improving performance and maintaining real-time efficiency.

Abstract:
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM's self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code can be found on our project page: https://apex-bjut.github.io/Taga-VLM/.

Abstract:
Evaluating the safety of autonomous vehicles (AVs) requires diverse, safety-critical scenarios, with collisions being especially important yet rare and unsafe to collect in the real world. Therefore, the community has been focusing on generating safety-critical scenarios in simulation. However, controlling attributes such as collision type and time-to-accident (TTA) remains challenging. We introduce a new task called controllable collision scenario generation, where the goal is to produce trajectories that realize a user-specified collision type and TTA, to investigate the feasibility of automatically generating desired collision scenarios. To support this task, we present COLLIDE, a large-scale collision scenario dataset constructed by transforming real-world driving logs into diverse collisions, balanced across five representative collision types and different TTA intervals. We propose a framework that predicts Collision Pattern, a compact and interpretable representation that captures the spatial configuration of the ego and the adversarial vehicles at impact, before rolling out full adversarial trajectories. Experiments show that our approach outperforms strong baselines in both collision rate and controllability. Furthermore, generated scenarios consistently induce higher planner failure rates, revealing limitations of existing planners. We demonstrate that these scenarios fine-tune planners for robustness improvements, contributing to safer AV deployment in different collision scenarios.

Abstract:
We introduce the latest achievements and results of Visual S-Graphs (vS-Graphs), our open-source, real-time VSLAM framework that tightly couples map reconstruction to online 3D scene graph generation. vS-Graphs employs visual and depth cues to detect and localize building components, such as walls and ground surfaces, from which higher-level structural elements, including variant-shaped rooms and floors, are inferred. These entities are incorporated into an optimizable hierarchical 3D scene graph, jointly maintained with the SLAM pipeline, enabling richer map semantics and improved localization. The framework is publicly available at https://github.com/snt-arg/visual_sgraphs. We evaluated vS-Graphs on both public RGB-D benchmarks and our in-house SMapper dataset, which includes diverse multi-room indoor environments with LiDAR-derived ground truth. These evaluations focused on trajectory estimation, map quality, semantic structural detection, and runtime performance. The results highlight the potential of tightly coupling VSLAM with online hierarchical scene graph generation for richer, more structurally meaningful environmental understanding. In particular, the ability of vS-Graphs to infer higher-level layout entities from visually detected building components suggests a promising direction for bridging geometric mapping and semantic scene reasoning within a unified framework. Full evaluation results and figures are available on https://snt-arg.github.io/vsgraphs-results/.

Abstract:
Trajectory generation is a pivotal task in autonomous driving. Recent studies have introduced the autoregressive paradigm, leveraging the state transition model to approximate future trajectory distributions. This paradigm closely mirrors the real-world trajectory generation process and has achieved notable success. However, its potential is limited by the ineffective representation of realistic trajectories within the redundant state space. To address this limitation, we propose the Kinematic-Driven Generative Model for Realistic Agent Simulation (KiGRAS). Instead of modeling in the state space, KiGRAS factorizes the driving scene into action probability distributions at each time step, providing a compact space to represent realistic driving patterns. By establishing physical causality from actions (cause) to trajectories (effect) through the kinematic model, KiGRAS eliminates massive redundant trajectories. All states derived from actions in the cause space are constrained to be physically feasible. Furthermore, redundant trajectories representing identical action sequences are mapped to the same representation, reflecting their underlying actions. This approach significantly reduces task complexity and ensures physical feasibility. KiGRAS achieves state-of-the-art performance in Waymo's SimAgents Challenge, ranking first on the WOMD leaderboard with significantly fewer parameters than other models. The video documentation is available at https://kigras-mach.github.io/KiGRAS/.
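
The idea of deriving states from actions through a kinematic model can be illustrated with a simple rollout. The sketch below assumes a unicycle-style model with (acceleration, yaw rate) actions and a fixed time step. The abstract does not specify the model, so the action parameterization and the update order are assumptions made for illustration.

```python
import math


def rollout(initial_state, actions, dt):
    """Integrate a sequence of actions into a trajectory.

    initial_state: (x, y, heading, speed)
    actions: iterable of (acceleration, yaw_rate) pairs, one per time step
    Returns the list of states reached after each action.
    """
    x, y, heading, speed = initial_state
    trajectory = []
    for acceleration, yaw_rate in actions:
        # Update velocity and heading first, then move with the new values.
        speed += acceleration * dt
        heading += yaw_rate * dt
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        trajectory.append((x, y, heading, speed))
    return trajectory
```

Because every state is produced by the integrator, any action sequence yields a trajectory that is feasible under this model, and a trajectory is fully determined by its initial state and its actions.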

Abstract:
We present Galaxy Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark (spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation) demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxy Open-World Dataset, plays a critical role in achieving strong performance. Dataset, code and pretrained weights will be made publicly available.

Abstract:
Human videos are a scalable source of training data for robot learning. However, humans and robots significantly differ in embodiment, making many human actions infeasible for direct execution on a robot. Still, these demonstrations convey rich object-interaction cues and task intent. Our goal is to learn from this coarse guidance without transferring embodiment-specific, infeasible execution strategies. Recent advances in generative modeling tackle a related problem of learning from low-quality data. In particular, Ambient Diffusion is a recent method for diffusion modeling that incorporates low-quality data only at high-noise timesteps of the forward diffusion process. Our key insight is to view human actions as noisy counterparts of robot actions. As noise increases along the forward diffusion process, embodiment-specific differences fade away while task-relevant guidance is preserved. Based on these observations, we present X-Diffusion, a cross-embodiment learning framework based on Ambient Diffusion that selectively trains diffusion policies on noised human actions. This enables effective use of easy-to-collect human videos without sacrificing robot feasibility. Across five real-world manipulation tasks, we show that X-Diffusion improves average success rates by 16% over naive co-training and manual data filtering.
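
One simple way to picture training on human actions only at high noise levels is to restrict which diffusion timesteps a human sample may be drawn at. The sketch below uses a fixed cutoff `min_human_step`. This cutoff, the function name, and the uniform sampling are assumptions for illustration; the abstract does not say how the usable noise range is chosen.

```python
import random


def sample_training_timestep(is_human, num_steps, min_human_step, rng=random):
    """Draw a diffusion timestep for one training sample.

    Robot actions may be used at any timestep in [0, num_steps).
    Human actions are only used at timesteps >= min_human_step, where
    the added noise is assumed to mask embodiment-specific differences.
    """
    if not 0 <= min_human_step < num_steps:
        raise ValueError("min_human_step must lie in [0, num_steps)")
    lowest = min_human_step if is_human else 0
    return rng.randrange(lowest, num_steps)
```

The noised action at the sampled timestep would then enter the usual denoising loss, so human data shapes only the coarse, high-noise part of the policy.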

Abstract:
Accurate spatial-geometric perception remains fundamental to robotic grasping, yet physical artifacts in real depth maps like voids and noise establish a significant sim-to-real gap that critically impedes policy transfer. Training-time strategies such as procedural noise injection or learned mappings suffer from data inefficiency, either because unrealistic noise simulation is often ineffective for grasping tasks that require fine manipulation or because they depend heavily on paired datasets. Furthermore, leveraging foundation models to reduce the sim-to-real gap via intermediate representations fails to fully mitigate the domain shift and adds computational overhead during deployment. This work confronts the dual challenges of data inefficiency and deployment complexity. We propose DiffuDepGrasp, a deploy-efficient sim-to-real framework enabling zero-shot transfer through simulation-exclusive policy training. Its core innovation, the Diffusion Depth Generator, synthesizes geometrically pristine simulation depth with learned sensor-realistic noise via two synergistic modules. The first, the Diffusion Depth Module, leverages temporal geometric priors to enable sample-efficient training of a conditional diffusion model that captures complex sensor noise distributions, while the second, the Noise Grafting Module, preserves metric accuracy during perceptual artifact injection. Policies trained via our framework require only raw depth inputs during deployment, thus eliminating computational overhead. Extensive sim-to-real validation demonstrates 95.7% average success (SOTA) on 12-object grasping with zero-shot transfer and strong generalization to unseen objects, proving data efficiency and practical value.

Abstract:
Photorealistic 3D scene reconstruction plays an important role in autonomous driving, enabling the generation of novel data from existing datasets to simulate safety-critical scenarios and expand training data without additional acquisition costs. Gaussian Splatting (GS) facilitates real-time, photorealistic rendering with an explicit 3D Gaussian representation of the scene, providing faster processing and more intuitive scene editing than the implicit Neural Radiance Fields (NeRFs). While extensive GS research has yielded promising advancements in autonomous driving applications, they overlook two critical aspects: First, existing methods mainly focus on low-speed and feature-rich urban scenes and ignore the fact that highway scenarios play a significant role in autonomous driving. Second, while LiDARs are commonplace in autonomous driving platforms, existing methods learn primarily from images and use LiDAR only for initial estimates or without precise sensor modeling, thus missing out on leveraging the rich depth information LiDAR offers and limiting the ability to synthesize LiDAR data. In this paper, we propose a novel GS method for dynamic scene synthesis and editing with improved scene reconstruction through LiDAR supervision and support for LiDAR rendering. Unlike prior works that are tested mostly on urban datasets, to the best of our knowledge, we are the first to focus on the more challenging and highly relevant highway scenes for autonomous driving, with sparse sensor views and monotone backgrounds.

Abstract:
Hair care is an essential daily activity, yet it remains inaccessible to individuals with limited mobility and challenging for autonomous robot systems due to the fine-grained physical structure and complex dynamics of hair. In this work, we present DYMO-Hair, a model-based robot hair care system. We introduce a novel dynamics learning paradigm that is suited for volumetric quantities such as hair, relying on an action-conditioned latent state editing mechanism, coupled with a compact 3D latent space of diverse hairstyles to improve generalizability. This latent space is pre-trained at scale using a novel hair physics simulator, enabling generalization across previously unseen hairstyles. Using the dynamics model with a Model Predictive Path Integral (MPPI) planner, DYMO-Hair is able to perform visual goal-conditioned hair styling. Experiments in simulation demonstrate that DYMO-Hair's dynamics model outperforms baselines on capturing local deformation for diverse, unseen hairstyles. DYMO-Hair further outperforms baselines in closed-loop hair styling tasks on unseen hairstyles, with an average of 22% lower final geometric error and 42% higher success rate than the state-of-the-art system. Real-world experiments exhibit zero-shot transferability of our system to wigs, achieving consistent success on challenging unseen hairstyles where the state-of-the-art system fails. Together, these results introduce a foundation for model-based robot hair care, advancing toward more generalizable, flexible, and accessible robot hair styling in unconstrained physical environments.

Abstract:
One of the most important, yet challenging, skills for a dexterous robot is grasping a diverse range of objects. Much of the prior work has been limited by speed, generality, or reliance on depth maps and object poses. In this paper, we introduce DextrAH-RGB, a system that can perform dexterous arm-hand grasping end-to-end from RGB image input. We train a policy in simulation through reinforcement learning that acts on a geometric fabric controller to dexterously grasp a wide variety of objects. We then distill this into an RGB-based policy strictly in simulation using photorealistic tiled rendering. To our knowledge, this is the first work that is able to demonstrate robust sim-to-real transfer of an end-to-end (monocular or stereo) RGB-based policy for complex, dynamic, contact-rich tasks such as dexterous grasping with multi-fingered hands. Unlike previous methods, DextrAH-RGB requires no explicit depth or CAD models, making it significantly more practical and robust in varied real-world lighting and texture conditions. It generalizes to novel objects and scenes, offering a strong step toward deployable, vision-based dexterous manipulation.

Abstract:
The collection of large-scale and diverse robot demonstrations remains a major bottleneck for imitation learning, as real-world data acquisition is costly and simulators offer limited diversity and fidelity with pronounced sim-to-real gaps. While generative models present an attractive solution, existing methods often alter only visual appearances without creating new behaviors, or suffer from embodiment inconsistencies that yield implausible motions. To address these limitations, we introduce AnchorDream, an embodiment-aware world model that repurposes pretrained video diffusion models for robot data synthesis. AnchorDream conditions the diffusion process on robot motion renderings, anchoring the embodiment to prevent hallucination while synthesizing objects and environments consistent with the robot's kinematics. Starting from only a handful of human teleoperation demonstrations, our method scales them into large, diverse, high-quality datasets without requiring explicit environment modeling. Experiments show that the generated data leads to consistent improvements in downstream policy learning, with relative gains of 36.4% in simulator benchmarks and nearly double performance in real-world studies. These results suggest that grounding generative world models in robot motion provides a practical path toward scaling imitation learning.

Abstract:
Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate our dynamics model achieves over 2× better point track prediction accuracy compared to the prior state-of-the-art. In downstream policy learning, our dynamics predictions enable a 1.2-2.2× success rate improvement in low-data regimes, a 1.4× average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks with zero in-distribution action data. Beyond robotic control, we find the latent dynamics learned by AMPLIFY to enhance video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models.

Abstract:
The incorporation of world modeling into manipulation policy learning has pushed the boundary of manipulation performance. However, existing efforts simply model the 2D visual dynamics, which is insufficient for robust manipulation when target tasks involve prominent depth-wise movement. To address this, we present a 3D dynamics-aware manipulation framework that seamlessly integrates 3D world modeling and policy learning. Three self-supervised learning tasks (current depth estimation, future RGB-D prediction, 3D flow prediction) are introduced within our framework, which complement each other and endow the policy model with 3D foresight. Extensive experiments on simulation and the real world show that 3D foresight can greatly boost the performance of manipulation policies without sacrificing inference speed.

Abstract:
Recent advances in robot control methods, from end-to-end vision-language-action frameworks to modular systems with predefined primitives, have advanced robots' ability to follow natural language instructions. Nonetheless, many approaches still struggle to scale to diverse environments, as they often rely on large annotated datasets and offer limited interpretability. In this work, we introduce EmbodiedCoder, a training-free framework for open-world mobile robot manipulation that leverages coding models to directly generate executable robot trajectories. By grounding high-level instructions in code, EmbodiedCoder enables flexible object geometry parameterization and task trajectory synthesis without additional data collection or fine-tuning. This coding-based paradigm provides a transparent and generalizable way to connect perception with manipulation. Experiments on real mobile robots show that EmbodiedCoder achieves robust performance across diverse long-horizon tasks and generalizes effectively to unseen objects and environments. Our results demonstrate an interpretable approach for bridging high-level reasoning and low-level control, moving beyond fixed primitives toward versatile robot intelligence. See the project page at https://embodiedcoder.github.io/EmbodiedCoder/.

Abstract:
The reconstruction of surgical scenes from monocular endoscopic video is crucial for advancing robotic-assisted surgery, but applying state-of-the-art general-purpose reconstruction models is hindered by a severe lack of supervised training data and performance degradation over long sequences. To address these challenges, we propose SurgCUT3R, a systematic framework for adapting unified 4D reconstruction models to the surgical domain. Our approach makes three primary contributions. First, we introduce a data generation pipeline that leverages public stereo surgical datasets to create large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we employ a hybrid supervision strategy that combines our pseudo-GT with geometric self-correction to enhance robustness against inherent data imperfections. Third, we design a hierarchical inference framework that utilizes two specialized models (one for global stability and one for local accuracy) to significantly reduce accumulated pose drift in long videos. Experiments on the public SCARED and StereoMIS datasets demonstrate that our method achieves a highly competitive balance between accuracy and efficiency. It delivers near state-of-the-art pose estimation, offering a practical and effective solution for robust reconstruction in surgical environments.

Abstract:
In this paper, we derive a new Kalman filter (KF) with probabilistic data association between measurements and states. We formulate a variational inference problem to approximate the posterior density of the state conditioned on the measurement data. We view the unknown data association as a latent variable and apply Expectation Maximization (EM) to obtain a filter with the update step in the same form as the Kalman filter but with an expanded measurement vector of all potential associations. We show that the association probabilities can be computed as permanents of matrices with measurement likelihood entries. We name our probabilistic data association Kalman filter the PKF with P emphasizing both the probabilistic nature of the data association and the matrix permanent computation of the association weights. We compare PKF with the well-established Probabilistic Multi-Hypothesis Tracking (PMHT) and Joint Probabilistic Data Association Filter (JPDAF) in both theory and simulated experiments. The experiments show that we can achieve lower tracking errors than both. We also demonstrate the effectiveness of our filter in multi-object tracking (MOT) on multiple real-world datasets, including MOT17, MOT20, and DanceTrack. We can achieve comparable tracking results with previous KF-based methods without using velocities or doing multi-stage data association and remain real-time. We further show that our PKF can serve as a backbone for other KF-based trackers by applying it to a method that uses varieties of features for association, and improving its results.
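
The link between association probabilities and matrix permanents can be illustrated in the simplest setting: n tracks, n measurements, one-to-one assignments, and each assignment weighted by the product of its measurement likelihoods. Under those assumptions, the marginal probability that track i is paired with measurement j is L[i][j] times the permanent of L with row i and column j removed, divided by the permanent of L. Whether the paper's weights take exactly this form, and how it handles missed detections or clutter, is not stated in the abstract. The function names are hypothetical, and the brute-force permanent is only practical for small n.

```python
from itertools import permutations
from math import prod


def permanent(matrix):
    """Permanent of a square matrix by brute-force enumeration."""
    n = len(matrix)
    return sum(
        prod(matrix[i][p[i]] for i in range(n))
        for p in permutations(range(n))
    )


def minor(matrix, row, col):
    """Copy of `matrix` with the given row and column removed."""
    return [
        [value for j, value in enumerate(r) if j != col]
        for i, r in enumerate(matrix)
        if i != row
    ]


def association_probabilities(likelihood):
    """Marginal pairing probabilities under one-to-one assignment."""
    n = len(likelihood)
    total = permanent(likelihood)
    if total == 0:
        raise ValueError("no assignment has nonzero likelihood")
    return [
        [likelihood[i][j] * permanent(minor(likelihood, i, j)) / total
         for j in range(n)]
        for i in range(n)
    ]
```

Each row and each column of the result sums to one, as expected for the marginals of a distribution over one-to-one assignments.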

Abstract:
Simultaneous localization and mapping is a critical capability for autonomous systems. Traditional SLAM approaches often rely on visual or LiDAR sensors and face significant challenges in adverse conditions such as low light or featureless environments. To overcome these limitations, we propose a novel Doppler-aided radar-inertial and LiDAR-inertial SLAM framework that leverages the complementary strengths of 4D radar, FMCW LiDAR, and inertial measurement units. Our system integrates Doppler velocity measurements and spatial data into a tightly-coupled front-end and graph optimization back-end to provide enhanced ego velocity estimation, accurate odometry, and robust mapping. We also introduce a Doppler-based scan-matching technique to improve front-end odometry in dynamic environments. In addition, our framework incorporates an innovative online extrinsic calibration mechanism, utilizing Doppler velocity and loop closure to dynamically maintain sensor alignment. Extensive evaluations on both public and proprietary datasets show that our system significantly outperforms state-of-the-art radar-SLAM and LiDAR-SLAM frameworks in terms of accuracy and robustness. To encourage further research, the code of our Doppler-SLAM and our dataset are available at: https://github.com/Wayne-DWA/Doppler-SLAM.

Abstract:
Traditional Visual Simultaneous Localization and Mapping systems focus solely on static scene structures, overlooking dynamic elements in the environment. Although effective for accurate visual odometry in complex scenarios, these methods discard crucial information about moving objects. By incorporating this information into a Dynamic SLAM framework, the motion of dynamic entities can be estimated, enhancing navigation whilst ensuring accurate localization. However, the fundamental formulation of Dynamic SLAM remains an open challenge, with no consensus on the optimal approach for accurate motion estimation within a SLAM pipeline. Therefore, we developed DynoSAM, an open-source framework for Dynamic Objects SLAM that enables the efficient implementation, testing, and comparison of various Dynamic SLAM optimization formulations. We further propose a novel formulation that encodes a rigid-body motion model in object pose estimation, as well as an error metric agnostic to object frame definition. DynoSAM integrates static and dynamic measurements into a unified optimization problem solved using factor graphs, simultaneously estimating camera poses, static scene, object motion or poses, and object structures. We evaluate DynoSAM across diverse simulated and real-world datasets, achieving state-of-the-art motion estimation in indoor and outdoor environments, with substantial improvements over existing systems. Additionally, we demonstrate DynoSAM's contributions to downstream applications, including 3D reconstruction of dynamic scenes and trajectory prediction, thereby showcasing potential for advancing dynamic object-aware SLAM systems. Code is open-sourced.

Abstract:
In this work, we propose the use of Ground Penetrating Radar (GPR) for rover localization on Mars. Precise pose estimation is an important task for mobile robots exploring planetary surfaces, as they operate in GPS-denied environments. Although visual odometry (VO) provides accurate localization, it is computationally expensive and can fail in dim or high-contrast lighting. Wheel encoders can also provide odometry estimation, but are prone to slipping on the sandy terrain encountered on Mars. Although traditionally a scientific surveying sensor, GPR has been used on Earth for terrain classification and localization through subsurface feature matching. The Perseverance rover and the upcoming ExoMars rover are already equipped with GPR sensors to aid in the search for water and mineral resources. We propose to leverage GPR to aid in Mars rover localization. Specifically, we develop a novel GPR-based deep learning model that predicts 1-D relative pose translation. We fuse our GPR pose prediction method with inertial and wheel encoder data in a filtering framework to output rover localization. We perform experiments in a Mars analog environment and demonstrate that our GPR-based displacement predictions both outperform wheel encoders and improve multimodal filtering estimates in high-slip environments. Finally, we present the first dataset aimed at GPR-based localization in Mars analog environments, which will be made publicly available at: https://umfieldrobotics.github.io/marslgpr

Abstract:
Developing autonomous robots capable of learning and reproducing complex motions from demonstrations remains a fundamental challenge in robotics. On the one hand, movement primitives (MPs) provide a compact and modular representation of continuous trajectories. On the other hand, autonomous systems provide control policies that are time independent. We propose in this paper a simple and flexible approach that gathers the advantages of both representations by transforming MPs into autonomous systems. The key idea is to transform the explicit representation of a trajectory as an implicit shape encoded as a distance field. This conversion from a time-dependent motion to a spatial representation enables the definition of an autonomous dynamical system with modular reactions to perturbation. Asymptotic stability guarantees are provided by using Bernstein basis functions in the MPs, representing trajectories as concatenated quadratic Bézier curves, which provide an analytical method for computing distance fields. This approach bridges conventional MPs with distance fields, ensuring smooth and precise motion encoding, while maintaining a continuous spatial representation. By simply leveraging the analytic gradients of the curve and its distance field, a stable dynamical system can be computed to reproduce the demonstrated trajectories while handling perturbations, without requiring a model of the dynamical system to be estimated. Numerical simulations and real-world robotic experiments validate our method's ability to encode complex motion patterns while ensuring trajectory stability, together with the flexibility of designing the desired reaction to perturbations.
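
To make the link between quadratic Bézier curves and distance fields concrete, the sketch below computes the distance from a query point to a single quadratic Bézier segment. The closest point satisfies a cubic equation in the curve parameter; for brevity the roots are found with `numpy.roots` rather than a closed-form cubic solver. This is an illustrative sketch under those assumptions, not the authors' implementation, and the function names are hypothetical.

```python
import numpy as np


def bezier_point(P0, P1, P2, t):
    """Quadratic Bezier curve written with Bernstein basis functions."""
    return (1 - t) ** 2 * P0 + 2 * (1 - t) * t * P1 + t ** 2 * P2


def distance_to_quadratic_bezier(p, P0, P1, P2):
    """Return (distance, t*) from point p to the segment, with t* in [0, 1]."""
    p, P0, P1, P2 = (np.asarray(v, dtype=float) for v in (p, P0, P1, P2))
    # Power-basis form: curve(t) = A t^2 + 2 B t + P0
    A = P0 - 2 * P1 + P2
    B = P1 - P0
    D = P0 - p
    # Stationarity condition (curve(t) - p) . curve'(t) = 0 is a cubic in t.
    coeffs = [A @ A, 3 * (A @ B), 2 * (B @ B) + A @ D, B @ D]
    candidates = [0.0, 1.0]  # endpoints are always candidates
    for r in np.roots(coeffs):
        if abs(r.imag) < 1e-9 and 0.0 <= r.real <= 1.0:
            candidates.append(float(r.real))
    dists = [np.linalg.norm(bezier_point(P0, P1, P2, t) - p) for t in candidates]
    k = int(np.argmin(dists))
    return float(dists[k]), candidates[k]


# Query point above the middle of a straight (degenerate) segment.
d, t_star = distance_to_quadratic_bezier([1.0, 1.0], [0, 0], [1, 0], [2, 0])
print(d, t_star)
```

For a trajectory built from several concatenated segments, one would take the minimum of this distance over all segments; the gradient of the resulting field points away from the closest curve point.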

Abstract:
The contact-rich nature of manipulation makes it a significant challenge for robotic teleoperation. While haptic feedback is critical for contact-rich tasks, providing intuitive directional cues within wearable teleoperation interfaces remains a bottleneck. Existing solutions, such as non-directional vibrations from handheld controllers, provide limited information, while vibrotactile arrays are prone to perceptual interference. To address these limitations, we propose HapCompass, a novel, low-cost wearable haptic device that renders 2D directional cues by mechanically rotating a single linear resonant actuator (LRA). We evaluated HapCompass's ability to convey directional cues to human operators and showed that it increased the success rate, decreased the completion time and the maximum contact force for teleoperated manipulation tasks when compared to vision-only and non-directional feedback baselines. Furthermore, we conducted a preliminary imitation-learning evaluation, suggesting that the directional feedback provided by HapCompass enhances the quality of demonstration data and, in turn, the trained policy. We release the design of the HapCompass device along with the code that implements our teleoperation interface: https://ripl.github.io/HapCompass/.

Abstract:
Non-prehensile manipulation using onboard sensing presents a fundamental challenge: the manipulated object occludes the sensor's field of view, creating occluded regions that can lead to collisions. We propose CURA-PPO, a reinforcement learning framework that addresses this challenge by explicitly modeling uncertainty under partial observability. By predicting collision possibility as a distribution, we extract both risk and uncertainty to guide the robot's actions. The uncertainty term encourages active perception, enabling simultaneous manipulation and information gathering to resolve occlusions. When combined with confidence maps that capture observation reliability, our approach enables safe navigation despite severe sensor occlusion. Extensive experiments across varying object sizes and obstacle configurations demonstrate that CURA-PPO achieves up to 3 times higher success rates than the baselines, with learned behaviors that handle occlusions. Our method provides a practical solution for autonomous manipulation in cluttered environments using only onboard sensing.

Abstract:
Autonomous vehicles rely on HD maps for their operation, but offline HD maps eventually become outdated. For this reason, online HD map construction methods use live sensor data to infer map information instead. Research on real map changes shows that oftentimes entire parts of an HD map remain unchanged and can be used as a prior. We therefore introduce M3TR (Multi-Masking Map Transformer), a generalist approach for HD map completion both with and without offline HD map priors. As a necessary foundation, we address shortcomings in ground truth labels for Argoverse 2 and nuScenes and propose the first comprehensive benchmark for HD map completion. Unlike existing models that specialize in a single kind of map change, which is unrealistic for deployment, our Generalist model handles all kinds of changes, matching the effectiveness of Expert models. With our map masking as augmentation regime, we can even achieve a +1.4 mAP improvement without a prior. Finally, by fully utilizing prior HD map elements and optimizing query designs, M3TR outperforms existing methods by +4.3 mAP while being the first real-world deployable model for offline HD map priors.

Abstract:
Inferring object motion representations from observations enhances the performance of robotic manipulation tasks. This paper introduces a new paradigm for robot imitation learning that generates action sequences by reasoning about object motion from visual observations. We propose MBA, a novel module that employs two cascaded diffusion processes for object motion generation and robot action generation under object motion guidance. MBA first predicts the future pose sequence of the object based on observations, then uses this sequence as a condition to guide robot action generation. Designed as a plug-and-play component, MBA can be flexibly integrated into existing robotic manipulation policies with diffusion action heads. Extensive experiments in both simulated and real-world environments demonstrate that our approach substantially improves the performance of existing policies across a wide range of manipulation tasks. Project page: https://selen-suyue.github.io/MBApage/

Abstract:
Accurate trajectory prediction can improve General Aviation safety in non-towered terminal airspace, where high traffic density increases accident risk. We present ASCENT, a lightweight transformer-based model for multimodal 3D aircraft trajectory forecasting, which integrates domain-aware 3D coordinate normalization and parameterized predictions. ASCENT employs a transformer-based motion encoder and a query-based decoder, enabling the generation of diverse maneuver hypotheses with low latency. Experiments on the TrajAir and TartanAviation datasets demonstrate that our model outperforms prior baselines, as the encoder effectively captures motion dynamics and the decoder aligns with structured aircraft traffic patterns. Furthermore, ablation studies confirm the contributions of the decoder design, coordinate-frame modeling, and parameterized outputs. These results establish ASCENT as an effective approach for real-time aircraft trajectory prediction in non-towered terminal airspace.

Abstract:
Simulation-based learning has enabled policies for precise, contact-rich tasks (e.g., robotic assembly) to reach high success rates (~80%) under high levels of observation noise and control error. Although such performance may be sufficient for research applications, it falls short of industry standards and makes policy chaining exceptionally brittle. A key limitation is the high variance in individual policy performance across diverse initial conditions. We introduce Refinery, an effective framework that bridges this performance gap, robustifying policy performance across initial conditions. We propose Bayesian Optimization-guided fine-tuning to improve individual policies, and Gaussian Mixture Model-based sampling during deployment to select initializations that maximize execution success. Using Refinery, we improve mean success rates by 10.98% over state-of-the-art methods in simulation-based learning for robotic assembly, reaching 91.51% in simulation and comparable performance in the real world. Furthermore, we demonstrate that these fine-tuned policies can be chained to accomplish long-horizon, multi-part assembly, successfully assembling up to 8 parts without requiring explicit multi-step training.

Abstract:
Dexterous grasping with multi-fingered hands remains challenging due to high-dimensional articulations and the cost of optimization-based pipelines. Existing end-to-end methods require training on large-scale datasets for specific hands, limiting their ability to generalize across different embodiments. We propose MachaGrasp, an eigengrasp-based, end-to-end framework for cross-embodiment grasp generation. From a hand's morphology description, we derive a morphology embedding and an eigengrasp set. Conditioned on these, together with the object point cloud and wrist pose, an amplitude predictor regresses articulation coefficients in a low-dimensional space, which are decoded into full joint articulations. Articulation learning is supervised with a Kinematic-Aware Articulation Loss (KAL) that emphasizes fingertip-relevant motions and injects morphology-specific structure. In simulation on unseen objects across three dexterous hands, MachaGrasp attains a 91.9% average grasp success rate with <0.4s inference per grasp. With few-shot adaptation to an unseen hand, it achieves 85.6% success on unseen objects in simulation, and real-world experiments on this few-shot-generalized hand achieve an 87% success rate. The code and additional materials are available on our project website https://connor-zh.github.io/MachaGrasp/.

Abstract:
Long-term human motion prediction (LHMP) is important for the safe and efficient operation of autonomous robots and vehicles in environments shared with humans. Accurate predictions are important for applications including motion planning, tracking, human-robot interaction, and safety monitoring. In this paper, we exploit Maps of Dynamics (MoDs), which encode spatial or spatio-temporal motion patterns as environment features, to achieve LHMP for horizons of up to 60 seconds. We propose an MoD-informed LHMP framework that supports various types of MoDs and includes a ranking method to output the most likely predicted trajectory, improving practical utility in robotics. Further, a time-conditioned MoD is introduced to capture motion patterns that vary across different times of day. We evaluate MoD-LHMP instantiated with three types of MoDs. Experiments on two real-world datasets show that the MoD-informed method outperforms learning-based ones, with up to 50% improvement in average displacement error, and that the time-conditioned variant achieves the highest accuracy overall.

Abstract:
Language-conditioned robotic manipulation in open-world settings requires not only accurate task execution but also the ability to detect failures for robust deployment in real-world environments. Although recent advances in vision-language models (VLMs) have significantly improved the spatial reasoning and task-planning capabilities of robots, they remain limited in their ability to recognize their own failures. In particular, a critical yet underexplored challenge lies in detecting semantic misalignment errors, where the robot executes a task that is semantically meaningful but inconsistent with the given instruction. To address this, we propose a method for building datasets targeting Semantic Misalignment Failure detection from existing language-conditioned manipulation datasets. We also present I-FailSense, an open-source VLM framework with grounded arbitration designed specifically for failure detection. Our approach relies on fine-tuning a base VLM, followed by training lightweight classification heads, called FS blocks, which are attached to different internal layers of the VLM and whose predictions are aggregated using an ensembling mechanism. Experiments show that I-FailSense outperforms state-of-the-art VLMs, both comparable in size and larger, in detecting semantic misalignment errors. Notably, despite being trained only on semantic misalignment detection, I-FailSense generalizes to broader robotic failure categories and effectively transfers to other simulation environments. These findings highlight that leveraging VLM representations at multiple levels and training for semantic misalignment failure detection are key to effective and generalizable robotic failure detection. The datasets and models are publicly released on HuggingFace.

Abstract:
Conventional visual simultaneous localization and mapping (SLAM) algorithms often fail under rapid motion, low illumination, or abrupt lighting transitions due to motion blur and limited dynamic range. Event cameras mitigate these issues with high temporal resolution and high dynamic range (HDR), but their sparse, asynchronous outputs complicate feature extraction and integration with other sensors, e.g., inertial measurement units (IMUs) and standard cameras. We present Edged USLAM, a hybrid visual-inertial system that extends Ultimate SLAM (USLAM) with an edge-aware front-end and a lightweight depth module. The front-end enhances event frames for robust feature tracking and nonlinear motion compensation, while the depth module provides coarse, region-of-interest (ROI)-based scene depth to improve motion compensation and scale consistency. Evaluations across public benchmarks and real-world unmanned air vehicle (UAV) flights demonstrate that performance varies significantly by scenario. For instance, event-only methods like point-line event-based visual-inertial odometry (PL-EVIO) or learning-based pipelines such as deep event-based visual odometry (DEVO) excel in highly aggressive or extreme HDR conditions. In contrast, Edged USLAM provides superior stability and minimal drift in slow or structured trajectories, ensuring consistently accurate localization on real flights under challenging illumination. These findings highlight the complementary strengths of event-only, learning-based, and hybrid approaches, while positioning Edged USLAM as a robust solution for diverse aerial navigation tasks.

Abstract:
The end-to-end autonomous driving paradigm has recently attracted considerable attention due to its scalability. However, existing methods are constrained by the limited scale of real-world data, which hinders a comprehensive exploration of the scaling laws associated with end-to-end autonomous driving. To address this issue, we collected substantial data from various driving scenarios and behaviors and conducted an extensive study on the scaling laws of existing imitation learning-based end-to-end autonomous driving paradigms. Specifically, approximately 4 million demonstrations from 23 different scenario types were gathered, amounting to over 30,000 hours of driving demonstrations. We performed open-loop evaluations and closed-loop simulation evaluations in 1,400 diverse driving demonstrations (1,300 for open-loop and 100 for closed-loop) under stringent assessment conditions. Through experimental analysis, we discovered that (1) the performance of the driving model exhibits a power-law relationship with the amount of data, but this is not the case in closed-loop evaluation, and the inconsistency between the two assessments shifts our focus toward the distribution of data rather than merely expanding its volume; (2) a small increase in the quantity of long-tailed data can significantly improve the performance for the corresponding scenarios; and (3) appropriate scaling of data enables the model to achieve combinatorial generalization in novel scenes and actions. Our results highlight the critical role of data scaling in improving the generalizability of models across diverse autonomous driving scenarios, assuring safe deployment in the real world.

Abstract:
We present HI-SLAM2, a geometry-aware Gaussian SLAM system that achieves fast and accurate monocular scene reconstruction using only RGB input. Existing Neural SLAM or 3DGS-based SLAM methods often trade off between rendering quality and geometry accuracy; our research demonstrates that both can be achieved simultaneously with RGB input alone. The key idea of our approach is to enhance the ability for geometry estimation by combining easy-to-obtain monocular priors with learning-based dense SLAM, and then using 3D Gaussian splatting as our core map representation to efficiently model the scene. Upon loop closure, our method ensures on-the-fly global consistency through efficient pose graph bundle adjustment and instant map updates by explicitly deforming the 3D Gaussian units based on anchored keyframe updates. Furthermore, we introduce a grid-based scale alignment strategy to maintain improved scale consistency in prior depths for finer depth details. Through extensive experiments on Replica, ScanNet, Waymo Open, ETH3D SLAM and ScanNet++ datasets, we demonstrate significant improvements over existing Neural SLAM methods and even surpass RGB-D-based methods in both reconstruction and rendering quality.

Abstract:
Signal Temporal Logic (STL) offers a concise yet expressive framework for specifying and reasoning about spatio-temporal behaviors of robotic systems. Attractively, STL admits the notion of robustness, the degree to which an input signal satisfies or violates an STL specification, thus providing a nuanced evaluation of system performance. Notably, the differentiability of STL robustness enables direct integration into robotics workflows that rely on gradient-based optimization, such as trajectory optimization and deep learning. However, existing approaches to evaluating and differentiating STL robustness rely on recurrent computations, which become inefficient with longer sequences, limiting their use in time-sensitive applications. In this paper, we present STLCG++, a masking-based approach that parallelizes STL robustness evaluation and backpropagation across timesteps, achieving significant speed-ups compared to a recurrent approach. We also introduce a smoothing technique to enable differentiation of time interval bounds, expanding STL's applicability in gradient-based optimization tasks over spatial and temporal variables. Finally, we demonstrate STLCG++'s benefits through three robotics use cases and provide JAX and PyTorch libraries for seamless integration into modern robotics workflows.
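
The idea of evaluating temporal-operator robustness for all timesteps at once with a mask, instead of a recurrence, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the STLCG++ library or its API: time is discrete, windows are truncated at the end of the signal, and the names `window_mask`, `always`, and `eventually` are hypothetical.

```python
import numpy as np


def window_mask(T, a, b):
    """mask[t, s] is True when timestep s lies in the window [t + a, t + b]."""
    t = np.arange(T)[:, None]
    s = np.arange(T)[None, :]
    return (s >= t + a) & (s <= t + b)


def always(rho, a, b):
    """Robustness trace of 'always over [a, b]' given the trace rho of the subformula.

    Out-of-window entries are replaced by +inf so they never win the min.
    A window lying entirely past the end of the signal yields +inf.
    """
    rho = np.asarray(rho, dtype=float)
    mask = window_mask(len(rho), a, b)
    return np.where(mask, rho[None, :], np.inf).min(axis=1)


def eventually(rho, a, b):
    """Robustness trace of 'eventually over [a, b]' (max over the window)."""
    rho = np.asarray(rho, dtype=float)
    mask = window_mask(len(rho), a, b)
    return np.where(mask, rho[None, :], -np.inf).max(axis=1)


x = np.array([1.0, 2.0, 0.5, 3.0])
rho = x - 1.0                  # robustness of the predicate x > 1
print(always(rho, 0, 1))       # [ 0.  -0.5 -0.5  2. ]
print(eventually(rho, 0, 1))   # [1. 1. 2. 2.]
```

Every row of the masked matrix is reduced independently, so the computation has no sequential dependence across timesteps. The hard min and max used here are not smooth; a gradient-based workflow would typically substitute a smooth approximation.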

Abstract:
Intent inferencing in teleoperation has been instrumental in aligning operator goals and coordinating actions with robotic partners. However, current intent inference methods often ignore subtle motions that can be strong indicators of a sudden change in intent. Specifically, we aim to tackle 1) whether we can detect sudden jumps in operator trajectories, 2) how to appropriately use these sudden jump motions to infer an operator's goal state, and 3) how to incorporate these discontinuous and continuous dynamics to infer operator motion. Our framework, called Psychic, models these small indicative motions through a jump-drift-diffusion stochastic differential equation to cover discontinuous and continuous dynamics. Kramers-Moyal (KM) coefficients allow us to detect jumps within a trajectory, which we pair with a statistical outlier detection algorithm to nominate goal transitions. Through identifying jumps, we can perform early detection of existing goals and discover undefined goals in unstructured scenarios. Our framework then applies a Sparse Identification of Nonlinear Dynamics (SINDy) model using KM coefficients with the goal transitions as a control input to infer an operator's motion behavior in unstructured scenarios. We demonstrate that Psychic can produce probabilistic reachability sets and compare our strategy to a negative log-likelihood model fit. We perform a retrospective study on 600 operator trajectories in a hands-free teleoperation task to evaluate the efficacy of our open-source package, Psychic, in both offline and online learning.

Abstract:
Safe navigation within a workspace is a fundamental skill for autonomous robots to accomplish more complex tasks. Harmonic potentials are artificial potential fields that are analytical, globally convergent and provably free of local minima. Thus, they have been widely used for generating safe and reliable robot navigation control policies. However, most existing methods do not allow customization of the harmonic potential fields or the resulting paths, particularly regarding their topological properties. In this paper, we propose a novel method that automatically finds homotopy classes of paths that can be generated by valid harmonic potential fields. The considered complex workspaces can be as general as forest worlds consisting of numerous overlapping star-obstacles. The method is based on a hybrid optimization algorithm that searches over homotopy classes, selects the structure of each tree-of-stars within the forest, and optimizes over the continuous weight parameters for each purged tree via projected gradient descent. The key insight is to transform the forest world to the unbounded point world via proper diffeomorphic transformations. This not only facilitates a simpler design of the multi-directional D-signature between non-homotopic paths, but also retains the safety and convergence properties. Extensive simulations and hardware experiments are conducted for non-trivial scenarios, where the navigation potentials are customized for desired homotopic properties.

Abstract:
Reinforcement learning (RL) has emerged as a powerful method to learn robust control policies for bipedal locomotion. Yet, it can be difficult to tune desired robot behaviors due to unintuitive and complex reward design. In comparison, trajectory optimization-based methods offer more tuneable, interpretable, and mathematically grounded motion plans for high-dimensional legged systems. However, these methods often remain brittle to real-world disturbances like external perturbations. In this work, we present NaviGait, a hierarchical framework that combines the structure of trajectory optimization with the adaptability of RL for robust and intuitive locomotion control. NaviGait leverages RL to synthesize new motions by selecting, minimally morphing, and stabilizing gaits taken from an offline-generated gait library. NaviGait results in walking policies that match the reference motion well while maintaining robustness comparable to other locomotion controllers. Additionally, the structure imposed by NaviGait drastically simplifies the RL reward composition. Our experimental results demonstrate that NaviGait enables faster training compared to conventional and imitation-based RL, and produces motions that remain closest to the original reference. Overall, by decoupling high-level motion generation from low-level correction, NaviGait offers a more scalable and generalizable approach for achieving dynamic and robust locomotion. Videos and the full framework are publicly available at https://dynamicmobility.github.io/navigait/.

Abstract:
Diffusion-based models have recently shown strong performance in trajectory planning, as they are capable of capturing the diverse, multi-modal distributions of complex behaviors. A key limitation of these models, however, is their slow inference speed due to the iterative denoising process. This makes them less suitable for real-time applications such as closed-loop model predictive control (MPC), where plans must be generated quickly and adapted continuously to a changing environment. In this paper, we investigate Implicit Maximum Likelihood Estimation (IMLE) as an alternative generative modeling approach for planning. IMLE offers strong mode coverage while enabling inference that is two orders of magnitude faster, making it particularly well-suited for real-time MPC tasks. Our results demonstrate that IMLE achieves competitive performance on standard offline reinforcement learning benchmarks compared to the standard diffusion-based planner, while substantially improving planning speed in both open-loop and closed-loop settings. We further validate IMLE in a real-time closed-loop human navigation scenario, demonstrating how it enables rapid and adaptive plan generation in dynamic environments.

Abstract:
Sequential Visual Place Recognition (Seq-VPR) leverages transformers to capture spatio-temporal features effectively. In practice, a transformer-based Seq-VPR model should be flexible to the number of frames per sequence (sequence length), deliver fast inference, and use little memory to meet real-time constraints; however, existing approaches often prioritize performance at the expense of flexibility and efficiency. To address this gap, we propose Adapt-STformer, a Seq-VPR method built around our novel Recurrent Deformable Transformer Encoder (Recurrent-DTE), which uses an iterative recurrent mechanism to fuse information from multiple sequential frames. This design naturally supports variable sequence lengths, fast inference, and low memory usage. Experiments on the Nordland, Oxford, and NuScenes datasets show that Adapt-STformer boosts recall by up to 17% while reducing sequence extraction time by 36% and lowering memory usage by 35% relative to our best comparable baseline. Our code is released at https://ai4ce.github.io/Adapt-STFormer/

Abstract:
Image-goal navigation (ImageNav) tasks a robot with autonomously exploring an unknown environment and reaching a location that visually matches a given target image. While prior works primarily study ImageNav for ground robots, enabling this capability for autonomous drones is substantially more challenging due to their need for high-frequency feedback control and global localization for stable flight. In this paper, we propose a novel sim-to-real framework that leverages reinforcement learning (RL) to achieve ImageNav for drones. To enhance visual representation ability, our approach trains the vision backbone with auxiliary tasks, including image perturbations and future transition prediction, which results in more effective policy training. The proposed algorithm enables end-to-end ImageNav with direct velocity control, eliminating the need for external localization. Furthermore, we integrate a depth-based safety module for real-time obstacle avoidance, allowing the drone to safely navigate in cluttered environments. Unlike most existing drone navigation methods that focus solely on reference tracking or obstacle avoidance, our framework supports comprehensive navigation behaviors, including autonomous exploration, obstacle avoidance, and image-goal seeking, without requiring explicit global mapping.

Abstract:
Inertial measurement units (IMUs), which provide high-frequency linear acceleration and angular velocity measurements, serve as fundamental sensing modalities in robotic systems. Recent advances in deep neural networks have led to remarkable progress in inertial odometry. However, the heavy reliance on ground truth data during training fundamentally limits scalability and generalization to unseen and diverse environments. We propose KISS-IMU, a novel self-supervised inertial odometry framework that eliminates ground truth dependency by leveraging simple LiDAR-based ICP registration and pose graph optimization as a supervisory signal. Our approach embodies two key principles: keeping the IMU stable through motion-aware balanced training and keeping the IMU strong through uncertainty-driven adaptive weighting during inference. To evaluate performance across diverse motion patterns and scenarios, we conducted comprehensive experiments on various real-world platforms, including quadruped robots. Importantly, we train only the IMU network in a self-supervised manner, with LiDAR serving solely as a lightweight supervisory signal rather than requiring additional learnable processes. This design enables the framework to ensure robustness without relying on joint multimodal learning or ground truth supervision. The supplementary materials are available at https://sparolab.github.io/research/kiss_imu.

Abstract:
Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that excel at mapping language instructions to actions (L2A). However, this unidirectional training paradigm often produces policies that can execute tasks without deeper contextual understanding, thereby limiting their ability to generalize and to explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic and robust grounding. An agent capable of both acting and explaining its actions can form richer internal representations and, critically, unlock new paradigms for self-supervised learning. In this paper, we introduce LACY (Language-Action CYcle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between language pairs (L2C). The framework enables a self-improving cycle that autonomously generates new training data by chaining the L2A and A2L modules in an L2A2L pipeline. The L2C module then filters this data using an active data augmentation strategy that selectively targets low-confidence cases, thereby improving the model efficiently without requiring additional human annotations. Extensive experiments on pick-and-place tasks in both simulation and the real world demonstrate that LACY substantially improves task success rates by over 56.46% compared to baseline methods and yields more robust language-action grounding for robotic manipulation.

Abstract:
Automating chicken shoulder deboning requires precise 6-DoF cutting through a partially occluded, deformable, multi-material joint, since contact with the bones presents serious health and safety risks. Our work makes both systems-level and algorithmic contributions to train and deploy a reactive force-feedback cutting policy that dynamically adapts a nominal trajectory and enables full 6-DoF knife control to traverse the narrow joint gap while avoiding contact with the bones. First, we introduce an open-source custom-built simulator for multi-material cutting that models coupling, fracture, and cutting forces, and supports reinforcement learning, enabling efficient training and rapid prototyping. Second, we design a reusable physical testbed to emulate the chicken shoulder: two rigid "bone" spheres with controllable pose embedded in a softer block, enabling rigorous and repeatable evaluation while preserving essential multi-material characteristics of the target problem. Third, we train and deploy a residual RL policy, with discretized force observations and domain randomization, enabling robust zero-shot sim-to-real transfer and the first demonstration of a learned policy that debones a real chicken shoulder. Our experiments in our simulator, on our physical testbed, and on real chicken shoulders show that our learned policy reliably navigates the joint gap and reduces undesired bone/cartilage contact, resulting in up to a 4x improvement over existing open-loop cutting baselines in terms of success rate and bone avoidance. Our results also illustrate the necessity of force feedback for safe and effective multi-material cutting. The project website is at https://star-lab.cc.gatech.edu/papers/Yang-automated-deboning.

Abstract:
Recent advancements in world models have revolutionized dynamic environment simulation, allowing systems to foresee future states and assess potential actions. In autonomous driving, these capabilities help vehicles anticipate the behavior of other road users, perform risk-aware planning, accelerate training in simulation, and adapt to novel scenarios, thereby enhancing safety and reliability. Current approaches either fail to maintain robust 3D geometric consistency or accumulate artifacts during occlusion handling, both of which are critical for reliable safety assessment in autonomous navigation tasks. To address this, we introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models to enhance spatial understanding and action controllability. Specifically, we first extract a 3D representation from the input frame and then obtain its 2D rendering based on the user-specified ego-car trajectory. To enable dynamic modeling, we propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Extensive experiments demonstrate that our method significantly outperforms existing models in both action accuracy and 3D spatial awareness, leading to more realistic, adaptable, and reliable scene modeling for safer autonomous driving. Additionally, our model can generalize to novel trajectories and offers interactive scene editing capabilities, such as object editing and object trajectory control.

Abstract:
This paper introduces a novel Model Predictive Control (MPC) implementation for legged robot locomotion that leverages GPU parallelization. Our approach enables both temporal and state-space parallelization by incorporating a parallel associative scan to solve the primal-dual Karush-Kuhn-Tucker (KKT) system. In this way, the optimal control problem is solved in O(log2(n)log(N) + log2(m)) complexity, instead of O(N(n + m)^3), where n, m, and N are the dimension of the system state, the dimension of the control vector, and the length of the prediction horizon, respectively. We demonstrate the advantages of this implementation over two state-of-the-art solvers (acados and crocoddyl), achieving up to a 60% improvement in runtime for Whole Body Dynamics (WB)-MPC and a 700% improvement for Single Rigid Body Dynamics (SRBD)-MPC when varying the prediction horizon length. The presented formulation scales efficiently with the problem state dimensions as well, enabling the definition of a centralized controller for up to 16 legged robots that can be computed in less than 25 ms. Furthermore, thanks to the JAX implementation, the solver supports large-scale parallelization across multiple environments, allowing learning with the MPC in the loop directly on the GPU. The code associated with this work can be found at https://github.com/iit-DLSLab/mpx.
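
The associative-scan idea mentioned above can be illustrated on a much simpler problem than the KKT system. The sketch below is not the authors' solver: it assumes a scalar affine recurrence x_{k+1} = a_k x_k + b_k, and the names `combine`, `inclusive_scan`, and `sequential_rollout` are hypothetical. It only shows why composing affine maps with an associative operator lets all prefix states be obtained in about log2(N) rounds instead of N sequential steps. In JAX, the same pattern would typically be expressed with `jax.lax.associative_scan`.

```python
from typing import List, Tuple

Affine = Tuple[float, float]  # (a, b) represents the map x -> a * x + b


def combine(first: Affine, second: Affine) -> Affine:
    """Compose two affine maps: apply `first`, then `second` (associative)."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def inclusive_scan(elems: List[Affine]) -> List[Affine]:
    """Hillis-Steele inclusive scan using ceil(log2(N)) rounds.

    Within one round, every combine() call reads only the previous round's
    values, so on parallel hardware all of them could run at the same time.
    """
    out = list(elems)
    step = 1
    while step < len(out):
        prev = list(out)  # snapshot of the previous round
        for i in range(step, len(out)):
            out[i] = combine(prev[i - step], prev[i])
        step *= 2
    return out


def sequential_rollout(elems: List[Affine], x0: float) -> List[float]:
    """Reference: the ordinary step-by-step recurrence."""
    states, x = [], x0
    for a, b in elems:
        x = a * x + b
        states.append(x)
    return states


if __name__ == "__main__":
    dynamics = [(0.5, 1.0), (2.0, -1.0), (1.5, 0.0), (1.0, 2.0), (0.25, 1.0)]
    x0 = 3.0
    # Applying each prefix map to x0 reproduces the sequential states.
    print([a * x0 + b for a, b in inclusive_scan(dynamics)])
    print(sequential_rollout(dynamics, x0))
```

The scan does more total work than the sequential loop, which is the usual trade: extra arithmetic in exchange for a dependency chain that grows logarithmically with the horizon.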

Abstract:
Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed-Loop Diffusion Policy framework that integrates chunk-based action generation with real-time correction. DCDP integrates a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and enhancing the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration, achieving both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: https://github.com/wupengyuan/dcdp

Abstract:
In this paper, we present the tidiness score-guided Monte Carlo tree search (TSMCTS), a novel framework designed to address the tabletop tidying up problem using only an RGB-D camera. We address two major challenges in tabletop tidying up: (1) the lack of public datasets and benchmarks, and (2) the difficulty of specifying the goal configuration of unseen objects. We address the former by presenting the tabletop tidying up (TTU) dataset, a structured dataset collected in simulation. Using this dataset, we train a vision-based discriminator capable of predicting the tidiness score. This discriminator can consistently evaluate the degree of tidiness across unseen configurations, including real-world scenes. To address the second problem, we employ Monte Carlo tree search (MCTS) to find tidying trajectories without specifying explicit goals. Instead of providing specific goals, we demonstrate that our MCTS-based planner can find diverse tidied configurations using the tidiness score as guidance. Consequently, we propose TSMCTS, which integrates a tidiness discriminator with an MCTS-based tidying planner to find optimal tidied arrangements. TSMCTS has successfully demonstrated its capability across various environments, including coffee tables, dining tables, office desks, and bathrooms. The TTU dataset and code will be publicly available.

Abstract:
Classical autonomous driving systems connect perception and prediction modules via hand-crafted bounding-box interfaces, limiting information flow and propagating errors to downstream tasks. Recent research aims to develop end-to-end models that jointly address perception and prediction; however, they often fail to fully exploit the synergy between appearance and motion cues, relying mainly on short-term visual features. We follow the idea of "looking backward to look forward" and propose MASAR, a novel fully differentiable framework for joint 3D detection and trajectory forecasting compatible with any transformer-based 3D detector. MASAR employs an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features. By predicting past trajectories and refining them using guidance from appearance cues, MASAR captures long-term temporal dependencies that enhance future trajectory forecasting. Experiments conducted on the nuScenes dataset demonstrate MASAR's effectiveness, showing improvements of over 20% in minADE and minFDE while maintaining robust detection performance. Code and models are available at link.

Abstract:
Enabling robots to dexterously grasp and manipulate objects based on human commands is a promising direction in robotics. However, existing approaches struggle to generalize across diverse objects or tasks due to the limited scale of semantic dexterous grasp datasets. Foundation models offer a new way to enhance generalization, yet directly leveraging them to generate feasible robotic actions remains challenging due to the gap between abstract model knowledge and physical robot execution. To address these challenges, we propose OmniDexGrasp, a generalizable framework that achieves omni-capabilities in user prompting, dexterous embodiment, and grasping tasks by combining foundation models with transfer and control strategies. OmniDexGrasp integrates three key modules: (i) foundation models are used to enhance generalization by generating human grasp images, supporting omni-capability of user prompt and task; (ii) a human-image-to-robot-action transfer strategy converts human demonstrations into executable robot actions, enabling omni dexterous embodiment; (iii) a force-aware adaptive grasp strategy ensures robust and stable grasp execution. Experiments in simulation and on real robots validate the effectiveness of OmniDexGrasp on diverse user prompts, grasp tasks, and dexterous hands, and further results show its extensibility to dexterous manipulation tasks.

Abstract:
Vision-Language Models (VLMs) have been increasingly integrated into object navigation tasks for their rich prior knowledge and strong reasoning abilities. However, applying VLMs to navigation presents two key challenges: effectively parsing and structuring complex environment information and determining when and how to query VLMs. Insufficient environment understanding and over-reliance on VLMs (e.g. querying at every step) can easily lead to unnecessary backtracking and reduced navigation efficiency, especially in large continuous environments. To address these challenges, we propose a novel framework that incrementally constructs a multi-layer environment representation consisting of viewpoints, object nodes, and room nodes during navigation. Viewpoints and object nodes facilitate intra-room exploration and accurate target localization, while room nodes support efficient inter-room planning. Building on this structured representation, we propose a novel two-stage navigation policy, integrating high-level planning guided by VLM reasoning with low-level VLM-assisted exploration to efficiently and reliably locate a goal object. We evaluated our approach on four simulated benchmarks (HM3D v1&v2, RoboTHOR, and MP3D), and achieved state-of-the-art performance on both the success rate (SR% 13.1%) and navigation efficiency (SPL% 6.2%). We further validate our method on a real robot platform, demonstrating strong robustness across 120 episodes in 10 different indoor environments. Project page is available at: https://zwandering.github.io/STRIVE.github.io/.

Abstract:
Autonomous navigation in dynamic environments requires spatial representations that capture both semantic structure and temporal evolution. 3D Scene Graphs (3DSGs) provide hierarchical multi-resolution abstractions that encode geometry and semantics, but existing extensions toward dynamics largely focus on individual objects or agents. In parallel, Maps of Dynamics (MoDs) model typical motion patterns and temporal regularities, yet are usually tied to grid-based discretizations that lack semantic awareness and do not scale well to large environments. In this paper, we introduce Aion, a framework that embeds temporal flow dynamics directly within a hierarchical 3DSG, effectively incorporating the temporal dimension. Aion employs a graph-based sparse MoD representation to capture motion flows over arbitrary time intervals and attaches them to navigational nodes in the scene graph, yielding more interpretable and scalable predictions that improve planning and interaction in complex dynamic environments.

Abstract:
This research introduces two efficient methods to estimate the collision risk of planned trajectories in autonomous driving under uncertain driving conditions. Deterministic collision checks of planned trajectories are often inaccurate or overly conservative, as noisy perception, localization errors, and uncertain predictions of other traffic participants introduce significant uncertainty into the planning process. This paper presents two semi-analytic methods to compute the collision probability of planned trajectories with arbitrary convex obstacles. The first approach evaluates the probability of spatial overlap between an autonomous vehicle and surrounding obstacles, while the second estimates the collision probability based on stochastic boundary crossings. Both formulations incorporate full state uncertainties, including position, orientation and velocity, and achieve high accuracy at computational costs suitable for real-time planning. Simulation studies verify that the proposed methods closely match Monte Carlo results while providing significant runtime advantages, enabling their use in risk-aware trajectory planning. The collision estimation methods are available as open-source software.
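
For context on the Monte Carlo reference that the abstract above compares against, the following is a minimal sketch of a sampling-based collision probability estimate. It is not the paper's semi-analytic method, and it is deliberately simplified: it assumes independent Gaussian position uncertainty only (no orientation or velocity), and it replaces arbitrary convex shapes with a single circular obstacle whose radius already includes the vehicle footprint. The function name `mc_collision_probability` and all parameters are hypothetical.

```python
import math
import random
from typing import Tuple


def mc_collision_probability(
    mean_xy: Tuple[float, float],
    sigma_xy: Tuple[float, float],
    obstacle_xy: Tuple[float, float],
    combined_radius: float,
    samples: int = 20000,
    seed: int = 0,
) -> float:
    """Estimate P(collision) by sampling the uncertain vehicle position.

    Assumptions (not from the paper): independent Gaussian noise per axis,
    and a collision whenever the sampled position lies within
    `combined_radius` of the obstacle centre.
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    hits = 0
    for _ in range(samples):
        x = rng.gauss(mean_xy[0], sigma_xy[0])
        y = rng.gauss(mean_xy[1], sigma_xy[1])
        if math.hypot(x - obstacle_xy[0], y - obstacle_xy[1]) <= combined_radius:
            hits += 1
    return hits / samples


if __name__ == "__main__":
    # Vehicle expected exactly on the obstacle boundary, 0.2 m position noise.
    print(mc_collision_probability((1.0, 0.0), (0.2, 0.2), (0.0, 0.0), 1.0))
```

The estimator's standard error shrinks only with the square root of the sample count, which is why sampling of this kind is usually kept as a reference rather than run inside a real-time planner.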

Abstract:
We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from merely single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework of visual information and geometric cues. Furthermore, we propose a Multi-modal Feature Matching strategy coupled with a Multi-scale Gaussian Decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on Waymo and KITTI demonstrate that our ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization capabilities in novel-view shifting.

Abstract:
The inherent sequential modeling capabilities of autoregressive models make them a formidable baseline for end-to-end planning in autonomous driving. Nevertheless, their performance is constrained by a spatio-temporal misalignment, as the planner must condition future actions on past sensory data. This creates an inconsistent worldview, limiting the upper bound of performance for an otherwise powerful approach. To address this, we propose a Time-Invariant Spatial Alignment (TISA) module that learns to project initial environmental features into a consistent ego-centric frame for each future time step, effectively correcting the agent's worldview without explicit future scene prediction. In addition, we employ a kinematic action prediction head (i.e., acceleration and yaw rate) to ensure physically feasible trajectories. Finally, we introduce a multi-objective post-training stage using Direct Preference Optimization (DPO) to move beyond pure imitation. Our approach provides targeted feedback on specific driving behaviors, offering a more fine-grained learning signal than the single, overall objective used in standard DPO. Our model achieves a state-of-the-art 89.8 PDMS on the NAVSIM dataset among autoregressive models.
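
To make the role of a kinematic action head concrete, here is a minimal sketch of how predicted (acceleration, yaw rate) pairs could be integrated into a trajectory. The abstract above does not specify the integration scheme, so the unicycle model, the 0.5 s step, the non-negative speed clamp, and the name `rollout_kinematic` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Pose = Tuple[float, float, float]  # (x, y, yaw) in the initial ego frame


def rollout_kinematic(
    v0: float,
    actions: Sequence[Tuple[float, float]],
    dt: float = 0.5,
) -> List[Pose]:
    """Integrate (acceleration, yaw_rate) actions with a unicycle model.

    Because positions are derived from speed and heading, consecutive
    waypoints cannot jump sideways or teleport, which is the sense in
    which such a head yields physically plausible trajectories.
    """
    x = y = yaw = 0.0
    v = v0
    poses: List[Pose] = []
    for accel, yaw_rate in actions:
        v = max(0.0, v + accel * dt)  # assumed: the vehicle does not reverse
        yaw += yaw_rate * dt
        x += v * math.cos(yaw) * dt
        y += v * math.sin(yaw) * dt
        poses.append((x, y, yaw))
    return poses


if __name__ == "__main__":
    # Constant speed with a gentle left turn over four steps.
    for pose in rollout_kinematic(2.0, [(0.0, 0.1)] * 4):
        print(pose)
```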

Abstract:
Robust robot navigation in outdoor environments requires accurate perception systems capable of handling visual challenges such as repetitive structures and changing appearances. Visual feature matching is crucial to vision-based pipelines but remains particularly challenging in natural outdoor settings due to perceptual aliasing. We address this issue in vineyards, where repetitive vine trunks and other natural elements generate ambiguous descriptors that hinder reliable feature matching. We hypothesise that semantic information tied to keypoint positions can alleviate perceptual aliasing by enhancing keypoint descriptor distinctiveness. To this end, we introduce a keypoint semantic integration technique that improves the descriptors in semantically meaningful regions within the image, enabling more accurate differentiation even among visually similar local features. We validate this approach in two vineyard perception tasks: (i) relative pose estimation and (ii) visual localisation. Our method improves matching accuracy across all tested keypoint types and descriptors, demonstrating its effectiveness over multiple months in challenging vineyard conditions.

Abstract:
Learning safe and stable robot motions from demonstrations remains a challenge, especially in complex, nonlinear tasks involving dynamic, obstacle-rich environments. In this paper, we propose Safe and Stable Neural Network Dynamical Systems (S²-NNDS), a learning-from-demonstration framework that simultaneously learns expressive neural dynamical systems alongside neural Lyapunov stability and barrier safety certificates. Unlike traditional approaches with restrictive polynomial parameterizations, S²-NNDS leverages neural networks to capture complex robot motions while providing probabilistic guarantees through split conformal prediction in learned certificates. Experimental results on various 2D and 3D datasets, including LASA handwriting and demonstrations recorded kinesthetically from the Franka Emika Panda robot, validate the effectiveness of S²-NNDS in learning robust, safe, and stable motions from potentially unsafe demonstrations.

Abstract:
To handle the complexities of real-world traffic, learning planners for self-driving from data is a promising direction. While recent approaches have shown great progress, they typically assume a setting in which the ground-truth world state is available as input. However, when deployed, planning needs to be robust to the long tail of errors incurred by a noisy perception system, which is often neglected in evaluation. To address this, previous work has proposed drawing adversarial samples from a perception error model (PEM) mimicking the noise characteristics of a target object detector. However, these methods use simple PEMs that fail to accurately capture all failure modes of detection. In this paper, we present EMPERROR, a novel transformer-based generative PEM, apply it to stress-test an imitation learning (IL)-based planner, and show that it imitates modern detectors more faithfully than previous work. Furthermore, it is able to produce realistic noisy inputs that increase the planner's collision rate by up to 85%, demonstrating its utility as a valuable tool for a more complete evaluation of self-driving planners.

Abstract:
Object shape and pose estimation is a foundational robotics problem, supporting tasks from manipulation to scene understanding and navigation. We present a fast local solver for shape and pose estimation which requires only category-level object priors and admits an efficient certificate of global optimality. Given an RGB-D image of an object, we use a learned front-end to detect sparse, category-level semantic keypoints on the target object. We represent the target object's unknown shape using a linear active shape model and pose a maximum a posteriori optimization problem to solve for position, orientation, and shape simultaneously. Expressed in unit quaternions, this problem admits first-order optimality conditions in the form of an eigenvalue problem with eigenvector nonlinearities. Our primary contribution is to solve this problem efficiently with self-consistent field iteration, which only requires computing a 4-by-4 matrix and finding its minimum eigenvalue-vector pair at each iterate. Solving a linear system for the corresponding Lagrange multipliers gives a simple global optimality certificate. One iteration of our solver runs in about 100 microseconds, enabling fast outlier rejection. We test our method on synthetic data and a variety of real-world settings, including two public datasets and a drone tracking scenario.

Abstract:
End-to-end autonomous driving has emerged as a pivotal direction in the field of autonomous systems. Recent works have demonstrated impressive performance by incorporating high-level guidance signals to steer low-level trajectory planners. However, their potential is often constrained by inaccurate high-level guidance and the computational overhead of complex guidance modules. To address these limitations, we propose Mimir, a novel hierarchical dual-system framework capable of generating robust trajectories relying on goal points with uncertainty estimation: (1) Unlike previous approaches that model goal points deterministically, we estimate goal point uncertainty with a Laplace distribution to enhance robustness; (2) To overcome the slow inference speed of the guidance system, we introduce a multi-rate guidance mechanism that predicts extended goal points in advance. Validated on challenging Navhard and Navtest benchmarks, Mimir surpasses previous state-of-the-art methods with a 20% improvement in the driving score EPDMS, while achieving 1.6× improvement in high-level module inference speed without compromising accuracy. The code and models will be released soon to promote reproducibility and further development.
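
One common way to estimate uncertainty with a Laplace distribution is to have the network predict a location and a scale per coordinate and train with the Laplace negative log-likelihood. The abstract above does not state the exact loss, so the sketch below is an assumption: it treats the goal point's coordinates as independent, and the name `laplace_nll` is hypothetical.

```python
import math
from typing import Sequence


def laplace_nll(
    target: Sequence[float],
    mean: Sequence[float],
    scale: Sequence[float],
) -> float:
    """Negative log-likelihood of `target` under independent Laplace axes.

    Per coordinate: log(2 * b) + |t - mu| / b, where b > 0 is the predicted
    scale. A large b discounts the error term but pays a log(2 * b) penalty,
    so the scale acts as a learned uncertainty.
    """
    total = 0.0
    for t, mu, b in zip(target, mean, scale):
        if b <= 0.0:
            raise ValueError("Laplace scale must be positive")
        total += math.log(2.0 * b) + abs(t - mu) / b
    return total


if __name__ == "__main__":
    goal = (10.0, 2.0)
    # Same 2 m error in x, predicted with low and with high uncertainty.
    print(laplace_nll(goal, (8.0, 2.0), (0.5, 0.5)))
    print(laplace_nll(goal, (8.0, 2.0), (2.0, 0.5)))
```

In this example the overconfident prediction receives the larger loss, which is the behavior that lets a downstream planner weight unreliable goal points less.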

Abstract:
Prehensile autonomous manipulation, such as peg insertion, tool use, or assembly, requires precise in-hand understanding of the object pose and the extrinsic contacts made during interactions. Providing accurate estimation of pose and contacts is challenging. Tactile sensors can provide local geometry at the sensor and force information about the grasp, but the locality of sensing means resolving poses and contacts from tactile sensing alone is often an ill-posed problem, as multiple configurations can be consistent with the observations. Adding visual feedback can help resolve ambiguities, but can suffer from noise and occlusions. In this work, we propose a method that pairs local observations from sensing with the physical constraints of contact. We propose a set of factors that ensure local consistency with tactile observations and enforce physical plausibility, namely, that the estimated pose and contacts must respect the kinematic and force constraints of quasi-static rigid body interactions. We formalize our problem as a factor graph, allowing for efficient estimation. In our experiments, we demonstrate that our method outperforms existing geometric and contact-informed estimation pipelines, especially when only tactile information is available.

Abstract:
Semantic mapping aims to construct a 3D semantic representation of the environment, providing essential knowledge for robots operating in complex outdoor settings. While Bayesian Kernel Inference (BKI) addresses discontinuities of map inference from sparse sensor data, existing semantic mapping methods suffer from various sources of uncertainties in challenging outdoor environments. To address these issues, we propose an uncertainty-aware semantic mapping framework that handles multiple sources of uncertainties, which significantly degrade mapping performance. Our method estimates uncertainties in semantic predictions using Evidential Deep Learning and incorporates them into BKI for robust semantic inference. It further aggregates noisy observations into coherent Gaussian representations to mitigate the impact of unreliable points, while employing geometry-aligned kernels that adapt to complex scene structures. These Gaussian primitives effectively fuse local geometric and semantic information, enabling robust, uncertainty-aware mapping in complex outdoor scenarios. Comprehensive evaluation across diverse off-road and urban outdoor environments demonstrates consistent improvements in mapping quality, uncertainty calibration, representational flexibility, and robustness, while maintaining real-time efficiency. Our project website: https://e2-bki.github.io/

Abstract:
In this paper, we introduce Semi-SMD, a novel metric depth estimation framework tailored for surrounding camera equipment in autonomous driving. The input data consists of adjacent surrounding frames and camera parameters. We propose a unified spatial-temporal-semantic fusion module to construct the visual fused features. Cross-attention components for surrounding cameras and adjacent frames are utilized to focus on metric scale information refinement and temporal feature matching. Building on this, we propose a pose estimation framework using surrounding cameras, their corresponding estimated depths, and extrinsic parameters, which effectively addresses the scale ambiguity in multi-camera setups. Moreover, a semantic world model and a monocular depth estimation world model are integrated to supervise the depth estimation, which improves the quality of depth estimation. We evaluate our algorithm on the DDAD and nuScenes datasets, and the results demonstrate that our method achieves state-of-the-art performance in terms of surrounding-camera-based depth estimation quality. The source code is available on GitHub.

Abstract:
We propose Point2Act, which directly retrieves the 3D action point relevant to a contextually described task, leveraging Multimodal Large Language Models (MLLMs). Foundation models have opened the possibility for generalist robots that can perform a zero-shot task following natural language descriptions within an unseen environment. While the semantics from large-scale image and language datasets provide contextual understanding in 2D images, existing methods that leverage foundation models for 3D reconstruction struggle to accurately interpret complex compositional queries and require extensive computation. Our proposed 3D relevancy fields bypass the high-dimensional features, instead efficiently imbuing lightweight 2D point-level guidance tailored to the task-specific action. The multi-view aggregation effectively compensates for misalignments caused by geometric ambiguities, such as occlusion, or semantic uncertainties inherent in the language descriptions. The output region is highly localized, leveraging fine-grained 3D spatial context to directly identify an explicit position for a physical action in the on-the-fly reconstruction of the scene. Our full-stack pipeline, which includes capturing, MLLM querying, 3D reconstruction, and grasp pose extraction, generates spatially grounded responses in 16.5 seconds, facilitating practical manipulation tasks.

Abstract:
Scaling real robot data is a key bottleneck in imitation learning, leading to the use of auxiliary data for policy training. While other aspects of robotic manipulation such as image or language understanding may be learned from internet-based datasets, acquiring motion knowledge remains challenging. Human data, with its rich diversity of manipulation behaviors, offers a valuable resource for this purpose. While previous works show that using human data can bring benefits, such as improving robustness and training efficiency, it remains unclear whether it can realize its greatest advantage: enabling robot policies to directly learn new motions for task completion. In this paper, we systematically explore this potential through multi-task human-robot cotraining. We introduce MotionTrans, a framework that includes a data collection system, a human data transformation pipeline, and a weighted cotraining strategy. By cotraining 30 human-robot tasks simultaneously, we directly transfer motions of 13 tasks from human data to deployable end-to-end robot policies. Notably, 9 tasks achieve non-trivial success rates in a zero-shot manner. MotionTrans also significantly enhances pretraining-finetuning performance (+40% success rate). These findings unlock the potential of motion-level learning from human data, offering insights into its effective use for training robotic manipulation policies. All data, code, and model weights will be open-sourced.

Abstract:
Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a "human-like social critic." CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot's description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. This approach is not tied to a specific robot API; it can generate subtly different, human-like motions on various platforms using only the robot's structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot's autonomous interaction capabilities and cross-platform applicability. Detailed result videos and supplementary information regarding this work are available at https://limjiyu99.github.io/inner-critic/.

Abstract:
Global navigation information and local scene understanding are two crucial components of autonomous driving systems. However, our experimental results indicate that many end-to-end autonomous driving systems tend to over-rely on local scene understanding while failing to utilize global navigation information. These systems exhibit weak correlation between their planning capabilities and navigation input, and struggle to perform navigation-following in complex scenarios. To overcome this limitation, we propose the Sequential Navigation Guidance (SNG) framework, an efficient representation of global navigation information based on real-world navigation patterns. The SNG encompasses both navigation paths for constraining long-term trajectories and turn-by-turn (TBT) information for real-time decision-making logic. We constructed the SNG-QA dataset, a visual question answering (VQA) dataset based on SNG that aligns global and local planning. Additionally, we introduce an efficient model SNG-VLA that fuses local planning with global planning. The SNG-VLA achieves state-of-the-art performance through precise navigation information modeling without requiring auxiliary loss functions from perception tasks.

Abstract:
LiDAR-based 3D single object tracking (3D SOT) is a critical task in robotics and autonomous systems. Existing methods typically follow frame-wise motion estimation or a sequence-based paradigm. However, two-frame methods are efficient but lack long-term temporal context, making them vulnerable in sparse or occluded scenes, while sequence-based methods that process multiple point clouds gain robustness at a significant computational cost. To resolve this dilemma, we propose a novel trajectory-based paradigm and its instantiation, TrajTrack. TrajTrack is a lightweight framework that enhances a base two-frame tracker by implicitly learning motion continuity from historical bounding box trajectories alone, without requiring additional, costly point cloud inputs. It first generates a fast, explicit motion proposal and then uses an implicit motion modeling module to predict the future trajectory, which in turn refines and corrects the initial proposal. Extensive experiments on the large-scale NuScenes benchmark show that TrajTrack achieves new state-of-the-art performance, dramatically improving tracking precision by 4.48% over a strong baseline while running at 56 FPS. We also demonstrate the strong generalizability of TrajTrack across different base trackers. Code will be available.

Abstract:
The Euclidean Signed Distance Field (ESDF) is widely used in visibility evaluation to prevent occlusions and collisions during tracking. However, frequent ESDF updates introduce considerable computational overhead. To address this issue, we propose Eva-Tracker, a visibility-aware trajectory planning framework for aerial tracking that eliminates ESDF updates and incorporates a recovery-capable path generation method for target reacquisition. First, we design a target trajectory prediction method and a visibility-aware initial path generation algorithm that maintain an appropriate observation distance, avoid occlusions, and enable rapid replanning to reacquire the target when it is lost. Then, we propose the Field of View ESDF (FoV-ESDF), a precomputed ESDF tailored to the tracker's field of view, enabling rapid visibility evaluation without requiring updates. Finally, we optimize the trajectory using differentiable FoV-ESDF-based objectives to ensure continuous visibility throughout the tracking process. Extensive simulations and real-world experiments demonstrate that our approach delivers more robust tracking results with lower computational effort than existing state-of-the-art methods.

Abstract:
Mobile robot navigation in dynamic human environments requires policies that balance adaptability to diverse behaviors with compliance to safety constraints. We hypothesize that integrating data-driven rewards with rule-based objectives enables navigation policies to achieve a more effective balance of adaptability and safety. To this end, we develop a framework that learns a density-based reward from positive and negative demonstrations and augments it with rule-based objectives for obstacle avoidance and goal reaching. A sampling-based lookahead controller produces supervisory actions that are both safe and adaptive, which are subsequently distilled into a compact student policy suitable for real-time operation with uncertainty estimates. Experiments in synthetic and elevator co-boarding simulations show consistent gains in success rate and time efficiency over baselines, and real-world demonstrations with human participants confirm the practicality of deployment. A video illustrating this work can be found on our project page https://chanwookim971024.github.io/PioneeR/.

Abstract:
Robot imitation learning is often hindered by the high cost of collecting large-scale, real-world data. This challenge is especially significant for low-cost robots designed for home use, as they must be both user-friendly and affordable. To address this, we propose the EasyMimic framework, a low-cost and replicable solution that enables robots to quickly learn manipulation policies from human video demonstrations captured with standard RGB cameras. Our method first extracts 3D hand trajectories from the videos. An action alignment module then maps these trajectories to the gripper control space of a low-cost robot. To bridge the human-to-robot domain gap, we introduce a simple and user-friendly hand visual augmentation strategy. We then use a co-training method, fine-tuning a model on both the processed human data and a small amount of robot data, enabling rapid adaptation to new tasks. Experiments on the low-cost LeRobot platform demonstrate that EasyMimic achieves high performance across various manipulation tasks. It significantly reduces the reliance on expensive robot data collection, offering a practical path for bringing intelligent robots into homes. Project website: https://zt375356.github.io/EasyMimic-Project/.

Abstract:
Monocular depth estimation (MDE) provides a useful tool for robotic perception, but its predictions are often uncertain and inaccurate in challenging environments such as surgical scenes where textureless surfaces, specular reflections, and occlusions are common. To address this, we propose ProbeMDE, a cost-aware active sensing framework that combines RGB images with sparse proprioceptive measurements for MDE. Our approach utilizes an ensemble of MDE models to predict dense depth maps conditioned on both RGB images and a sparse set of known depth measurements obtained via proprioception, where the robot has touched the environment in a known configuration. We quantify predictive uncertainty via the ensemble's variance and measure the gradient of the uncertainty with respect to candidate measurement locations. To prevent mode collapse while selecting maximally informative locations to propriocept (touch), we leverage Stein Variational Gradient Descent (SVGD) over this gradient map. We validate our method in both simulated and physical experiments on central airway obstruction surgical phantoms. Our results demonstrate that our approach outperforms baseline methods across standard depth estimation metrics, achieving higher accuracy while minimizing the number of required proprioceptive measurements.
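
The ensemble-variance step described above can be illustrated with a minimal sketch. It assumes each ensemble member has already produced a dense depth map; the function name `ensemble_depth_uncertainty` and the toy values are hypothetical and are not taken from the paper. The sketch covers only the per-pixel variance and a naive arg-max choice of the next touch location; it does not implement the gradient map or the SVGD-based selection.

```python
import numpy as np

def ensemble_depth_uncertainty(depth_maps):
    """Return per-pixel mean and variance over an ensemble.

    depth_maps: array-like of shape (M, H, W), one dense depth
    prediction per ensemble member (hypothetical input layout).
    """
    depth_maps = np.asarray(depth_maps, dtype=float)
    mean_depth = depth_maps.mean(axis=0)   # ensemble prediction
    variance = depth_maps.var(axis=0)      # disagreement = uncertainty proxy
    return mean_depth, variance

# Toy ensemble: 3 models, 2x2 image. Models agree on the left column
# and disagree most on the bottom-right pixel.
preds = np.array([
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.5], [3.0, 5.0]],
    [[1.0, 1.5], [3.0, 3.0]],
])
mean_depth, variance = ensemble_depth_uncertainty(preds)

# Naive selection: touch the single most uncertain pixel.
most_uncertain = np.unravel_index(np.argmax(variance), variance.shape)
```

Picking only the arg-max tends to cluster successive measurements in one region, which is the kind of collapse the abstract says SVGD is used to prevent.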

Abstract:
Developing efficient and accurate visuomotor policies poses a central challenge in robotic imitation learning. While recent rectified flow approaches have advanced visuomotor policy learning, they suffer from a key limitation: After iterative distillation, generated actions may deviate from the ground-truth actions corresponding to the current visual observation, leading to accumulated error as the reflow process repeats and unstable task execution. We present Selective Flow Alignment (SeFA), an efficient and accurate visuomotor policy learning framework. SeFA resolves this challenge by a selective flow alignment strategy, which leverages expert demonstrations to selectively correct generated actions and restore consistency with observations, while preserving multimodality. This design introduces a consistency correction mechanism that ensures generated actions remain observation-aligned without sacrificing the efficiency of one-step flow inference. Extensive experiments across both simulated and real-world manipulation tasks show that SeFA surpasses state-of-the-art diffusion-based and flow-based policies, achieving superior accuracy and robustness while reducing inference latency by over 98%. By unifying rectified flow efficiency with observation-consistent action generation, SeFA provides a scalable and dependable solution for real-time visuomotor policy learning.

Abstract:
Planning with learned dynamics models offers a promising approach toward versatile real-world manipulation, particularly in nonprehensile settings such as pushing or rolling, where accurate analytical models are difficult to obtain. However, collecting training data for learning-based methods can be costly and inefficient, as it often relies on randomly sampled interactions that are not necessarily the most informative. Furthermore, learned models tend to exhibit high uncertainty in underexplored regions of the skill space, undermining the reliability of long-horizon planning. To address these challenges, we propose ActivePusher, a novel framework that combines residual-physics modeling with kernel-based active learning, to focus data acquisition on the most informative skill parameters. Additionally, ActivePusher seamlessly integrates with model-based kinodynamic planners, leveraging uncertainty estimates to bias control sampling toward more reliable actions. We evaluate our approach in both simulation and real-world environments, and demonstrate that it consistently improves data efficiency and achieves higher planning success rates in comparison to baseline methods.

Abstract:
We introduce DreamControl, a novel methodology for learning autonomous whole-body humanoid skills. DreamControl leverages the strengths of diffusion models and Reinforcement Learning (RL): our core innovation is the use of a diffusion prior trained on human motion data, which subsequently guides an RL policy in simulation to complete specific tasks of interest (e.g., opening a drawer or picking up an object). We demonstrate that this human motion-informed prior allows RL to discover solutions unattainable by direct RL, and that diffusion models inherently promote natural-looking motions, aiding in sim-to-real transfer. We validate DreamControl's effectiveness on a Unitree G1 robot across a diverse set of challenging tasks involving simultaneous lower and upper body control and object interaction. Project website: genrobo.github.io/DreamControl/ Appendix: genrobo.github.io/DreamControl/Appendix.pdf

Abstract:
Soft pneumatic actuators (SPA) made from elastomeric materials can provide large strain and large force. The behavior of locally strain-restricted hyperelastic materials under inflation has been investigated thoroughly for shape reconfiguration, but requires further investigation for trajectories involving external force. In this work we model force-pressure-height relationships for a concentrically strain-limited class of soft pneumatic actuators and demonstrate the use of this model to design SPA response for object lifting. We predict relationships under different loadings by solving energy minimization equations and verify this theory by using an automated test rig to collect rich data for n=22 Ecoflex 00-30 membranes. We collect data using an active learning pipeline to efficiently model the design space. We show that this learned model outperforms the theory-based model and a naive regression. We use our model to optimize membrane design for different lift tasks and compare this performance to other designs. These contributions represent a step towards understanding the natural response for this class of actuator and embodying intelligent lifts in a single-pressure input actuator system.

Abstract:
Fine-grained understanding of human actions is essential for safe and intuitive human-robot interaction. We study the challenge of recognizing nearly symmetric actions, such as picking up vs. placing down a tool or opening vs. closing a drawer. These actions are common in close human-robot collaboration, yet they are rare and largely overlooked in mainstream vision frameworks. Pretrained vision foundation models (VFMs) are often adapted using probing, valued in robotics for its efficiency and low data needs, or parameter-efficient fine-tuning (PEFT), which adds temporal modeling through adapters or prompts. However, our analysis shows that probing is permutation-invariant and blind to frame order, while PEFT is prone to overfitting on smaller HRI datasets, and less practical in real-world robotics due to compute constraints. To address this, we introduce STEP (Self-attentive Temporal Embedding Probing), a lightweight extension to probing that models temporal order via frame-wise positional encodings, a global CLS token, and a simplified attention block. Compared to conventional probing, STEP improves accuracy by 4-10% on nearly symmetric actions and 6-15% overall across action recognition benchmarks in human-robot interaction, industrial assembly, and driver assistance. Beyond probing, STEP surpasses heavier PEFT methods and even outperforms fully fine-tuned models on all three benchmarks, establishing a new state of the art. Code and models will be made publicly available.
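
The claim that probing is blind to frame order can be checked numerically with a small sketch. The code below is an illustration under stated assumptions, not STEP's architecture: `attention_pool` is a hypothetical single-query attention pooling with no learned projections, the features are random stand-ins for frozen VFM outputs, and the sinusoidal encoding is the standard Transformer formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 16
frames = rng.normal(size=(T, D))    # stand-in for frozen per-frame features
reversed_frames = frames[::-1]      # same clip played backwards

def attention_pool(x, query):
    """Pool frames with softmax attention from a single query vector."""
    scores = x @ query / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ x

def sinusoidal_positions(num_frames, dim):
    """Standard sinusoidal positional encoding, shape (num_frames, dim)."""
    pos = np.arange(num_frames)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    enc = np.zeros((num_frames, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

query = rng.normal(size=D)          # hypothetical CLS-style query
pos = sinusoidal_positions(T, D)

# Without positional encodings, reversing the clip changes nothing.
no_pe_fwd = attention_pool(frames, query)
no_pe_rev = attention_pool(reversed_frames, query)

# With frame-wise positional encodings, forward and reversed clips differ.
pe_fwd = attention_pool(frames + pos, query)
pe_rev = attention_pool(reversed_frames + pos, query)

order_blind = np.allclose(no_pe_fwd, no_pe_rev)
order_aware = not np.allclose(pe_fwd, pe_rev)
```

Adding positional encodings before plain mean pooling would not help, since the mean of the encodings is the same for any frame order; the encodings only matter once a nonlinear pooling such as attention is applied.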

Abstract:
Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety, traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, and (2) safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy, both enforcing safer actions and biasing towards safer rewards, enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
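
To make the idea of a closed-form safety filter concrete, here is a minimal sketch under assumptions that are not from the paper: single-integrator dynamics (the state derivative equals the control), one circular obstacle with barrier h(x) = ||x - c||^2 - r^2, and the standard CBF condition grad h(x) . u + alpha * h(x) >= 0. With a single affine constraint, the minimally modifying quadratic program has the well-known projection solution used below. The function name `cbf_filter` and all numbers are hypothetical.

```python
import numpy as np

def cbf_filter(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that grad_h . u + alpha * h >= 0.

    Assumes single-integrator dynamics and a circular obstacle.
    Undefined if x coincides with the obstacle center (zero gradient).
    """
    offset = x - center
    h = offset @ offset - radius**2      # barrier value, h >= 0 is safe
    grad_h = 2.0 * offset                # gradient of h with respect to x
    residual = grad_h @ u_nom + alpha * h
    if residual >= 0.0:
        return u_nom                     # nominal action already satisfies the condition
    # Project onto the constraint boundary (closed-form QP solution).
    return u_nom - residual / (grad_h @ grad_h) * grad_h

x = np.array([2.0, 0.0])
center = np.array([0.0, 0.0])

# Driving fast toward the obstacle: the filter slows the approach.
u_safe = cbf_filter(x, np.array([-5.0, 0.0]), center, radius=1.0)

# Driving away from the obstacle: the action passes through unchanged.
u_free = cbf_filter(x, np.array([1.0, 0.0]), center, radius=1.0)
```

In a training loop of the kind the abstract describes, such a filter would be applied to rollout actions rather than at deployment time; how the paper handles discrete-time steps is not reproduced here.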

Abstract:
Embodied agents need to plan and act reliably in real and complex 3D environments. Classical planning (e.g., PDDL) offers structure and guarantees, but in practice it fails under noisy perception and incorrect predicate grounding. On the other hand, Large Language Model (LLM)-based planners leverage commonsense reasoning, yet frequently propose actions that are unfeasible or unsafe. Following recent works that combine the two approaches, we introduce ContextMatters, a framework that fuses LLMs and classical planning to perform hierarchical goal relaxation: the LLM helps ground symbols to the scene and, when the target is unreachable, it proposes functionally equivalent goals that progressively relax constraints, adapting the goal to the context of the agent's environment. Operating on 3D Scene Graphs, this mechanism turns many nominally unfeasible tasks into tractable plans and enables context-aware partial achievement when full completion is not achievable. Our experimental results show a +52.45% Success Rate improvement over state-of-the-art LLMs+PDDL baseline, demonstrating the effectiveness of our approach. Moreover, we validate the execution of ContextMatters in a real-world scenario by deploying it on a TIAGo robot. Anonymized code, dataset, and supplementary materials are available to the community at https://anonymous.4open.science/r/context-matters-13ED.

Abstract:
Robot-assisted therapy can deliver high-dose, task-specific training after neurologic injury, but most systems act primarily at the limb level, engaging the impaired neural circuits only indirectly, which remains a key barrier to truly contingent, neuroplasticity-targeted rehabilitation. We address this gap by implementing online, dual-state motor imagery control of an upper-limb exoskeleton, enabling goal-directed reaches to be both initiated and terminated directly from noninvasive EEG. Eight participants used EEG to initiate assistance and then volitionally halt the robot mid-trajectory. Across two online sessions, group-mean hit rates were 61.5% for onset and 64.5% for offset, demonstrating reliable start-stop command delivery despite instrumental noise and passive arm motion. Methodologically, we reveal a systematic, class-driven bias induced by common task-based recentering using an asymmetric margin diagnostic, and we introduce a class-agnostic fixation-based recentering method that tracks drift without sampling command classes while preserving class geometry. This substantially improves threshold-free separability (AUC gains: onset +56%, p=0.0117; offset +34%, p=0.0251) and reduces bias within and across days. Together, these results help bridge offline decoding and practical, intention-driven start-stop control of a rehabilitation exoskeleton, enabling precisely timed, contingent assistance aligned with neuroplasticity goals while supporting future clinical translation.

Abstract:
Continuum robots are advancing bronchoscopy procedures by accessing complex lung airways and enabling targeted interventions. However, their development is limited by the lack of realistic testing environments: real data is difficult to collect due to ethical constraints and patient safety concerns, and developing autonomy algorithms requires realistic imaging and physical feedback. We present ROOM (Realistic Optical Observation in Medicine), a comprehensive simulation framework designed for generating photorealistic bronchoscopy training data. By leveraging patient CT scans, our pipeline renders multi-modal sensor data including RGB images with realistic noise and light specularities, metric depth maps, surface normals, optical flow and point clouds at medically relevant scales. We validate the data generated by ROOM in two canonical tasks for medical robotics, multi-view pose estimation and monocular depth estimation, demonstrating diverse challenges that state-of-the-art methods must overcome to transfer to these medical settings. Furthermore, we show that the data produced by ROOM can be used to fine-tune existing depth estimation models to overcome these challenges, also enabling other downstream applications such as navigation. We expect that ROOM will enable large-scale data generation across diverse patient anatomies and procedural scenarios that are challenging to capture in clinical settings. Our code and data will be publicly released.

Abstract:
We introduce PAPRLE (Plug-And-Play Robotic Limb Environment), a modular ecosystem that enables flexible placement and control of robotic limbs. With PAPRLE, a user can reconfigure the arrangement of the robotic limbs and control them with various kinds of devices, including puppeteers, handheld controllers, and VR devices. To support diverse multi-limb setups, we also develop a pluggable puppeteer device that can be easily mounted and adapted to different robot configurations. PAPRLE unifies control signals across heterogeneous input devices and supports both task-space and joint-space control modalities, enabling control using puppeteers with different kinematic structures or devices without joint information, such as VR and handheld devices. It also offers feedback mechanisms for teleoperation, including force feedback between structurally dissimilar leader-follower pairs. The modular design of the system facilitates novel spatial arrangements of limbs and enables scalable data collection, thereby advancing research in embodied AI and learning-based control. We validate PAPRLE in real-world settings, demonstrating its versatility across diverse combinations of leader dev

Abstract:
Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios where context is critical for correct judgment, annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation for evaluating autonomous driving systems. The project page for DriveCritic is https://song-jingyu.github.io/DriveCritic.

Abstract:
A major bottleneck in off-road autonomous driving research lies in the scarcity of large-scale, high-quality datasets and benchmarks. To bridge this gap, we present ORAD-3D, which, to the best of our knowledge, is the largest dataset specifically curated for off-road autonomous driving. ORAD-3D covers a wide spectrum of terrains, including woodlands, farmlands, grasslands, riversides, gravel roads, cement roads, and rural areas, while capturing diverse environmental variations across weather conditions (sunny, rainy, foggy, and snowy) and illumination levels (bright daylight, daytime, twilight, and nighttime). Building upon this dataset, we establish a comprehensive suite of benchmark evaluations spanning five fundamental tasks: 2D free-space detection, 3D occupancy prediction, rough GPS-guided path planning, vision-language model-driven autonomous driving, and world model for off-road environments. Together, the dataset and benchmarks provide a unified and robust resource for advancing perception and planning in challenging off-road scenarios. The dataset is publicly available at https://github.com/chaytonmin/ORAD-3D-Dataset-For-Off-Road-AD.

Abstract:
Real-world data collection for robotics is costly and resource-intensive, requiring skilled operators and expensive hardware. Simulations offer a scalable alternative but often fail to achieve sim-to-real generalization due to geometric and visual gaps. To address these challenges, we propose Re^3Sim, a 3D-photorealistic real-to-sim system that targets both the geometric and the visual sim-to-real gap. Re^3Sim employs advanced 3D reconstruction and rendering techniques to faithfully recreate real-world scenarios, enabling real-time rendering of simulated cross-view cameras within a physics-based simulator. By using privileged information to collect expert demonstrations efficiently in simulation and training robot policies with imitation learning, we validate the effectiveness of the real-to-sim-to-real system across various manipulation task scenarios. Notably, with only simulated data, we can achieve zero-shot sim-to-real transfer with an average success rate exceeding 58%. To push the limit of real-to-sim, we further generate a large-scale simulation dataset, demonstrating how a robust policy can be built from simulation data that generalizes across various objects.

Abstract:
Dynamic manipulation is a key capability for advancing robot performance, enabling skills such as tossing. While recent learning-based approaches have pushed the field forward, most methods still rely on manually designed action parameterizations, limiting their ability to produce the highly coordinated motions required in complex tasks. Motion planning can generate feasible trajectories, but the dynamics gap, stemming from control inaccuracies, contact uncertainties, and aerodynamic effects, often causes large deviations between planned and executed trajectories. In this work, we propose Dynamics-Aware Motion Manifold Primitives (DA-MMP), a motion generation framework for goal-conditioned dynamic manipulation, and instantiate it on a challenging real-world ring-tossing task. Our approach extends motion manifold primitives to variable-length trajectories through a compact parameterization and learns a high-quality manifold from a large-scale dataset of planned motions. Building on this manifold, a conditional flow matching model is trained in the latent space with a small set of real-world trials, enabling the generation of throwing trajectories that account for execution dynamics. Experiments show that our method can generate coordinated and smooth motion trajectories for the ring-tossing task. In real-world evaluations, it achieves high success rates and even surpasses the performance of trained human experts. Moreover, it generalizes to novel targets beyond the training range, indicating that it successfully learns the underlying trajectory-dynamics mapping.

Abstract:
Egocentric action recognition enables robots to facilitate human-robot interactions and monitor task progress. Existing methods often rely solely on RGB videos, although additional modalities, such as audio, can improve accuracy under challenging conditions. However, most multimodal approaches assume that all modalities are available at inference time, leading to significant accuracy drops, or even failure, when inputs are missing. To address this limitation, we introduce KARMMA, a multimodal Knowledge distillation framework for egocentric Action Recognition robust to Missing ModAlities that does not require modality alignment across all samples during training or inference. KARMMA distills knowledge from a multimodal teacher into a multimodal student that leverages all available modalities while remaining robust to missing ones, enabling deployment across diverse sensor configurations without retraining. Our student uses approximately 50% fewer computational resources than the teacher, resulting in a lightweight and fast model that is well suited for on-robot deployment. Experiments on Epic-Kitchens and Something-Something demonstrate that our student achieves competitive accuracy while significantly reducing performance degradation under missing modality conditions.

Abstract:
Imitation learning has proven effective for training robots to perform complex tasks from expert human demonstrations. However, it remains limited by its reliance on high-quality, task-specific data, restricting adaptability to the diverse range of real-world object configurations and scenarios. In contrast, non-expert data (such as play data, suboptimal demonstrations, partial task completions, or rollouts from suboptimal policies) can offer broader coverage and lower collection costs. However, conventional imitation learning approaches fail to utilize this data effectively. To address these challenges, we posit that with the right design decisions, offline reinforcement learning can be used as a tool to harness non-expert data to enhance the performance of imitation learning policies. We show that while standard offline RL approaches can be ineffective at actually leveraging non-expert data under the sparse data coverage settings typically encountered in the real world, simple algorithmic modifications can allow for the utilization of this data, without significant additional assumptions. Our approach shows that broadening the support of the policy distribution can allow imitation algorithms augmented by offline RL to solve tasks robustly, showing considerably enhanced recovery and generalization behavior. In manipulation tasks, these innovations significantly increase the range of initial conditions where learned policies are successful when non-expert data is incorporated. Moreover, we show that these methods are able to leverage all collected data, including partial or suboptimal demonstrations, to bolster task-directed policy performance. This underscores the importance of algorithmic techniques for using non-expert data for robust policy learning in robotics.

Abstract:
Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations. However, large-scale datasets used for policy training often introduce substantial variability in quality, which can negatively impact performance. As a result, automatically curating datasets by filtering low-quality samples to improve quality becomes essential. Existing robotic curation approaches rely on costly manual annotations and perform curation at a coarse granularity, such as the dataset or trajectory level, failing to account for the quality of individual state-action pairs. To address this, we introduce SCIZOR, the first self-supervised transition-level curation framework that requires no annotations and scales to large-scale datasets to improve the performance of imitation learning policies and modern Vision-Language-Action (VLA) models. SCIZOR targets two complementary sources of low-quality data: suboptimal data, which hinders learning with undesirable actions, and redundant data, which dilutes training with repetitive patterns. SCIZOR leverages a self-supervised task progress predictor for suboptimal data to remove samples lacking task progression, and a deduplication module operating on joint state-action representation for samples with redundant patterns. Empirically, we show that SCIZOR enables imitation learning policies and modern VLA models to achieve higher performance with less data, yielding an average improvement of 15.4% across multiple benchmarks. More information is available at: https://scizor-icra2026.github.io

Abstract:
The use of robotics in humanitarian demining increasingly involves computer vision techniques to improve landmine detection capabilities. However, in the absence of diverse and realistic datasets, the reliable validation of algorithms remains a challenge for the research community. In this paper, we introduce MineInsight, a publicly available multi-sensor, multi-spectral dataset designed for off-road landmine detection. The dataset features 35 different targets (15 landmines and 20 commonly found objects) distributed along three distinct tracks, providing a diverse and realistic testing environment. MineInsight is, to the best of our knowledge, the first dataset to integrate dual-view sensor scans from both an Unmanned Ground Vehicle and its robotic arm, offering multiple viewpoints to mitigate occlusions and improve spatial awareness. It features two LiDARs, as well as images captured at diverse spectral ranges, including visible (RGB, monochrome), visible short-wave infrared (VIS-SWIR), and long-wave infrared (LWIR). Additionally, the dataset provides bounding boxes generated by an automated pipeline and refined with human supervision. We recorded approximately one hour of data in both daylight and nighttime conditions, resulting in around 38,000 RGB frames, 53,000 VIS-SWIR frames, and 108,000 LWIR frames. MineInsight serves as a benchmark for developing and evaluating landmine detection algorithms. Our dataset is available at https://github.com/mariomlz99/mineinsight.

Abstract:
In this work, we introduce SpikeATac, a multimodal tactile finger combining a taxelized and highly sensitive dynamic response (PVDF) with a static transduction method (capacitive) for multimodal touch sensing. Named for its 'spiky' response, SpikeATac's 16-taxel PVDF film sampled at 4 kHz provides fast, sensitive dynamic signals to the very onset and breaking of contact. We characterize the sensitivity of the different modalities, and show that SpikeATac provides the ability to stop quickly and delicately when grasping fragile, deformable objects. Beyond parallel grasping, we show that SpikeATac can be used in a learning-based framework to achieve new capabilities on a dexterous multifingered robot hand. We use a learning recipe that combines reinforcement learning from human feedback with tactile-based rewards to fine-tune the behavior of a policy to modulate force. Our hardware platform and learning pipeline together enable a difficult dexterous and contact-rich task that has not previously been achieved: in-hand manipulation of fragile objects. Videos are available at roamlab.github.io/spikeatac.

Abstract:
In earthwork and construction, excavators often encounter large rocks mixed with various soil conditions, requiring skilled operators. This paper presents a framework for achieving autonomous excavation using reinforcement learning (RL) through a rock excavation simulator. In the simulation, resolution can be defined by the particle size/number in the whole soil space. Fine-resolution simulations closely mimic real-world behavior but demand significant calculation time and challenging sample collection, while coarse-resolution simulations enable faster sample collection but deviate from real-world behavior. To combine the advantages of both resolutions, we explore using policies developed in coarse-resolution simulations for pre-training in fine-resolution simulations. To this end, we propose a novel policy learning framework called Progressive-Resolution Policy Distillation (PRPD), which progressively transfers policies through some middle-resolution simulations with conservative policy transfer to avoid domain gaps that could lead to policy transfer failure. Validation in a rock excavation simulator and nine real-world rock environments demonstrated that PRPD reduced sampling time to less than 1/7 while maintaining task success rates comparable to those achieved through policy learning in a fine-resolution simulation.

Abstract:
Recent advances in quadrupedal locomotion have focused on improving stability and performance across diverse environments. However, existing methods often lack adequate safety analysis and struggle to adapt to varying payloads and complex terrains, typically requiring extensive tuning. To overcome these challenges, we propose a Chance-Constrained Model Predictive Control (CCMPC) framework that explicitly models payload and terrain variability as distributions of parametric and additive disturbances within the single rigid body dynamics model. Our approach ensures safe and consistent performance under uncertain dynamics by expressing the model's friction cone constraints, which define the feasible set of ground reaction forces, as chance constraints. Moreover, we solve the resulting stochastic control problem using a computationally efficient quadratic programming formulation. Extensive Monte Carlo simulations of quadrupedal locomotion across varying payloads and complex terrains demonstrate that CCMPC significantly outperforms two competitive benchmarks, Linear MPC and MPC with hand-tuned safety margins, in maintaining stability, reducing foot slippage, and tracking the center of mass. Hardware experiments on the Unitree Go1 robot show successful locomotion across various indoor and outdoor terrains with unknown loads exceeding 50% of the robot's body weight, despite no additional parameter tuning.
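
One common way a friction chance constraint becomes a deterministic linear constraint, and so remains compatible with a quadratic program, can be sketched as follows. This is a simplified illustration, not the paper's formulation: it assumes a planar friction constraint |f_t| <= mu * f_n, a Gaussian friction coefficient mu ~ N(mu_mean, mu_std^2), and that each side of the cone is enforced individually with probability at least 1 - epsilon. Under those assumptions the constraint tightens to an effective coefficient mu_mean - z * mu_std, where z is the standard normal quantile at 1 - epsilon. The function names are hypothetical.

```python
from statistics import NormalDist

def tightened_friction_coefficient(mu_mean, mu_std, epsilon):
    """Effective friction coefficient for a per-side chance constraint.

    P(f_t <= mu * f_n) >= 1 - epsilon with mu ~ N(mu_mean, mu_std^2)
    and f_n >= 0 is equivalent to f_t <= (mu_mean - z * mu_std) * f_n.
    """
    z = NormalDist().inv_cdf(1.0 - epsilon)
    return mu_mean - z * mu_std

def satisfies_chance_constraint(f_tangential, f_normal, mu_mean, mu_std, epsilon):
    """Check a ground reaction force against the tightened friction cone."""
    mu_eff = tightened_friction_coefficient(mu_mean, mu_std, epsilon)
    return f_normal >= 0.0 and abs(f_tangential) <= mu_eff * f_normal

# Nominal friction 0.6 with standard deviation 0.05, 5% violation budget.
mu_eff = tightened_friction_coefficient(0.6, 0.05, 0.05)

# A force inside the nominal cone can still violate the tightened cone.
inside = satisfies_chance_constraint(50.0, 100.0, 0.6, 0.05, 0.05)
borderline = satisfies_chance_constraint(55.0, 100.0, 0.6, 0.05, 0.05)
```

The tightened constraint is still linear in the force variables, which is why this kind of reformulation keeps the optimization a quadratic program; a hand-tuned safety margin corresponds to picking the reduction by trial and error instead of deriving it from the assumed distribution.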

Abstract:
Musculoskeletal humanoids are robots that closely mimic the human musculoskeletal system, offering various advantages such as variable stiffness control, redundancy, and flexibility. However, their body structure is complex, and muscle paths often significantly deviate from geometric models. To address this, numerous studies have been conducted to learn body schema, particularly the relationships among joint angles, muscle tension, and muscle length. These studies typically rely solely on data collected from the actual robot, but this data collection process is labor-intensive, and learning becomes difficult when the amount of data is limited. Therefore, in this study, we propose a method that applies the concept of Physics-Informed Neural Networks (PINNs) to the learning of body schema in musculoskeletal humanoids, enabling high-accuracy learning even with a small amount of data. By utilizing not only data obtained from the actual robot but also the physical laws governing the relationship between torque and muscle tension under the assumption of correct joint structure, more efficient learning becomes possible. We apply the proposed method to both simulation and an actual musculoskeletal humanoid and discuss its effectiveness and characteristics.

Abstract:
We propose enforcing constraints on Model-Based Diffusion by introducing emerging barrier functions inspired by interior point methods. We show that constraints on Model-Based Diffusion can lead to catastrophic performance degradation, even on simple 2D systems, due to sample inefficiency in the Monte Carlo approximation of the score function. We introduce Emerging-Barrier Model-Based Diffusion (EB-MBD), which uses progressively introduced barrier constraints to avoid these problems, significantly improving solution quality without expensive operations such as projections. We analyze the liveliness of samples at each iteration to inform the choice of barrier parameter schedule. We demonstrate results for 2D collision avoidance and a 3D underwater manipulator system and show that our method achieves lower-cost solutions than Model-Based Diffusion and requires orders of magnitude less computation time than projection-based methods.

Abstract:
Zero-shot object navigation (ZSON) in large-scale outdoor environments faces many challenges; we specifically address a coupled one: long-range targets that reduce to tiny projections and intermittent visibility due to partial or complete occlusion. We present a unified, lightweight closed-loop system built on an aligned multi-scale image tile hierarchy. Through hierarchical target-saliency fusion, it summarizes localized semantic contrast into a stable coarse-layer regional saliency that provides the target direction and indicates target visibility. This regional saliency supports visibility-aware heading maintenance through keyframe memory, saliency-weighted fusion of historical headings, and active search during temporary invisibility. The system avoids whole-image rescaling, enables deterministic bottom-up aggregation, supports zero-shot navigation, and runs efficiently on a mobile robot. Across simulation and real-world outdoor trials, the system detects semantic targets beyond 150 m, maintains a correct heading through visibility changes with 82.6% probability, and improves overall task success by 17.5% compared with the SOTA methods, demonstrating robust ZSON toward distant and intermittently observable targets.

Abstract:
This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as an explicit motion representation. FlowDreamer first predicts 3D scene flow from past frame and action conditions with a U-Net, and a diffusion model then predicts the future frame using the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer achieves better performance compared to other baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.

Abstract:
Non-prehensile manipulation of diverse objects remains a core challenge in robotics, driven by unknown physical properties and the complexity of contact-rich interactions. Recent advances in contact-implicit model predictive control (CI-MPC), with contact reasoning embedded directly in the trajectory optimization, have shown promise in tackling the task efficiently and robustly. However, demonstrations have been limited to narrowly curated examples. In this work, we showcase the broader capabilities of CI-MPC through precise planar pushing tasks over a wide range of object geometries, including multi-object domains. These scenarios demand reasoning over numerous inter-object and object-environment contacts to strategically manipulate and de-clutter the environment, which was intractable for prior CI-MPC methods. To achieve this, we introduce Consensus Complementarity Control Plus (C3+), an enhanced CI-MPC algorithm integrated into a complete pipeline spanning object scanning, mesh reconstruction, and hardware execution. Compared to its predecessor C3, C3+ achieves substantially faster solve times, enabling real-time performance even in multi-object pushing tasks. On hardware, our system achieves an overall 98% success rate across 33 objects, reaching pose goals within tight tolerances. The average time-to-goal is approximately 0.5, 1.6, 3.2, and 5.3 minutes for 1-, 2-, 3-, and 4-object tasks, respectively. Project page: https://dairlab.github.io/push-anything.

Abstract:
Zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires an agent to navigate unseen environments based on natural language instructions without any prior training. Current methods face a critical trade-off: either rely on environment-specific waypoint predictors that limit scene generalization, or underutilize the reasoning capabilities of large models during navigation. We introduce LaViRA, a simple yet effective zero-shot framework that addresses this dilemma by decomposing action into a coarse-to-fine hierarchy: Language Action for high-level planning, Vision Action for middle-level perceptual grounding, and Robot Action for low-level control. This modular decomposition allows us to leverage the distinct strengths of different scales of Multimodal Large Language Models (MLLMs) at each stage, creating a system that is powerful in its reasoning, grounding and practical control. LaViRA significantly outperforms existing state-of-the-art methods on the VLN-CE benchmark, demonstrating superior generalization capabilities in unseen environments, while maintaining transparency and efficiency for real-world deployment.

Abstract:
High-fidelity simulation is essential for robotics research, enabling safe and efficient testing of perception, control, and navigation algorithms. However, achieving both photorealistic rendering and accurate physics modeling remains a challenge. This paper presents a novel simulation framework, the Unreal Robotics Lab (URL), that integrates the advanced rendering capabilities of the Unreal Engine with MuJoCo's high-precision physics simulation. Our approach enables realistic robotic perception while maintaining accurate physical interactions, facilitating benchmarking and dataset generation for vision-based robotics applications. The system supports complex environmental effects, such as smoke, fire, and water dynamics, which are critical to evaluating robotic performance under adverse conditions. We benchmark visual navigation and SLAM methods within our framework, demonstrating its utility for testing real-world robustness in controlled yet diverse scenarios. By bridging the gap between physics accuracy and photorealistic rendering, our framework provides a powerful tool for advancing robotics research and sim-to-real transfer. Our open-source framework is available at https://unrealroboticslab.github.io.

Abstract:
Diffusion policies are a powerful paradigm for robot learning, but their training is often inefficient. A key reason is that networks must relearn fundamental spatial concepts, such as translations and rotations, from scratch for every new task. To alleviate this redundancy, we propose embedding geometric inductive biases directly into the network architecture using Projective Geometric Algebra (PGA). PGA provides a unified algebraic framework for representing geometric primitives and transformations, allowing neural networks to reason about spatial structure more effectively. In this paper, we introduce hPGA-DP, a novel hybrid diffusion policy that capitalizes on these benefits. Our architecture leverages the Projective Geometric Algebra Transformer (P-GATr) as a state encoder and action decoder, while employing established U-Net or Transformer-based modules for the core denoising process. Through extensive experiments and ablation studies in both simulated and real-world environments, we demonstrate that hPGA-DP significantly improves task performance and training efficiency. Notably, our hybrid approach achieves substantially faster convergence compared to both standard diffusion policies and architectures that rely solely on P-GATr.

Abstract:
Planner evaluation in closed-loop simulation often uses rule-based traffic agents, whose simplistic and passive behavior can hide planner deficiencies and bias rankings. Widely used IDM agents simply follow a lead vehicle and cannot react to vehicles in adjacent lanes, hindering tests of complex interaction capabilities. We address this issue by integrating the state-of-the-art learned traffic agent model SMART into nuPlan. Thus, we are the first to evaluate planners under more realistic conditions and quantify how conclusions shift when narrowing the sim-to-real gap. Our analysis covers 14 recent planners and established baselines and shows that IDM-based simulation overestimates planning performance: nearly all scores deteriorate. In contrast, many planners interact better than previously assumed and even improve in multi-lane, interaction-heavy scenarios like lane changes or turns. Methods trained in closed-loop demonstrate the best and most stable driving performance. However, when reaching their limits in augmented edge-case scenarios, all learned planners degrade abruptly, whereas rule-based planners maintain reasonable basic behavior. Based on our results, we suggest SMART-reactive simulation as a new standard closed-loop benchmark in nuPlan and release the SMART agents as a drop-in alternative to IDM. Code, models, and scripts will be released upon acceptance.

Abstract:
With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P3T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P3T consists of two components: 1) Point Prompter, which generates instance-aware point-level prompts for the input point cloud, and 2) Text Prompter, which inserts learnable prompts into the input text instead of hand-crafted ones. Since both prompters operate directly on input data, P3T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at https://github.com/gyjung975/P3T.

Abstract:
Choosing appropriate fabrics is critical for meeting functional and quality demands in robotic textile manufacturing, apparel production, and smart retail. We propose MLLM-Fabric, a robotic framework leveraging multimodal large language models (MLLMs) for fabric sorting and selection. Built on a multimodal robotic platform, the system is trained through supervised fine-tuning and explanation-guided distillation to rank fabric properties. We also release a dataset of 220 diverse fabrics, each with RGB images and synchronized visuotactile and pressure data. Experiments show that our Fabric-Llama-90B consistently outperforms pretrained vision-language baselines in both attribute ranking and selection reliability.

Abstract:
Reasoning Vision Language Action (VLA) models improve robotic instruction-following by generating step-by-step textual plans before low-level actions, an approach inspired by Chain-of-Thought (CoT) reasoning in language models. Yet even with a correct textual plan, the generated actions can still miss the intended outcomes in the plan, especially in out-of-distribution (OOD) scenarios. We formalize this phenomenon as a lack of embodied CoT faithfulness, and introduce a training-free, runtime policy steering method for reasoning-action alignment. Given a reasoning VLA's intermediate textual plan, our framework samples multiple candidate action sequences from the same model, predicts their outcomes via simulation, and uses a pre-trained Vision-Language Model (VLM) to select the sequence whose outcome best aligns with the VLA's own textual plan. Only executing action sequences that align with the textual reasoning turns our base VLA's natural action diversity from a source of error into a strength, boosting robustness to semantic and visual OOD perturbations and enabling novel behavior composition without costly re-training. We also contribute a reasoning-annotated extension of LIBERO-100, environment variations tailored for OOD evaluation, and demonstrate up to 15% performance gain over prior work on behavior composition tasks. The overall framework scales with compute (347ms at K = 10 samples) and data diversity. Project Website at: https://yilin-wu98.github.io/steering-reasoning-vla

Abstract:
We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.

Abstract:
Self-supervised learning has made substantial strides in image processing, while visual pre-training for autonomous driving is still in its infancy. Existing methods often focus on learning geometric scene information while neglecting texture or treating both aspects separately, hindering comprehensive scene understanding. In this context, we introduce GaussianPretrain, a novel pre-training paradigm that achieves a holistic understanding of the scene by uniformly integrating geometric and texture representations. Conceptualizing 3D Gaussian anchors as volumetric LiDAR points, our method learns a deepened understanding of scenes to enhance pre-training performance with detailed spatial structure and texture, while running 40.6% faster than the NeRF-based method UniPAD and using only 70% of the GPU memory. We demonstrate the effectiveness of GaussianPretrain across multiple 3D perception tasks, showing significant performance improvements, such as a 7.05% increase in NDS for 3D object detection, a 1.9% increase in mAP for HD map construction, and a 0.8% improvement in occupancy prediction. These significant gains highlight GaussianPretrain's theoretical innovation and strong practical potential, promoting visual pre-training development for autonomous driving. Source code is available at https://github.com/Public-BOTs/GaussianPretrain.

Abstract:
Reliable off-road navigation requires accurate estimation of traversable regions and robust perception under diverse terrain and sensing conditions. However, existing datasets lack both scalability and multi-modality, which limits progress in 3D traversability prediction. In this work, we introduce STONE, a large-scale multi-modal dataset for off-road navigation. STONE provides (1) trajectory-guided 3D traversability maps generated by a fully automated, annotation-free pipeline, and (2) comprehensive surround-view sensing with synchronized 128-channel LiDAR, six RGB cameras, and three 4D imaging radars. The dataset covers a wide range of environments and conditions, including day and night, grasslands, farmlands, construction sites, and lakes. Our auto-labeling pipeline reconstructs dense terrain surfaces from LiDAR scans, extracts geometric attributes such as slope, elevation, and roughness, and assigns traversability labels beyond the robot's trajectory using a Mahalanobis-distance-based criterion. This design enables scalable, geometry-aware ground-truth construction without manual annotation. Finally, we establish a benchmark for voxel-level 3D traversability prediction and provide strong baselines under both single-modal and multi-modal settings.
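
A Mahalanobis-distance-based labeling criterion of the kind mentioned above can be sketched as follows. The feature layout (slope, elevation, roughness per terrain cell), the threshold value, and the function names are hypothetical choices for illustration; the abstract does not specify how the dataset's pipeline is implemented.

```python
import numpy as np


def fit_reference(driven_features):
    """Mean and inverse covariance of features on cells the robot actually drove over."""
    mean = driven_features.mean(axis=0)
    cov = np.cov(driven_features, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # small ridge keeps the inverse well defined
    return mean, np.linalg.inv(cov)


def mahalanobis(features, mean, cov_inv):
    """Row-wise Mahalanobis distance of each feature vector to the reference."""
    diff = features - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))


def label_traversable(candidates, driven_features, threshold=3.0):
    """Mark a cell traversable when its geometry resembles the driven cells.

    The threshold of 3.0 is an arbitrary value chosen for this example.
    """
    mean, cov_inv = fit_reference(driven_features)
    return mahalanobis(candidates, mean, cov_inv) <= threshold


rng = np.random.default_rng(0)
# Synthetic (slope, elevation, roughness) samples standing in for driven cells.
driven = rng.normal([0.05, 0.0, 0.02], [0.02, 0.1, 0.01], size=(500, 3))
cells = np.array([[0.05, 0.0, 0.02], [0.80, 2.0, 0.50]])
print(label_traversable(cells, driven))
```

Because the distance is scaled by the covariance of the driven cells, attributes with naturally large spread (such as elevation) are penalized less than attributes that vary little along the trajectory.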

Abstract:
Skateboards offer a compact and efficient means of transportation as a type of personal mobility device. However, controlling them with legged robots poses several challenges for policy learning due to perception-driven interactions and multi-modal control objectives across distinct skateboarding phases. To address these challenges, we introduce Phase-Aware Policy Learning (PAPL), a reinforcement-learning framework tailored for skateboarding with quadruped robots. PAPL leverages the cyclic nature of skateboarding by integrating phase-conditioned Feature-wise Linear Modulation layers into actor-critic networks, enabling a unified policy that captures phase-dependent behaviors while sharing robot-specific knowledge across phases. Our evaluations in simulation validate command-tracking accuracy and conduct ablation studies quantifying each component's contribution. We also compare locomotion efficiency against leg and wheel-leg baselines and show the real-world transferability.
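
To make the idea of phase-conditioned Feature-wise Linear Modulation (FiLM) concrete, here is a minimal NumPy sketch in which a cyclic phase variable scales and shifts a feature vector. The sine/cosine phase encoding, the linear maps, and the class name are assumptions for illustration and do not describe the PAPL architecture itself.

```python
import numpy as np


def phase_encoding(phase):
    """Map a phase in [0, 1) to a point on the unit circle so 0 and 1 coincide."""
    angle = 2.0 * np.pi * phase
    return np.array([np.sin(angle), np.cos(angle)])


class FiLMLayer:
    """Feature-wise linear modulation: out = gamma(cond) * features + beta(cond)."""

    def __init__(self, cond_dim, feat_dim, rng):
        # Linear maps from the conditioning vector to per-feature scale and shift.
        self.w_gamma = rng.normal(scale=0.1, size=(cond_dim, feat_dim))
        self.w_beta = rng.normal(scale=0.1, size=(cond_dim, feat_dim))

    def __call__(self, features, cond):
        gamma = 1.0 + cond @ self.w_gamma  # centered at 1 so the layer starts near identity
        beta = cond @ self.w_beta
        return gamma * features + beta


layer = FiLMLayer(cond_dim=2, feat_dim=4, rng=np.random.default_rng(0))
hidden = np.ones(4)  # stand-in for shared actor or critic features
print(layer(hidden, phase_encoding(0.0)))
print(layer(hidden, phase_encoding(0.5)))
```

The shared features are computed once, and only the per-feature scale and shift depend on the phase, which is one way a single network can express phase-dependent behavior.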

Abstract:
We study view-invariant imitation learning by explicitly conditioning policies on camera extrinsics. Using Plücker embeddings of per-pixel rays, we show that conditioning on extrinsics significantly improves generalization across viewpoints for standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. To evaluate policy robustness under realistic viewpoint shifts, we introduce six manipulation tasks in robosuite and ManiSkill that pair "fixed" and "randomized" scene variants, decoupling background cues from camera pose. Our analysis reveals that policies without extrinsics often infer camera pose using visual cues from static backgrounds in fixed scenes, a shortcut that collapses when workspace geometry or camera placement shifts. Conditioning on extrinsics restores performance and yields robust RGB-only control without depth. We release the tasks, demonstrations, and code to facilitate reproducibility and future research.
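
A per-pixel Plücker ray embedding can be computed from camera intrinsics and extrinsics roughly as follows. This sketch assumes a pinhole camera, a 4x4 camera-to-world matrix, and the (direction, moment) ordering with moment = origin × direction; these conventions and the function name are assumptions, since the abstract does not state them.

```python
import numpy as np


def plucker_ray_map(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (d, o x d) for every pixel ray."""
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project through the intrinsics, then rotate into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    rotation, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ rotation.T
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

    # The moment encodes where the ray sits in space, independent of the point chosen on it.
    moment = np.cross(np.broadcast_to(origin, dirs.shape), dirs)
    return np.concatenate([dirs, moment], axis=-1)


K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
rays = plucker_ray_map(K, pose, height=48, width=64)
print(rays.shape)
```

The resulting six-channel map has the same spatial layout as the image, so it can be concatenated with RGB channels or patch features as an extra conditioning input.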

Abstract:
Verification and validation of autonomous driving (AD) systems and components is of increasing importance as such technology becomes more prevalent in the real world. Safety-critical scenario generation is a key approach to robustify AD policies through closed-loop training. However, existing approaches for scenario generation rely on simplistic objectives, resulting in overly-aggressive or non-reactive adversarial behaviors. To generate diverse adversarial yet realistic scenarios, we propose SEAL, a scenario perturbation approach that leverages learned objective functions and adversarial, human-like skills. SEAL-perturbed scenarios are more realistic than SOTA baselines, improving ego task success by more than 20% across real-world, in-distribution, and out-of-distribution scenarios. To facilitate future research, we release our code and tools: https://navars.xyz/seal/

Abstract:
Human Trajectory Forecasting (HTF) predicts future human movements from past trajectories and environmental context, with applications in Autonomous Driving, Smart Surveillance, and Human-Robot Interaction. While prior work has focused on accuracy, social interaction modeling, and diversity, little attention has been paid to uncertainty modeling, calibration, and forecasts from short observation periods, which are crucial for downstream tasks such as path planning and collision avoidance. We propose DD-MDN, an end-to-end probabilistic HTF model that combines high positional accuracy, calibrated uncertainty, and robustness to short observations. Using a few-shot denoising diffusion backbone and a dual mixture density network, our method learns self-calibrated residence areas and probability-ranked anchor paths, from which diverse trajectory hypotheses are derived, without predefined anchors or endpoints. Experiments on the ETH/UCY, SDD, inD, and IMPTC datasets demonstrate state-of-the-art accuracy, robustness at short observation intervals, and reliable uncertainty modeling. The code is available at: https://github.com/kav-institute/ddmdn.

Abstract:
This paper presents an approach for pretraining robotic manipulation Vision-Language-Action (VLA) models using a large corpus of unscripted real-life video recordings of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic V-L-A training data in terms of task granularity and labels. This is achieved by developing a fully automated, holistic human activity analysis approach for arbitrary human hand videos. This approach can generate atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We also design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Additionally, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We believe this work establishes a promising pathway for scaling up VLA pretraining.

Abstract:
Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1% and 12% respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.

Abstract:
Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features, achieving a high level of detail and guiding robots to find objects specified by open-vocabulary language queries. While the issue of scalability for such approaches has received some attention, another fundamental problem is that high-detail object mapping quickly becomes outdated, as objects are frequently moved. In this work, we develop a mapping and navigation system for object-goal navigation that, from the ground up, considers the possibilities that a queried object may have moved or may not be mapped at all. Instead of striving for high-fidelity mapping detail, we consider that the main purpose of a map is to provide environment grounding and context, which we combine with the semantic priors of LLMs to reason about object locations and deploy an active, online approach to navigate to the objects. Through simulated and real-world experiments we find that our approach tends to have higher retrieval success at shorter path lengths for static objects and by far outperforms prior approaches in cases of dynamic or unmapped object queries.

Abstract:
Wheeled-legged robots, which have wheels at their feet and achieve high mobility by coordinating wheel drive and leg drive, have been developed. These robots have been developed purely as platforms specialized for locomotion. Therefore, they do not have a means to repurpose their legs for roles other than locomotion, such as object manipulation or tool utilization. In this paper, we address the problem of how to draw out the potential task-execution capability of the legs by freeing them from the role of locomotion through external body support. To this end, we propose and develop a new robot, WiXus, which fuses a wheeled-legged mechanism with a wire-driven mechanism that utilizes the external environment. The developed WiXus demonstrates not only planar locomotion with wheeled-legged drive, but also three-dimensional mobility such as cliff climbing by coordinating wire-driven and wheeled-legged actuation. Furthermore, by suspending the body with wire-driven actuation, WiXus successfully repurposes its legs as arms to perform object manipulation (e.g., rescuing a dog (stuffed animal)) and tool utilization (e.g., harvesting an apple (mockup) with loppers). This study demonstrates that the approach of utilizing the environment with wire-driven actuation is a new design principle that extends the operational domain of wheeled-legged robots.

Abstract:
Accurate motion control in the face of disturbances within complex environments remains a major challenge in robotics. Classical model-based approaches often struggle with nonlinearities and unstructured disturbances, while reinforcement learning (RL)-based methods can be fragile when encountering unseen scenarios. In this paper, we propose a novel framework, Neural Internal Model Control (NeuralIMC), which integrates model-based control with RL-based control to enhance robustness. Our framework streamlines the predictive model by applying Newton-Euler equations for rigid-body dynamics, eliminating the need to capture complex high-dimensional nonlinearities. This internal model combines model-free RL algorithms with predictive error feedback. Such a design enables a closed-loop control structure to enhance the robustness and generalizability of the control system. We demonstrate the effectiveness of our framework on both quadrotors and quadrupedal robots, achieving superior performance compared to state-of-the-art methods. Furthermore, real-world deployment on a quadrotor with rope-suspended payloads highlights the framework's robustness in sim-to-real transfer. Our code is released at https://github.com/thu-uav/NeuralIMC.

Abstract:
We present PIRATR, an end-to-end 3D object detection framework for robotic use cases in point clouds. Extending PI3DETR, our method streamlines parametric 3D object detection by jointly estimating multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point cloud data. This formulation enables not only geometric localization but also the estimation of task-relevant properties for parametric objects, such as a gripper's opening, where the 3D model is adjusted according to simple, predefined rules. The architecture employs modular, class-specific heads, making it straightforward to extend to novel object types without re-designing the pipeline. We validate PIRATR on an automated forklift platform, focusing on three structurally and functionally diverse categories: crane grippers, loading platforms, and pallets. Trained entirely in a synthetic environment, PIRATR generalizes effectively to real outdoor LiDAR scans, achieving a detection mAP of 0.919 without additional fine-tuning. PIRATR establishes a new paradigm of pose-aware, parameterized perception. This bridges the gap between low-level geometric reasoning and actionable world models, paving the way for scalable, simulation-trained perception systems that can be deployed in dynamic robotic environments.

Abstract:
Robotic manipulation policies often fail to generalize because they must simultaneously learn where to attend, what actions to take, and how to execute them. We argue that high-level reasoning about where and what can be offloaded to vision-language models (VLMs), leaving policies to specialize in how to act. We present PEEK (Point-based End-effector and Entity Keying), which fine-tunes VLMs to predict a unified point-based intermediate representation: (1) end-effector paths specifying what actions to take, and (2) task-relevant masks indicating where to focus. These annotations are directly overlaid onto robot observations, making the representation policy-agnostic and transferable across architectures. To enable scalable training, we introduce an automatic annotation pipeline, generating labeled data across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, PEEK consistently boosts zero-shot generalization, including a 41.4× real-world improvement for a 3D policy trained only in simulation, and 2-3.5× gains for both large VLAs and small manipulation policies. By letting VLMs absorb semantic and visual complexity, PEEK equips manipulation policies with the minimal cues they need: where, what, and how.

Abstract:
Implicit representations have been widely applied in robotics for obstacle avoidance and path planning. In this paper, we explore the problem of constructing an implicit distance representation from a single image. Past methods for implicit surface reconstruction, such as NeuS and its variants, generally require a large set of multi-view images as input and long training times. In this work, we propose Fast Image-to-Neural Surface (FINS), a lightweight framework that can reconstruct high-fidelity surfaces and SDF fields based on a single or a small set of images. FINS integrates a multi-resolution hash grid encoder with lightweight geometry and color heads, making the training via an approximate second-order optimizer highly efficient and capable of converging within a few seconds. Additionally, we achieve the construction of a neural surface requiring only a single RGB image, by leveraging pre-trained foundation models to estimate the geometry inherent in the image. Our experiments demonstrate that under the same conditions, our method outperforms state-of-the-art baselines in both convergence speed and accuracy on surface reconstruction and SDF field estimation. Moreover, we demonstrate the applicability of FINS for robot surface-following tasks and show its scalability to a variety of benchmark datasets.

Abstract:
3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, despite its fine-grained scene understanding, its effectiveness is inherently constrained in single-vehicle setups by occlusions, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information, thereby enhancing the completeness and accuracy of predictions. Despite its potential, research on collaborative 3D semantic occupancy prediction is hindered by the lack of dedicated datasets. To bridge this gap, we design a high-resolution semantic voxel sensor in CARLA to produce dense and comprehensive annotations for V2X scenarios. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. Experimental results demonstrate the superior performance of our baseline enabled by vehicle collaboration, with increasing gains observed as the prediction range expands. Our code and data are available at https://github.com/tlab-wide/Co3SOP.

Abstract:
Reliable radar inertial odometry (RIO) requires mitigating IMU bias drift, a challenge that intensifies in subterranean environments due to extreme temperatures and gravity-induced accelerations. Cost-effective IMUs such as the Pixhawk, when paired with FMCW TI IWR6843AOP EVM radars, suffer from drift-induced degradation compounded by sparse, noisy, and flickering radar returns, making fusion less stable than LiDAR-based odometry. Yet LiDAR fails under smoke, dust, and aerosols, whereas FMCW radars remain compact, lightweight, cost-effective, and robust to these conditions. To address these challenges, we propose a two-stage MRIO framework that incorporates an IMU bias estimator for resilient localization and mapping in GPS-denied subterranean environments affected by smoke. In this framework, the radar's ego-velocity estimation is formulated as a least-squares problem and incorporated into an EKF for online IMU bias correction; the corrected IMU accelerations are then fused with heterogeneous measurements from multiple radars and the IMU to refine odometry. The proposed framework further supports radar-only mapping by exploiting the robot's estimated translational and rotational displacements. In subterranean field trials, MRIO delivers robust localization and mapping, outperforming single-stage EKF-RIO. It maintains accuracy across cost-efficient FMCW radar setups and different IMUs, remaining resilient with the Pixhawk as well as with higher-grade units such as VectorNav. The implementation will be provided as an open-source resource to the community: https://github.com/LTU-RAI/MRIO
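
As a rough illustration of the least-squares ego-velocity step mentioned above, the following sketch assumes all detections in a scan are static and that the measured radial speed of each detection equals the negative projection of the sensor velocity onto the unit direction to that detection. The function name, the sign convention, and the synthetic data are assumptions made for this example; they are not taken from the MRIO implementation.

```python
import numpy as np

def estimate_ego_velocity(directions, doppler_speeds):
    """Least-squares sensor velocity from one radar scan of static points.

    directions: (N, 3) unit vectors from the sensor to each detection.
    doppler_speeds: (N,) radial speeds (assumed negative when approaching).
    """
    directions = np.asarray(directions, dtype=float)
    doppler_speeds = np.asarray(doppler_speeds, dtype=float)
    # Assumed static-world model: v_r = -d^T v_ego, so solve (-D) v = v_r.
    velocity, *_ = np.linalg.lstsq(-directions, doppler_speeds, rcond=None)
    return velocity

# Synthetic, noise-free check with a hypothetical ground-truth velocity.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(50, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
true_v = np.array([1.0, -0.5, 0.2])
speeds = -dirs @ true_v
est_v = estimate_ego_velocity(dirs, speeds)
```

In practice radar returns are noisy and include moving objects, so an outlier-rejection step would normally precede or wrap this solve; that part is omitted here.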

Abstract:
Neural network-based visuomotor policies enable robots to perform manipulation tasks but remain susceptible to perceptual attacks. For example, conventional 2D adversarial patches are effective under fixed-camera setups, where appearance is relatively consistent; however, their efficacy often diminishes under dynamic viewpoints from moving cameras, such as wrist-mounted setups, due to perspective distortions. To proactively investigate potential vulnerabilities beyond 2D patches, this work proposes a viewpoint-consistent adversarial texture optimization method for 3D objects through differentiable rendering. As optimization strategies, we employ Expectation over Transformation (EOT) with a Coarse-to-Fine (C2F) curriculum, exploiting distance-dependent frequency characteristics to induce textures effective across varying camera-object distances. We further integrate saliency-guided perturbations to redirect policy attention and design a targeted loss that persistently drives robots toward adversarial objects. Our comprehensive experiments show that the proposed method is effective under various environmental conditions, while confirming its black-box transferability and real-world applicability.

Abstract:
Large-scale robot datasets have facilitated the learning of a wide range of robot manipulation skills, but these datasets remain difficult to collect and scale further, owing to the intractable amount of human time, effort, and cost required. Simulation and synthetic data generation have proven to be an effective alternative to fuel this need for data, especially with the advent of recent work showing that such synthetic datasets can dramatically reduce real-world data requirements and facilitate generalization to novel scenarios unseen in real-world demonstrations. However, this paradigm has been limited to rigid-body tasks, which are easy to simulate. Deformable object manipulation encompasses a large portion of real-world manipulation and remains a crucial gap to address towards increasing adoption of the synthetic simulation data paradigm. In this paper, we introduce SoftMimicGen, an automated data generation pipeline for deformable object manipulation tasks. We introduce a suite of high-fidelity simulation environments that encompasses a wide range of deformable objects (stuffed animal, rope, tissue, towel) and manipulation behaviors (high-precision threading, dynamic whipping, folding, pick-and-place), across four robot embodiments: a single-arm manipulator, bimanual arms, a humanoid, and a surgical robot. We apply SoftMimicGen to generate datasets across the task suite, train high-performing policies from the data, and systematically analyze the data generation system. Project website: softmimicgen.github.io

Abstract:
Dexterous manipulation requires careful reasoning over extrinsic contacts. The prevalence of deforming tools in human environments, the use of deformable sensors, and the increasing number of soft robots yield a need for approaches that enable dexterous manipulation through contact reasoning where not all contacts are well characterized by classical rigid body contact models. Here, we consider the case of a deforming tool dexterously manipulating a rigid object. We propose a hybrid learning and first-principles approach to the modeling of simultaneous motion and force transfer of tools and objects. The learned module is responsible for jointly estimating the rigid object's motion and the deformable tool's imparted contact forces. We then propose a Contact Quadratic Program to recover forces between the environment and object subject to quasi-static equilibrium and Coulomb friction. The result is a system capable of modeling both intrinsic and extrinsic motions, contacts, and forces during dexterous deformable manipulation. We train our method in simulation and show that it outperforms baselines under varying block geometries and physical properties during pushing and pivoting manipulations, and we demonstrate transfer to real-world interactions.

Abstract:
Imitation learning has shown immense promise for robotic manipulation, yet its practical deployment is fundamentally constrained by data scarcity. Despite prior work on collecting large-scale datasets, there still remains a significant gap to robust spatial generalization. We identify a key limitation: individual trajectories, regardless of their length, are typically collected from a single, static spatial configuration of the environment. This includes fixed object and target spatial positions as well as unchanging camera viewpoints, which significantly restricts the diversity of spatial information available for learning. To address this critical bottleneck in data efficiency, we propose MOtion-Based Variability Enhancement (MOVE), a simple yet effective data collection paradigm that enables the acquisition of richer spatial information from dynamic demonstrations. Our core contribution is an augmentation strategy that injects motion into any movable objects within the environment for each demonstration. This process implicitly generates a dense and diverse set of spatial configurations within a single trajectory. We conduct extensive experiments in both simulation and real-world environments to validate our approach. For example, in simulation tasks requiring strong spatial generalization, MOVE achieves an average success rate of 39.1%, a 76.1% relative improvement over the static data collection paradigm (22.2%), and yields up to 2-5× gains in data efficiency on certain tasks.
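
The relative improvement quoted above can be checked from the two reported success rates. The snippet below is only an arithmetic check; the helper name is made up for this example.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Success rates reported in the abstract: 39.1% (MOVE) vs. 22.2% (static).
gain = relative_improvement(39.1, 22.2)
print(f"{gain:.1f}%")  # prints "76.1%"
```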

Abstract:
We present AutoFocus-IL, a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Saliency regularization has emerged as a promising way to achieve this, but existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Our findings highlight that VLM-driven saliency provides a scalable, annotation-free path toward robust imitation learning in robotics. Particularly, our experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. The supplementary materials, including code, datasets, and trained policy videos, are publicly available at https://AutoFocus-IL.github.io/.

Abstract:
Learning a universal manipulation policy encompassing doors with diverse categories, geometries, and mechanisms is crucial for future embodied agents to work effectively in complex and broad real-world scenarios. Due to limited datasets and unrealistic simulation environments, previous studies fail to achieve good performance across various doors. In this work, we build a novel door manipulation environment reflecting different realistic door manipulation mechanisms, and further equip this environment with a large-scale door dataset covering 6 door categories with hundreds of door bodies and handles, making up thousands of different door instances. To learn a universal policy over diverse doors, we propose a novel framework disentangling the whole manipulation process into three stages, and integrating them through conditional training. Extensive experiments validate the effectiveness of our designs and demonstrate our framework's strong performance in simulation and the real world.

Abstract:
Navigating unknown environments to find a target object is a significant challenge. While semantic information is crucial for navigation, relying solely on it for decision-making may not always be efficient, especially in environments with weak semantic cues. Additionally, many methods are susceptible to misdetections, especially in environments with visually similar objects. To address these limitations, we propose ApexNav, a zero-shot object navigation framework that is both more efficient and reliable. For efficiency, ApexNav adaptively utilizes semantic information by analyzing its distribution in the environment, guiding exploration through semantic reasoning when cues are strong, and switching to geometry-based exploration when they are weak. For reliability, we propose a target-centric semantic fusion method that preserves long-term memory of the target object and similar objects, reducing false detections and minimizing task failures. We evaluate ApexNav on the HM3Dv1, HM3Dv2, and MP3D datasets, where it outperforms state-of-the-art methods in both SR and SPL metrics. Comprehensive ablation studies further demonstrate the effectiveness of each module. Furthermore, real-world experiments validate the practicality of ApexNav in physical environments.

Abstract:
We introduce DualMap, an online open-vocabulary mapping system that enables robots to understand and navigate dynamically changing environments through natural language queries. Designed for efficient semantic mapping and adaptability to changing environments, DualMap meets the essential requirements for real-world robot navigation applications. Our proposed hybrid segmentation frontend and object-level status check eliminate the costly 3D object merging required by prior methods, enabling efficient online scene mapping. The dual-map representation combines a global abstract map for high-level candidate selection with a local concrete map for precise goal-reaching, effectively managing and updating dynamic changes in the environment. Through extensive experiments in both simulation and real-world scenarios, we demonstrate state-of-the-art performance in 3D open-vocabulary segmentation, efficient scene mapping, and online language-guided navigation. Project page: https://eku127.github.io/DualMap/

Abstract:
Neural implicit surface reconstruction with signed distance function has made significant progress, but recovering fine details such as thin structures and complex geometries remains challenging due to unreliable or noisy geometric priors. Existing approaches rely on implicit uncertainty that arises during optimization to filter these priors, which is indirect and inefficient, and masking supervision in high-uncertainty regions further leads to under-constrained optimization. To address these issues, we propose GPU-SDF, a neural implicit framework for indoor surface reconstruction that leverages geometric prior uncertainty and complementary constraints. We introduce a self-supervised module that explicitly estimates prior uncertainty without auxiliary networks. Based on this estimation, we design an uncertainty-guided loss that modulates prior influence rather than discarding it, thereby retaining weak but informative cues. To address regions with high prior uncertainty, GPU-SDF further incorporates two complementary constraints: an edge distance field that strengthens boundary supervision and a multi-view consistency regularization that enforces geometric coherence. Extensive experiments confirm that GPU-SDF improves the reconstruction of fine details and serves as a plug-and-play enhancement for existing frameworks. The source code will be released upon acceptance.

Abstract:
As the prediction horizon increases, predicting the future evolution of traffic scenes becomes increasingly difficult due to the multi-modal nature of agent motion. Most state-of-the-art (SotA) prediction models primarily focus on forecasting the most likely future. However, for the safe operation of autonomous vehicles, it is equally important to cover the distribution of plausible motion alternatives. To address this, we introduce EP-Diffuser, a novel parameter-efficient diffusion-based generative model designed to capture the distribution of possible traffic scene evolutions. Conditioned on road layout and agent history, our model acts as a predictor and generates diverse, plausible scene continuations. We benchmark EP-Diffuser against two SotA models in terms of plausibility, diversity, and accuracy of predictions on the Argoverse 2 dataset. Despite its significantly smaller model size, our approach achieves both highly plausible and diverse traffic scene predictions with comparable accuracy. We further evaluate model generalization in an out-of-distribution (OoD) test setting using the Waymo Open dataset and show superior robustness of our approach. The code and model checkpoints will be made publicly available to ensure reproducibility.

Abstract:
As learning-based methods for legged robots rapidly grow in popularity, it is important that we can provide safety assurances efficiently across different controllers and environments. Existing works either rely on a priori knowledge of the environment and safety constraints to ensure system safety or provide assurances for a specific locomotion policy. To address these limitations, we propose an observation-conditioned reachability-based (OCR) safety-filter framework. Our key idea is to use an OCR value network (OCR-VN) that predicts the optimal control-theoretic safety value function for new failure regions and dynamic uncertainty during deployment time. Specifically, the OCR-VN facilitates rapid safety adaptation through two key components: a LiDAR-based input that allows the dynamic construction of safe regions in light of new obstacles and a disturbance estimation module that accounts for dynamic uncertainty in the wild. The predicted safety value function is used to construct an adaptive safety filter that overrides the nominal quadruped controller when necessary to maintain safety. Through simulation studies and hardware experiments on a Unitree Go1 quadruped in agile planar navigation tasks, we demonstrate that the proposed framework can automatically safeguard a wide range of hierarchical quadruped controllers, adapts to novel environments, and is robust to unmodeled dynamics without a priori access to the controllers or environments - hence, "One Filter to Deploy Them All."
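
The override behavior described above can be pictured as a simple switching rule. The sketch below assumes a scalar predicted safety value where larger means safer and a threshold of zero; the function name, arguments, and switching convention are illustrative assumptions, not the paper's implementation.

```python
def safety_filter(safety_value, nominal_action, safe_action, threshold=0.0):
    """Pass the nominal action through unless the predicted safety value
    is at or below the threshold, in which case use the safe action.

    Assumed convention: larger safety_value means safer.
    """
    if safety_value <= threshold:
        return safe_action
    return nominal_action

# Hypothetical planar velocity commands (vx, vy, yaw rate).
nominal = (1.0, 0.0, 0.2)
fallback = (0.0, 0.0, 0.0)
kept = safety_filter(0.8, nominal, fallback)        # far from failure
overridden = safety_filter(-0.1, nominal, fallback) # predicted unsafe
```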

Abstract:
In this work we propose a novel joint training method for Visual Place Recognition (VPR), which simultaneously learns a global descriptor and a pair classifier for re-ranking. The pair classifier can predict whether a given pair of images are from the same place or not. The network only comprises Vision Transformer components for both the encoder and the pair classifier, and both components are trained using their respective class tokens. In existing VPR methods, typically the network is initialized using pre-trained weights from a generic image dataset such as ImageNet. In this work we propose an alternative pre-training strategy, by using Siamese Masked Image Modelling as a pre-training task. We propose a Place-aware image sampling procedure from a collection of large VPR datasets for pre-training our model, to learn visual features tuned specifically for VPR. By re-using the Masked Image Modelling encoder and decoder weights in the second stage of training, Pair-VPR can achieve state-of-the-art VPR performance across five benchmark datasets with a ViT-B encoder, along with further improvements in localization recall with larger encoders.

Abstract:
Autonomous agents capable of diverse object manipulations should be able to acquire a wide range of manipulation skills with high reusability. Although advances in deep learning have made it increasingly feasible to replicate the dexterity of human teleoperation in robots, generalizing these acquired skills to previously unseen scenarios remains a significant challenge. In this study, we propose a novel algorithm, Gaze-based Bottleneck-aware Robot Manipulation (GazeBot), which enables high reusability of learned motions without sacrificing dexterity or reactivity. By leveraging gaze information and motion bottlenecks, both crucial features for object manipulation, GazeBot achieves high success rates compared with state-of-the-art imitation learning methods, particularly when the object positions and end-effector poses differ from those in the provided demonstrations. Furthermore, the training process of GazeBot is entirely data-driven once a demonstration dataset with gaze data is provided.

Abstract:
In the evolving field of robotics, the challenge of Object Navigation (ON) in household environments has attracted significant interest. Existing ON benchmarks typically place objects in locations guided by general scene priors, without accounting for the specific placement habits of individual users. This omission limits the adaptability of navigation agents in personalized household environments. To address this, we introduce User-centric Object Navigation (UcON), a new benchmark that incorporates user-specific object placement habits, referred to as user habits. This benchmark requires agents to leverage these user habits for more informed decision-making during navigation. UcON encompasses approximately 22,600 user habits across 489 object categories. To the best of our knowledge, UcON is the first object navigation benchmark that takes user habits into account and covers the widest range of target object categories. Additionally, we propose a habit retrieval module to extract and utilize habits related to target objects, enabling agents to infer their likely locations more effectively. Experimental results demonstrate that while current state-of-the-art ON methods struggle with UcON's challenges, integrating user habits significantly improves the success rate in locating objects.

Abstract:
We address robust and resilient object insertion using a passively compliant soft wrist that permits large deformations and safely absorbs contacts, without high-frequency control or force sensing. To improve robustness, we structure the task as compliance-enabled contact formations: a sequence of contact states that progressively constrain specific degrees of freedom. While this segmentation mitigates moderate uncertainty, failures still occur under severe pose errors or environmental variations (e.g., friction changes, peg geometry), which traditionally require retuning goals or retraining controllers. To achieve both robustness and resilience, we therefore integrate compliance-enabled failure recovery into the contact-formation framework. Our key insight is that wrist compliance permits safe, repeated recovery attempts. A pre-trained vision-language model (VLM) assesses each skill execution from terminal poses and images, identifies failure modes, and proposes recovery actions by selecting skills and updating goals. In simulation, our method achieved an 83% success rate, recovering from failures induced by randomized conditions, including grasp misalignments up to 5 degrees, hole-pose errors up to 20 mm, fivefold increases in friction, and previously unseen square/rectangular pegs. We further validate the approach on a real robot.

Abstract:
This letter extends the exactly sparse Gaussian variational inference (ESGVI) algorithm for state estimation in two complementary directions. First, ESGVI is generalized to operate on matrix Lie groups, enabling the estimation of states with orientation components while respecting the underlying group structure. Second, factors are introduced to accommodate heavy-tailed and skewed noise distributions, as commonly encountered in ultra-wideband (UWB) localization due to non-line-of-sight (NLOS) and multipath effects. Both extensions are shown to integrate naturally within the ESGVI framework while preserving its sparse and derivative-free structure. The proposed approach is validated in a UWB localization experiment with NLOS-rich measurements, demonstrating improved accuracy and comparable consistency. Finally, a Python implementation within a factor-graph-based estimation framework is made open-source to support broader research use.

Abstract:
The increasing use of robots in unstructured environments necessitates the development of effective perception and navigation strategies to enable field robots to successfully perform their tasks. In particular, it is key for such robots to understand where in their environment they can and cannot travel, a task known as traversability estimation. However, existing geometric approaches to traversability estimation may fail to capture nuanced representations of traversability, whereas vision-based approaches typically either involve manually annotating a large number of images or require robot experience. In addition, existing methods can struggle to address domain shifts as they typically do not learn during deployment. To this end, we propose a human-in-the-loop (HiL) method for traversability estimation that prompts a human for annotations as needed. Our method uses a foundation model to enable rapid learning on new annotations and to provide accurate predictions even when trained on a small number of quickly-provided HiL annotations. We extensively validate our method in simulation and on real-world data, and demonstrate that it can provide state-of-the-art traversability prediction performance.

Abstract:
Low-light conditions challenge vision-centric perception systems for autonomous driving in dark environments. In this paper, we propose a new benchmark dataset, DarkDriving, to investigate low-light enhancement for autonomous driving. Existing real-world low-light enhancement benchmark datasets can only be collected by controlling exposure in small-range, static scenes. The dark images in current nighttime driving datasets lack precisely aligned daytime counterparts. The extreme difficulty of collecting a real-world day-night aligned dataset in dynamic driving scenes has significantly limited research in this area. Using a proposed automatic day-night Trajectory Tracking based Pose Matching (TTPM) method in a large real-world closed driving test field (area: 69 acres), we collect the first real-world day-night aligned dataset for autonomous driving in dark environments. The DarkDriving dataset has 11,408 day and night image pairs precisely aligned in location and spatial content, with an alignment error of only several centimeters. For each pair, we also manually label 2D object bounding boxes. DarkDriving introduces four perception-related tasks: low-light enhancement, generalized low-light enhancement, and low-light enhancement for 2D detection and for 3D detection in autonomous driving in dark environments. The experimental results show that DarkDriving provides a comprehensive benchmark for evaluating low-light enhancement for autonomous driving, and that it also generalizes to enhancing dark images and improving detection in other low-light driving environments, such as those in nuScenes.

Abstract:
Safety remains one of the most critical challenges in autonomous driving systems. In recent years, end-to-end driving has shown great promise in advancing vehicle autonomy in a scalable manner. However, existing approaches often face safety risks due to the lack of explicit behavior constraints. To address this issue, we uncover a new paradigm by introducing the corridor as the intermediate representation. Widely adopted in robotics planning, corridors represent spatio-temporal obstacle-free zones for the vehicle to traverse. To ensure accurate corridor prediction in diverse traffic scenarios, we develop a comprehensive learning pipeline including data annotation, architecture refinement, and loss formulation. The predicted corridor is further integrated as the constraint in a trajectory optimization process. By extending the differentiability of the optimization, we enable the optimized trajectory to be seamlessly trained within the end-to-end learning framework, improving both safety and interpretability. Experimental results on the nuScenes dataset demonstrate state-of-the-art performance of our approach, showing a 66.7% reduction in collisions with agents and a 46.5% reduction with curbs, significantly enhancing the safety of end-to-end driving. Additionally, incorporating the corridor contributes to higher success rates in closed-loop evaluations.

Abstract:
Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of an initial action via Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the initial action and the Gaussian perception, both generated by the GAF, to obtain more precise actions. Extensive experiments demonstrate significant improvements, with GAF achieving gains of +11.5385 dB PSNR, +0.3864 SSIM, and -0.5574 LPIPS in reconstruction quality, while raising the average success rate in robotic manipulation tasks by +7.3% over state-of-the-art methods.

Abstract:
Human vision is a highly active process driven by gaze, which directs attention to task-relevant regions through foveation, dramatically reducing visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance efficiency and robustness. We develop GIAVA (Gaze Integrated Active-Vision ALOHA), a robot vision system that emulates human head and neck movement, and gaze adjustment for foveated processing. Extending the AV-ALOHA robot platform, we introduce a framework for simultaneously collecting eye-tracking, perspective control, and robot manipulation demonstration data from a human operator. We also open-source a simulation benchmark and dataset for training robot policies that incorporate human gaze. Inspired by recent work in foveated image segmentation and given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation. For this purpose, we explore two approaches to gaze estimation: The first is a two-stage model that predicts gaze independently to guide foveation and subsequently action. The second integrates gaze into the action space, allowing the policy to jointly estimate gaze and actions end-to-end. Our results show that our method for foveated robot vision drastically reduces computational overhead, and enhances robustness to background distractors. Notably, on certain high-precision tasks, foveated vision also improves performance, as reflected in higher success rates. Together, these findings suggest that human-inspired foveated visual processing offers untapped potential and should be further considered as a useful inductive bias in robotic vision systems. https://soltanilara.github.io/giava
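
To make the token savings from foveated patch tokenization concrete, the sketch below counts tokens for a simplified two-level scheme: the image is split into coarse patches, and only coarse patches whose centers lie within a radius of the gaze point are subdivided into fine patches. The two-level layout, the function name, and all sizes are assumptions for illustration; the tokenization scheme used by GIAVA may differ.

```python
import numpy as np

def foveated_token_count(image_size, coarse_patch, fine_patch, gaze_xy, fovea_radius):
    """Count tokens when only coarse patches near the gaze are subdivided."""
    assert image_size % coarse_patch == 0 and coarse_patch % fine_patch == 0
    n = image_size // coarse_patch
    # Pixel coordinates of coarse-patch centers along one axis.
    centers = (np.arange(n) + 0.5) * coarse_patch
    cx, cy = np.meshgrid(centers, centers)
    in_fovea = np.hypot(cx - gaze_xy[0], cy - gaze_xy[1]) <= fovea_radius
    fine_per_coarse = (coarse_patch // fine_patch) ** 2
    # Foveal cells contribute fine tokens; peripheral cells one token each.
    return int(in_fovea.sum() * fine_per_coarse + (~in_fovea).sum())

# Hypothetical 224x224 image: uniform 16-pixel patches vs. a foveated layout.
uniform_tokens = (224 // 16) ** 2
foveated_tokens = foveated_token_count(
    224, coarse_patch=32, fine_patch=16, gaze_xy=(112, 112), fovea_radius=40
)
```

With these example settings the foveated layout keeps full resolution only around the gaze point and uses roughly a third of the tokens of the uniform grid.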

Abstract:
Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can be unified with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end only with RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% IoU_50 and 54.1% under the 10°10cm metric, surpassing all prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on our project page: https://mikigom.github.io/YOPO-project-page/

Abstract:
Autonomous driving systems require a comprehensive understanding of the environment, achieved by extracting visual features essential for perception, planning, and control. However, models trained solely on single-task objectives or generic datasets often lack the contextual information needed for robust performance in complex driving scenarios. In this work, we present a unified encoder trained across a diverse set of computer vision tasks essential for urban driving, including depth estimation, pose estimation, 3D scene flow estimation, and semantic, instance, panoptic, and motion segmentation. This single-encoder approach not only integrates these complementary visual cues, inspired by the diversity of visual cues used in human driving perception, but also enables a compact and inference-efficient model that embeds a rich, navigation-relevant latent space. Indeed, the unified encoder learns to embed multi-task knowledge into a shared representation, allowing for better downstream task adaptation, particularly for steering estimation. To ensure efficient learning across tasks within a unified encoder, we propose a multi-scale pose decoder and employ knowledge distillation from a multi-backbone teacher model. Our experiments demonstrate that (1) the unified encoder achieves strong generalization across all visual tasks, comparable to state-of-the-art dedicated models, and (2) its frozen latent representations significantly outperform both fine-tuned models and ImageNet-pretrained baselines for steering estimation. These results underscore how multi-task feature learning offers an efficient and context-rich foundation for autonomous driving systems.

Abstract:
Diffusion policies have emerged as a powerful framework for robotic visuomotor control, yet they often lack the robustness to recover from subtask failures in long-horizon, multi-stage tasks and their learned representations of observations are often difficult to interpret. In this work, we propose the Mixture of Experts-Enhanced Diffusion Policy (MoE-DP), where the core idea is to insert a Mixture of Experts (MoE) layer between the visual encoder and the diffusion model. This layer decomposes the policy's knowledge into a set of specialized experts, which are dynamically activated to handle different phases of a task. We demonstrate through extensive experiments that MoE-DP exhibits a strong capability to recover from disturbances, significantly outperforming standard baselines in robustness. On a suite of 6 long-horizon simulation tasks, this leads to a 36% average relative improvement in success rate under disturbed conditions. This enhanced robustness is further validated in the real world, where MoE-DP also shows significant performance gains. We further show that MoE-DP learns an interpretable skill decomposition, where distinct experts correspond to semantic task primitives (e.g., approaching, grasping). This learned structure can be leveraged for inference-time control, allowing for the rearrangement of subtasks without any re-training.

Abstract:
We introduce IndustryShapes, a new RGB-D benchmark dataset of industrial tools and components, designed for both instance-level and novel object 6D pose estimation approaches. The dataset provides a realistic and application-relevant testbed for benchmarking these methods in the context of industrial robotics, bridging the gap between lab-based research and deployment in real-world manufacturing scenarios. Unlike many previous datasets that focus on household or consumer products, use synthetic, clean tabletop scenes, or contain objects captured solely in controlled lab environments, IndustryShapes introduces five new object types with challenging properties, captured in realistic industrial assembly settings. The dataset spans scenes of varying complexity, from simple to more challenging, with single and multiple objects, including scenes with multiple instances of the same object. It is organized in two parts: the classic set and the extended set. The classic set includes a total of 4.6k images and 6k annotated poses. The extended set introduces additional data modalities to support the evaluation of model-free and sequence-based approaches. To the best of our knowledge, IndustryShapes is the first dataset to offer RGB-D static onboarding sequences. We further evaluate a representative set of state-of-the-art methods for instance-based and novel object 6D pose estimation, as well as object detection and segmentation, on the dataset, showing that there is room for improvement in this domain. The dataset page can be found at https://pose-lab.github.io/IndustryShapes

Abstract:
We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: it generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9% improvement in task success rate. Meanwhile, we also investigate the training-aware token pruning method LightVLA with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, it spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems. Project site: https://liauto-research.github.io/LightVLA.
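
The Gumbel-softmax step mentioned above can be sketched as follows. This is a forward-pass-only illustration in plain Python under stated assumptions: the function names, the per-query argmax rule, and the "keep the union of selected tokens" policy are hypothetical and not taken from LightVLA's implementation, and a real training setup would use an autodiff framework so that gradients flow through the relaxed sample.

```python
import math
import random


def gumbel_softmax(scores, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over `scores` (Gumbel-softmax).

    Each score is perturbed with Gumbel(0, 1) noise, divided by the
    temperature `tau`, and passed through a softmax. Lower temperatures
    push the output closer to a one-hot vector.
    """
    noisy = []
    for s in scores:
        u = max(rng.random(), 1e-12)      # keep log(u) finite
        g = -math.log(-math.log(u))       # Gumbel(0, 1) sample
        noisy.append((s + g) / tau)
    m = max(noisy)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(query_scores, tau=1.0, rng=random):
    """Hypothetical selection rule: each query row picks its top token.

    `query_scores[q][t]` is the importance of visual token t for query q.
    The kept set is the union of the tokens picked by all queries.
    """
    kept = set()
    for row in query_scores:
        probs = gumbel_softmax(row, tau, rng)
        kept.add(max(range(len(probs)), key=probs.__getitem__))
    return sorted(kept)


rng = random.Random(0)
scores = [[100.0, 0.0, 0.0], [0.0, 0.0, 100.0], [100.0, 0.0, 0.0]]
print(select_tokens(scores, rng=rng))
```

Because several queries may pick the same token, the number of kept tokens adapts to the input instead of being fixed by a hand-set pruning ratio, which is one plausible reading of "no heuristic magic numbers".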

Abstract:
Differential drive mobile manipulators combine the mobility of wheeled bases with the manipulation capability of multi-joint arms, enabling versatile applications but posing considerable challenges for trajectory planning due to their high-dimensional state space and nonholonomic constraints. This paper introduces TopAY, an optimization-based planning framework designed for efficient and safe trajectory generation for differential drive mobile manipulators. The framework employs a hierarchical initial value acquisition strategy, including topological path search for the base and parallel sampling for the manipulator. A polynomial trajectory representation with arc length-yaw parameterization is also proposed to reduce optimization complexity while preserving dynamic feasibility. Extensive simulation and real-world experiments validate that TopAY achieves higher planning efficiency and success rates than state-of-the-art methods in dense and complex scenarios. The source code is released at https://github.com/TopAY-Planner/TopAY.

Abstract:
Imagine the future construction site, hospital, or office with dozens of robots bought from different manufacturers. How can we enable these different systems to effectively move in a shared environment, given that each robot may have its own independent motion planning system? This work shows how we can get efficient collision-free movements between algorithmically heterogeneous agents by using Conflict-Based Search (Sharon et al. 2015) as a protocol. At its core, the CBS Protocol requires one specific single-agent motion planning API: finding a collision-free path that satisfies certain space-time constraints. Given such an API, CBS uses a central planner to find collision-free paths, independent of how the API is implemented. We demonstrate how this protocol enables multi-agent motion planning for a heterogeneous team of agents completing independent tasks with a variety of single-agent planners including: Heuristic Search (e.g., A*), Sampling Based Search (e.g., RRT), Optimization (e.g., Direct Collocation), Diffusion, and Reinforcement Learning.
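
The single-agent API described above can be sketched as an interface plus one toy implementation. The names `SingleAgentPlanner`, `GridBFSPlanner`, and the `(cell, timestep)` constraint encoding are assumptions made for illustration; the abstract does not define the protocol's actual signature. Any planner (search, sampling, optimization, learned) could sit behind the same `plan` method.

```python
from collections import deque
from typing import List, Optional, Protocol, Set, Tuple

Cell = Tuple[int, int]
Constraint = Tuple[Cell, int]  # (cell, timestep) the agent must not occupy


class SingleAgentPlanner(Protocol):
    """Hypothetical API a central CBS planner would call on each agent."""

    def plan(self, start: Cell, goal: Cell,
             constraints: Set[Constraint]) -> Optional[List[Cell]]:
        ...


class GridBFSPlanner:
    """Toy space-time breadth-first planner on a 4-connected grid."""

    def __init__(self, width: int, height: int, horizon: int = 50):
        self.width, self.height, self.horizon = width, height, horizon

    def plan(self, start, goal, constraints):
        if (start, 0) in constraints:
            return None
        queue = deque([(start, 0, [start])])
        seen = {(start, 0)}
        while queue:
            cell, t, path = queue.popleft()
            if cell == goal:
                return path  # toy assumption: the agent vanishes on arrival
            if t >= self.horizon:
                continue
            x, y = cell
            # Waiting in place is allowed, so a constraint can be dodged in time.
            for nxt in ((x, y), (x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                inside = 0 <= nxt[0] < self.width and 0 <= nxt[1] < self.height
                state = (nxt, t + 1)
                if inside and state not in constraints and state not in seen:
                    seen.add(state)
                    queue.append((nxt, t + 1, path + [nxt]))
        return None


planner: SingleAgentPlanner = GridBFSPlanner(width=3, height=1)
print(planner.plan((0, 0), (2, 0), set()))
print(planner.plan((0, 0), (2, 0), {((1, 0), 1)}))
```

In the second call the constraint forbids cell (1, 0) at timestep 1, so the toy planner waits one step before moving. A central CBS planner would add such constraints whenever two agents' returned paths collide and then ask the affected agent to replan.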

Abstract:
Elasticity is one of the representative parameters that reflect the mechanical properties of soft materials. Detecting the underlying elasticity distribution, known as elastography, is a key step for understanding and interacting with objects. Existing solutions for capturing the interior elasticity distribution typically rely on expensive apparatus. In this work, the dense tactile signal captured by a high-resolution vision-based tactile sensor is introduced as a new modality for reconstructing the 3D elasticity distribution. We propose a model-based method that exploits the tactile maps from active pressing trials for the elastography task. The interior elasticity distribution of non-rigid objects is reconstructed from an inverse physics model. We analyze the credibility of the estimated elasticity distribution obtained from our method and discuss varying design factors. We evaluate our method on a set of synthesized 3D models and physical models in robot-assisted scenes. The experimental results demonstrate the efficacy of our approach in perceiving elasticity distribution.

Abstract:
We present a framework for generating convex approximations of complex contact models, incorporating experimentally validated models like Hunt & Crossley coupled with Coulomb's law of friction alongside the principle of maximum dissipation. Our approach is robust across a wide range of stiffness values, making it suitable for both compliant surfaces and rigid approximations. We evaluate these approximations across a wide variety of test cases, detailing properties and limitations. We implement a fully differentiable solution in the open-source robotics toolkit, Drake. Our novel hybrid approach enables computation of gradients for complex geometric models while reusing factorizations from contact resolution. We demonstrate robust simulation of robotic tasks at interactive rates, with accurately resolved stiction and contact transitions, supporting effective sim-to-real transfer.
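
For reference, the Hunt & Crossley normal-force law named above can be written as f = k x^n (1 + d v), clamped at zero, where x is the penetration depth and v the penetration rate. The sketch below evaluates that textbook form only; the parameter names and the default exponent are assumptions, and it does not reproduce the paper's convex approximation or Drake's implementation.

```python
def hunt_crossley_force(k, d, x, v, n=1.5):
    """Normal contact force from the Hunt & Crossley model.

    k: contact stiffness, d: dissipation coefficient (s/m),
    x: penetration depth (positive when in contact),
    v: penetration rate (positive while the bodies approach),
    n: elastic exponent (1.5 corresponds to Hertzian sphere contact).
    The force is clamped at zero so the contact never pulls.
    """
    if x <= 0.0:
        return 0.0
    return max(0.0, k * x ** n * (1.0 + d * v))


print(hunt_crossley_force(k=1000.0, d=0.5, x=0.01, v=0.0, n=1.0))
print(hunt_crossley_force(k=1000.0, d=0.5, x=0.01, v=-3.0, n=1.0))
```

The second call shows the clamp at work: during fast separation the damping term would make the force negative, so it is set to zero instead.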

Abstract:
This article presents an impact-aware manipulation framework for logistics, where growing e-commerce demands have increased the need for faster and more flexible package handling solutions. The proposed framework addresses swiftly grabbing and placing objects in depalletizing tasks with dual-arm robotic systems. Impact-aware robotics leverages intentional collisions to achieve dynamic interactions and has the potential to be faster and more energy efficient than classical quasi-static approaches. The generation of desired impacts brings multiple challenges encompassing robust motion generation, managing impacts with objects, enforcing hardware and safety constraints, contact state sensing, and simulation of contact behavior. To tackle these challenges, we developed within the EU-funded Impact-Aware Manipulation (I.AM.) project an integrated solution exploiting nonsmooth mechanics for impact modeling, dynamical systems for motion generation, QP-based control for constraint enforcement, internal state sensing without external force transducers, and batch-capable impact simulation. This article highlights the benefits of the proposed approach in terms of speed (a 29% decrease in average task time) and energy efficiency (a 35% decrease) through a systematic comparison between classical grabbing and impact-aware swift grabbing and tossing. In summary, our article underscores the transformative potential of impact-aware technologies in revolutionizing robotic logistics operations.

Abstract:
During robotic disaster relief missions, state estimation still faces significant challenges, especially when GNSS is denied or sensor perception undergoes degradation. In this paper, we introduce a degradation-aware LiDAR-Thermal-Inertial SLAM, DaLiTI, that leverages the complementary nature of multi-modal information to achieve robust and precise state estimation in perceptually challenging environments. The system utilizes an iterated error state Kalman filter (IESKF) to loosely integrate LiDAR, thermal infrared camera, and IMU measurements. We propose an adaptive fusion mechanism that dynamically weights and fuses LiDAR and thermal measurements based on real-time modal quality to prevent failure information from propagating throughout the system. Experimental results demonstrate that, compared with state-of-the-art methods, DaLiTI maintains competitive performance in conventional environments and exhibits superior robustness and accuracy in degraded scenarios such as fire scenes or chemical plants with gas leaks. Our implementation is available at https://github.com/HITSZ-NRSL/DaLiTI.

Abstract:
Can interactive vision-and-language agents learn not just what to say but also when to say it? Current language models rarely plan over whether and when to realize a real-time response to a user. However, providing accurate and timely support for human decision-making, such as when guiding visually impaired individuals through urban environments, requires careful real-time responsiveness: poorly timed responses can distract users or add unnecessary cognitive load. As a machine intelligence challenge for Multimodal Large Language Model (MLLM)-based agents, we introduce a large-scale multimodal benchmark for an egocentric, assistive navigation task in complex outdoor environments. Using this benchmark, we uncover a fundamental limitation of off-the-shelf MLLMs in delivering safe and time-sensitive navigation instructions, even with model finetuning on substantial amounts of data. We then demonstrate that a simple yet effective modification of the model, including direct supervision to predict the underlying reason for each instruction, yields significant performance gains across open-loop, closed-loop, and sim-to-real generalization settings. Nevertheless, our analysis highlights persistent challenges in temporal reasoning, safety-critical object awareness, and relational and distance understanding. To advance the development of scalable assistive agents, we will release our simulation, benchmark, and code (available at project website: https://timeli-icra.github.io/).

Abstract:
Collecting demonstrations through human teleoperation is an effective approach for learning complex manipulation skills. However, challenges such as morphology gaps, control latency, and limited feedback make high-quality data collection costly and inefficient. In this paper, we introduce Neural Teleoperation, a shared-autonomy system that integrates human guidance with a robust grasping policy using a learning-based policy switcher. This hybrid framework allows users to focus on high-level planning while delegating fine-grained control to an autonomous policy when needed. Our system supports both immersive VR devices and lightweight 6-DoF controllers, making dexterous hand teleoperation more accessible. Real-world experiments across six manipulation tasks show that Neural Teleop increases success rates and reduces demonstration collection time compared to state-of-the-art baselines.

Abstract:
When users work with AI agents, they form conscious or subconscious expectations of them. Meeting user expectations is crucial for such agents to engage in successful interactions and teaming. However, users may form expectations of an agent that differ from the agent's planned behaviors. These differences lead to the consideration of two separate decision models in the planning process to generate explicable behaviors. However, little has been done to incorporate safety considerations, especially in a learning setting. We present Safe Explicable Policy Search (SEPS), which aims to provide a learning approach to explicable behavior generation while minimizing the safety risk, both during and after learning. We formulate SEPS as a constrained optimization problem where the agent aims to maximize an explicability score subject to constraints on safety and a suboptimality criterion based on the agent's model. SEPS combines the capabilities of Constrained Policy Optimization and Explicable Policy Search to bring the generation of safe explicable behaviors to domains with continuous state and action spaces, which is critical for robotic applications. We evaluate SEPS in safety-gym environments and with a physical robot experiment to show its efficacy and relevance in human-AI teaming.

Abstract:
Robot learning primarily relies on centralized training. While it provides the infrastructure, centralization limits parallel and collaborative learning among robots and places a significant computational load on the central server, indicating the need for federated learning (FL) in the context of multi-robot training. However, robots trained in a federated setup are subject to non-independent and identically distributed (non-IID) data, resulting in degraded model performance. This extended abstract presents the current state of research aimed at improving robot learning under non-IID conditions in FL. In this regard, this work provides an initial comparative analysis of robot learning methods in centralized and federated training setups, with an emphasis on the impact of non-IID data on learning behaviour in a simulation environment. The results highlight the differences in learning stability across algorithms and show the influence of non-IID goal distributions on performance.

Abstract:
Batch planning is increasingly necessary to quickly produce diverse and quality motion plans for downstream learning applications, such as distillation and imitation learning. This paper presents Global Tensor Motion Planning (GTMP), a sampling-based motion planning algorithm comprising only tensor operations. We introduce a novel discretization structure represented as a random multipartite graph, enabling efficient vectorized sampling, collision checking, and search. We provide a theoretical investigation showing that GTMP exhibits probabilistic completeness while supporting modern GPUs/TPUs. Additionally, by incorporating smooth structures into the multipartite graph, GTMP directly plans smooth splines without requiring gradient-based optimization. Experiments on lidar-scanned occupancy maps and the MotionBenchMaker dataset demonstrate GTMP's computational efficiency in batch planning compared to baselines, underscoring GTMP's potential as a robust, scalable planner for diverse applications and large-scale robot learning tasks.

Abstract:
Robots are improving their autonomy with minimal human supervision. However, auditable actions, transparent decision processes, and new human-robot interaction models are still missing requirements to achieve extended robot autonomy. To tackle these challenges, we propose RODEO (RObotic DEcentralized Organization), a blockchain-based framework that integrates trust and accountability mechanisms for robots. This paper formalizes Decentralized Autonomous Organizations (DAOs) for service robots. First, it provides a ROS-ETH bridge between the DAO and the robots. Second, it offers templates that enable organizations (e.g., companies, universities) to integrate service robots into their operations. Third, it provides proof-verification mechanisms that allow robot actions to be auditable. In our experimental setup, a mobile robot was deployed as a trash collector in a lab scenario. The robot collects trash and uses a smart bin to sort and dispose of it correctly. Then, the robot submits a proof of the successful operation and is compensated in DAO tokens. Finally, the robot re-invests the acquired funds to purchase battery charging services. Data collected in a three-day experiment show that the robot doubled its income and reinvested funds to extend its operating time. Proof-validation times of approximately one minute ensured verifiable task execution, while the accumulated robot income funded up to 88 hours of future autonomous operation. The results of this research offer insights into how robots and organizations can coordinate tasks and payments with auditable execution proofs and on-chain settlement.

Abstract:
In this paper, we study adaptive locomotion for inchworm-like robots that move using friction-based pads inspired by snake scales. The robot moves by alternating extension and contraction phases, which require enough friction at the rear and front pads. Locomotion speed depends on how far the body extends, but it is limited by the friction available on the pads. Since the pads are passive, friction can only be controlled by shifting body weight onto them. The challenge is to balance friction and extension length to achieve fast movement on different terrains. Previous work relied on offline tuning for specific terrain types. We propose a new adaptive controller that automatically adjusts the friction requirements by detecting slippage. We demonstrate the approach using a six-degree-of-freedom inchworm-like robot and three locomotion strategies adapted from the literature, which were tested across four terrain types. The locomotion performance is measured by the achieved average locomotion speed, cost of transport, and reliability. Based on the experimental results, the proposed adaptive locomotion achieves performance similar to offline-tuned locomotion and even surpasses it on certain terrains.

Abstract:
Hamilton-Jacobi Reachability offers a framework for generating safe value functions and policies in the face of adversarial disturbance, but is limited by the curse of dimensionality. Physics-informed deep learning is able to overcome this infeasibility, but itself suffers from slow and inaccurate convergence, primarily due to weak PDE gradients and the complexity of self-supervised learning. Recent works have demonstrated that enriching the self-supervision process with regular supervision (based on the nature of the optimal control problem) greatly accelerates convergence and solution quality; however, these have been limited to single-player problems and simple games. In this work, we introduce MADR: MPC-guided Adversarial DeepReach, a general framework to robustly approximate the two-player, zero-sum differential game value function. In doing so, MADR yields the corresponding optimal strategies for both players in zero-sum games as well as safe policies for worst-case robustness. We test MADR on a multitude of high-dimensional simulated and real robotic agents with varying dynamics and games, finding that our approach significantly outperforms state-of-the-art baselines in simulation and produces impressive results in hardware.

Abstract:
In industrial and medical environments, robotic manipulation frequently encounters dual challenges of spatially constrained workspaces and visual obstructions. Wireless actuation methodologies, leveraging non-contact energy transmission and adaptive control mechanisms, offer an innovative solution to address the limitations of physical interconnections and enhance operational adaptability in structurally confined and uncertain obstructed environments. Here, we implemented a non-contact continuum robotic arm based on microwave-driving with polarization-directed guidance. The system incorporates customized flexible printed circuit (FPC) antennas and spatial microwave energy field modulation to enable wireless actuation and multi-degree-of-freedom (DOF) motion control of the robotic end-effector. By integrating spring-supported mechanisms and shape memory alloy (SMA) spring deformation responses, it accomplishes tasks such as obstacle penetration, component grasping, and retrieval, all without physical connections. This study demonstrates precise coupling between directionally controlled microwave energy and mechanical motion in obscured settings, offering a novel approach for non-contact, multi-DOF, and multi-structure robotic operations in sealed or unstructured environments. The proposed methodology significantly expands the potential applications and operational capabilities of robots in complex real-world conditions.

Abstract:
While success in many robotics tasks can be determined by only observing the final state and how it differs from the initial state - e.g., if an apple is picked up - many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs often use single frames, and cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an input of multiple frames, they still fail to detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that are able to capture trajectory-level information such as the path the robot takes by overlaying keypoint trajectories on the final image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs using the aforementioned abstract representations to semantically ground the robot's behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset containing 653 human and 369 robot demonstrations across 13 task categories. MotIF assesses the success of robot motion given task and motion instructions. Our model significantly outperforms state-of-the-art VLMs and video LMs by at least twice in F1 score with high precision and recall, generalizing across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in ranking trajectories on how they align with task and motion descriptions.

Abstract:
Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for planning under partial observability problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization on today's hardware, but parallelizing POMDP solvers has been challenging. Most solvers rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation which analytically solves part of the optimization component, leaving numerical computations to consist of only estimation of expectations. VOPP represents all data structures related to planning as a collection of tensors, and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel online solver with no dependencies or synchronization bottlenecks between concurrent processes. Experimental results indicate that VOPP is at least 20X more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver. Moreover, VOPP outperforms state-of-the-art sequential online solvers, while using a planning budget that is 1000X smaller.

Abstract:
Robust manipulation often hinges on a robot's ability to perceive extrinsic contacts, i.e., contacts between a grasped object and its surrounding environment. However, these contacts are difficult to observe through vision alone due to occlusions, limited resolution, and ambiguous near-contact states. In this paper, we propose a visual-auditory method for extrinsic contact estimation that integrates global scene information from vision with local contact cues obtained through active audio sensing. Our approach equips a robotic gripper with contact microphones and conduction speakers, enabling the system to emit and receive acoustic signals through the grasped object to detect external contacts. We train our perception pipeline entirely in simulation and transfer it zero-shot to the real world. To bridge the sim-to-real gap, we introduce a real-to-sim audio hallucination technique, injecting real-world audio samples into simulated scenes with ground-truth contact labels. The resulting multimodal model accurately estimates both the location and size of extrinsic contacts across a range of cluttered and occluded scenarios. We further demonstrate that explicit contact prediction significantly improves policy learning for downstream contact-rich manipulation tasks. Project webpage: https://va2contact.github.io

Abstract:
This paper addresses the urgent need for rapid synthesis of georeferenced orthoimages in post-disaster scenarios, where pre-disaster satellite maps cannot be directly reused due to significant urban changes. Drone swarms offer the advantages of large scale, wide aerial view, and rapid coverage of disaster-stricken areas. However, synthesizing georeferenced orthoimages within limited time remains challenging without camera calibration, primarily due to inevitable inconsistencies in intrinsics and extrinsics across different cameras, as well as sensor errors. To tackle this issue, we propose OrthoSwarm, a parallelizable calibration-free system architecture that leverages drone swarms' rectilinear path planning and pre-disaster satellite maps for efficient orthoimage synthesis. OrthoSwarm's performance is validated on a self-constructed benchmark dataset, generated by drone swarms in a digital twin city covering 3 natural disaster scenarios (debris, waterlogging, haze), with real-world validation using real single-drone aerial videos split into segments to simulate swarm acquisition. Experimental results from both simulated and real-captured data confirm the effectiveness of the proposed approach, enabling fast and visually consistent georeferenced orthoimage synthesis in stable post-disaster environments to support first responders promptly.

Abstract:
Flexible robotic manipulators are rapidly gaining traction in automotive assembly to boost productivity and adaptability. Conventional end-effector systems depend heavily on custom tooling engineered for individual curved parts, a strategy that drives up reconfiguration costs, limits interoperability across different product lines, and increases downtime between production runs. We propose an underactuated end-effector that integrates a metasheet of dome-shaped bistable units interconnected into actuation groups, individually addressable via pneumatic inflation. This arrangement permits transitions between multiple stable configurations, each corresponding to a distinct curvature profile, allowing the manipulator to accommodate different objects found on assembly lines. By tuning the geometry of the proposed end-effector, the system triggers transitions in targeted groups, reconfiguring the system's overall shape to conform to diverse part geometries. This flexibility enables a single manipulator platform to handle a broad family of components without the expense and downtime associated with bespoke tooling changes. By leveraging intrinsic compliance and multistability, the proposed approach strikes an effective balance between mechanical complexity and operational simplicity.

Abstract:
Autonomous manipulation of powders remains a significant challenge for robotic automation in scientific laboratories. The inherent variability and complex physical interactions of powders in flow, coupled with variability in laboratory conditions, necessitate adaptive automation. This work introduces FLIP, a flowability-informed powder weighing framework designed to enhance robotic policy learning for granular material handling. Our key contribution lies in using material flowability, quantified by the angle of repose, to optimise physics-based simulations through Bayesian inference. This yields material-specific simulation environments capable of generating accurate training data, which reflects diverse powder behaviours, for training robot chemists. Building on this, FLIP integrates quantified flowability into a curriculum learning strategy, fostering efficient acquisition of robust robotic policies by gradually introducing more challenging, less flowable powders. We validate the efficacy of our method on a robotic powder weighing task under real-world laboratory conditions. Experimental results show that FLIP with a curriculum strategy achieves a low dispensing error of 2.12 ± 1.53 mg, outperforming methods that do not leverage flowability data, such as domain randomisation (6.11 ± 3.92 mg). These results demonstrate FLIP's improved ability to generalise to previously unseen, more cohesive powders and to new target masses.

Abstract:
Grey-box methods for system identification combine deep learning with physics-informed constraints, capturing complex dependencies while improving out-of-distribution generalization. Despite the growing importance of floating-base systems such as humanoids and quadrupeds, current grey-box models ignore their specific physical constraints. For instance, the inertia matrix is not only positive definite but also exhibits branch-induced sparsity and input independence. Moreover, the 6×6 composite spatial inertia of the floating base inherits properties of single-rigid-body inertia matrices. As we show, this includes the triangle inequality on the eigenvalues of the composite rotational inertia. To address the lack of physical consistency in deep learning models of floating-base systems, we introduce a parameterization of inertia matrices that satisfies all these constraints. Inspired by Deep Lagrangian Networks (DeLaN), we train neural networks to predict physically plausible inertia matrices that minimize inverse dynamics error under Lagrangian mechanics. For evaluation, we collected and released a dataset on multiple quadrupeds and humanoids. In these experiments, our Floating-Base Deep Lagrangian Networks (FeLaN) achieve better overall performance on both simulated and real robots, while providing greater physical interpretability.
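
The triangle inequality mentioned above can be illustrated with a minimal Python sketch. It only checks three given principal moments of inertia for positivity and for the triangle inequality; it is not the parameterization proposed in the paper. The function names and the tolerance are assumptions made for illustration.

```python
def is_physically_plausible(principal_moments, tol=1e-12):
    """Check positivity and the triangle inequality on principal moments.

    For a rigid body, every principal moment is positive and no moment
    exceeds the sum of the other two. (Hypothetical helper, not from the paper.)
    """
    a, b, c = sorted(principal_moments)
    # After sorting, only the largest moment can violate the inequality.
    return a > 0 and a + b + tol >= c


def cuboid_principal_moments(mass, wx, wy, wz):
    """Principal moments of a uniform solid cuboid with side lengths wx, wy, wz."""
    k = mass / 12.0
    return (k * (wy**2 + wz**2), k * (wx**2 + wz**2), k * (wx**2 + wy**2))


# A real body passes the check; an arbitrary positive triple may not.
box = cuboid_principal_moments(mass=2.0, wx=0.1, wy=0.2, wz=0.3)
box_ok = is_physically_plausible(box)
fake_ok = is_physically_plausible((1.0, 1.0, 3.0))
```

A network that predicts three unconstrained positive values can produce triples like the second one, which is why a constraint-satisfying parameterization is useful.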

Abstract:
Robot simulators are indispensable tools across many fields, and recent research has significantly improved their functionality by incorporating additional gradient information. However, existing differentiable robot simulators suffer from non-differentiable singularities when robots undergo substantial shape changes. To address this, we present the Shape-Differentiable Robot Simulator (SDRS), designed to be differentiable under significant robot shape changes. The core innovation of SDRS lies in its representation of robot shapes using a set of convex polyhedrons. This approach allows us to generalize smooth, penalty-based contact mechanics for interactions between any pair of convex polyhedrons. Using the separating hyperplane theorem, SDRS introduces a separating plane for each pair of contacting convex polyhedrons. This separating plane functions as a zero-mass auxiliary entity, with its state determined by the principle of least action. This setup ensures global differentiability, even as robot shapes undergo significant geometric and topological changes. To demonstrate the practical value of SDRS, we provide examples of robot co-design scenarios,

Abstract:
This paper introduces Dr-PoGO, a method for Simultaneous Localization And Mapping (SLAM) using a 2D spinning radar. Unlike cameras or lidars that require line-of-sight, millimetre-wave radars can 'see' through dust, falling snow, rain, etc. Accordingly, they are a great modality for robust perception regardless of the weather conditions. While most existing radar-based SLAM methods rely on the extraction of point clouds or features to perform ego-motion estimation, Dr-PoGO leverages direct registration techniques for odometry (DRO) and loop-closure registration. An off-the-shelf radar-focused place recognition algorithm, RaPlace, provides loop-closure candidates. As RaPlace does not provide relative transformations, Dr-PoGO introduces a coarse-to-fine registration that uses visual features and descriptors to obtain an initial guess for the direct transformation refinement. The global trajectory is optimized in a pose-graph optimization. Dr-PoGO demonstrates state-of-the-art performance over 300 km of data in various real-world automotive environments. Our implementation is publicly available: https://github.com/utiasASRL/dr_pogo.

Abstract:
Dynamic object segmentation plays a critical role in many visual applications such as static scene reconstruction from dynamic videos. However, existing optical flow-based methods fail to ensure consistent static/dynamic segmentation along object boundaries, while 3D reconstruction-based approaches are highly sensitive to reconstruction errors. To address these limitations, we present a dynamic object segmentation framework that can generate both precise and complete dynamic masks by integrating multimodal cues including 2D point tracks, 3D reconstruction, and semantic information. We design a network combining Transformer architectures with feature clustering aggregation modules to perform static/dynamic classification of multimodal feature trajectories. It enables the model to adaptively determine which type of feature should dominate based on the characteristics of each scene, while also mitigating the impact of feature degradation. Additionally, we introduce a novel point-query-based SAM post-processing method capable of handling multiple objects within a single mask. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in both dynamic object segmentation and static scene reconstruction tasks.

Abstract:
Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.

Abstract:
Event cameras promise low latency and high dynamic range, yet their sparse output challenges integration into standard robotic pipelines. We introduce EECVS (Efficient Event Camera Volume System), a novel framework that models event streams as continuous-time Dirac impulse trains, enabling artifact-free compression through direct transform evaluation at event timestamps. Our key innovation combines density-driven adaptive selection among DCT, DTFT, and DWT transforms with transform-specific coefficient pruning strategies tailored to each domain's sparsity characteristics. The framework eliminates temporal binning artifacts while automatically adapting compression strategies based on real-time event density analysis. On EHPT-XC and MVSEC datasets, our framework achieves superior reconstruction fidelity with DTFT delivering the lowest earth mover distance. In downstream segmentation tasks, EECVS demonstrates robust generalization. Notably, our approach demonstrates exceptional cross-dataset generalization: when evaluated with EventSAM segmentation, EECVS achieves mean IoU 0.87 on MVSEC versus 0.44 for voxel grids at 24 channels, while remaining competitive on EHPT-XC. Our ROS2 implementation provides real-time deployment with DCT processing achieving 1.5 ms latency and 2.7× higher throughput than alternative transforms, establishing the first adaptive event compression framework that maintains both computational efficiency and superior generalization across diverse robotic scenarios.

Abstract:
In nature, birds exhibit outstanding attitude control, enabling flexible and efficient takeoff, hovering and landing capabilities that have not been fully replicated. Thus, we introduce the lightweight bio-inspired rotary-wing drone (L-BIRD). It incorporates a spherical structure, which can imitate birds' attitude variation and land on complex surfaces adaptively. L-BIRD employs a model predictive control (MPC) framework to enable real-time tracking of bird-like attitude trajectories derived from bio-inspired parameter pairs. To facilitate lightweight deployment on resource-constrained hardware platforms, we improve the MPC framework with a multi-path primal-dual neural network (PDNN), matrix sparsity, and multiplicative optimization. Experimental results, both in simulations and real-world deployments, demonstrate that L-BIRD realizes accurate and efficient biomimetic attitude control and diverse environmental adaptability. The attitude trajectory mean-square error (MSE) decreases to 0.0042 rad, and random access memory (RAM) usage is reduced by 39.3%.

Abstract:
Recent advancements in legged robot locomotion have facilitated traversal over increasingly complex terrains. Despite this progress, many existing approaches rely on end-to-end deep reinforcement learning (DRL), which poses limitations in terms of safety and interpretability, especially when generalizing to novel terrains. To overcome these challenges, we introduce VOCALoco, a modular skill-selection framework that dynamically adapts locomotion strategies based on perceptual input. Given a set of pre-trained locomotion policies, VOCALoco evaluates their viability and energy consumption by predicting both the safety of execution and the anticipated cost of transport over a fixed planning horizon. This joint assessment enables the selection of policies that are both safe and energy-efficient, given the observed local terrain. We evaluate our approach on staircase locomotion tasks, demonstrating its performance in both simulated and real-world scenarios using a quadrupedal robot. Empirical results show that VOCALoco achieves improved robustness and safety during stair ascent and descent compared to a conventional end-to-end DRL policy.
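
The joint safety and energy assessment described above can be sketched in a few lines of Python. The abstract does not state how the two predictions are combined, so the rule used here (discard skills below a safety threshold, then take the lowest predicted cost of transport) is an assumption, as are the class name, field names, and threshold value.

```python
from dataclasses import dataclass


@dataclass
class SkillPrediction:
    """Hypothetical per-skill predictions over the planning horizon."""
    name: str
    safety_prob: float        # predicted probability of safe execution
    cost_of_transport: float  # predicted cost of transport


def select_skill(predictions, safety_threshold=0.9):
    """Return the cheapest skill among those predicted to be safe, or None."""
    safe = [p for p in predictions if p.safety_prob >= safety_threshold]
    if not safe:
        return None  # no viable skill; the caller decides on a fallback
    return min(safe, key=lambda p: p.cost_of_transport)


candidates = [
    SkillPrediction("walk_flat", safety_prob=0.40, cost_of_transport=0.6),
    SkillPrediction("stair_ascent", safety_prob=0.97, cost_of_transport=1.4),
    SkillPrediction("stair_ascent_slow", safety_prob=0.99, cost_of_transport=1.9),
]
chosen = select_skill(candidates)
```

In this example the cheapest skill is rejected as unsafe, and the less costly of the two remaining stair skills is selected.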

Abstract:
In autonomous robot navigation, terrain cost assignment is typically performed using a semantics-based paradigm in which terrain is first labeled using a pre-trained semantic classifier and costs are then assigned according to a user-defined mapping between label and cost. While this approach is rapidly adaptable to changing user preferences, only preferences over the types of terrain that are already known by the semantic classifier can be expressed. In this paper, we hypothesize that a machine-learning-based alternative to the semantics-based paradigm above will allow for rapid cost assignment adaptation to preferences expressed over new terrains at deployment time without the need for additional training. To investigate this hypothesis, we introduce and study PACER, a novel approach to costmap generation that accepts as input a single bird's-eye view (BEV) image of the surrounding area along with a user-specified preference context and generates a corresponding BEV costmap that aligns with the preference context. Using both real and synthetic data along with a combination of proposed training tasks, we find that PACER is able to adapt quickly to new user preferences while also exhibiting better generalization to novel terrains compared to both semantics-based and representation-learning approaches.

Abstract:
In recent years, the illicit use of unmanned aerial vehicles (UAVs) for deliveries in restricted areas such as prisons has become a significant security challenge. While numerous studies have focused on UAV detection or localization, little attention has been given to the identification of delivery events. This study presents the first acoustic package delivery detection algorithm using a ground-based microphone array. The proposed method estimates both the drone's propeller speed and the delivery event using solely acoustic features. A deep neural network detects the presence of a drone and estimates the propellers' rotation speed, or blade passing frequency (BPF), from a mel spectrogram. The algorithm analyzes the BPFs to identify probable delivery moments based on sudden changes before and after a specific time. Results demonstrate a mean absolute error of the blade passing frequency estimator of 16 Hz when the drone is less than 150 meters away from the microphone array. The drone presence detection estimator has an accuracy of 97%. The delivery detection algorithm correctly identifies 96% of events with a false positive rate of 8%. This study shows that deliveries can be identified using acoustic signals up to a range of 100 meters.
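
The idea of locating a delivery from a sudden BPF change can be sketched as follows. This is a minimal illustration that compares the mean BPF in a window before and after each time step; it is not the paper's algorithm. The function name, window length, and threshold are hypothetical.

```python
from statistics import mean


def find_bpf_change(bpf_hz, window=5, min_jump_hz=15.0):
    """Return (index, jump) for the largest before/after change in mean BPF.

    `jump` is mean(after) - mean(before) around `index`. Returns None when no
    change reaches `min_jump_hz`. All parameters are illustrative assumptions.
    """
    best = None
    for t in range(window, len(bpf_hz) - window + 1):
        jump = mean(bpf_hz[t:t + window]) - mean(bpf_hz[t - window:t])
        if abs(jump) >= min_jump_hz and (best is None or abs(jump) > abs(best[1])):
            best = (t, jump)
    return best


# Synthetic BPF track: a step change of 20 Hz at sample 10.
track = [200.0] * 10 + [180.0] * 10
change = find_bpf_change(track)
```

A threshold close to the reported 16 Hz estimator error would be fragile in practice, so a real detector would need the threshold and window tuned against measured BPF noise.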

Abstract:
We present a hierarchical framework for city-scale autonomous ride-hailing that integrates vehicle prepositioning, request matching, charging, and facility ingress. A fine-grained mixed-integer program (MIP) coordinates prepositioning and matching on short horizons, while a coarse-grained Deployment+Summoning decomposition enforces charger/parking capacities at scale. On ride-hail traces, the method increases coverage and reduces wait relative to greedy and decoupled baselines, while keeping charger overuse near zero under rolling-horizon execution. We detail boundary-condition handling for 24/7 operations and specify a concrete RL training/validation protocol for a constraint-aware hybrid in which learned policies act tactically under a MIP-based safety shield.

Abstract:
The interest in combining model-based control approaches with diffusion models has been growing. Although we have seen many impressive robotic control results in difficult tasks, the performance of diffusion models is highly sensitive to the choice of scheduling parameters, making parameter tuning one of the most critical challenges. We introduce Linear Path Model-Based Diffusion (LP-MBD), which replaces the variance-preserving schedule with a flow-matching-inspired linear probability path. This yields a geometrically interpretable and decoupled parameterization that reduces tuning complexity and provides a stable foundation for adaptation. Building on this, we propose Adaptive LP-MBD (ALP-MBD), which leverages reinforcement learning to adjust diffusion steps and noise levels according to task complexity and environmental conditions. Across numerical studies, Brax benchmarks, and mobile-robot trajectory tracking, LP-MBD simplifies scheduling while maintaining strong performance, and ALP-MBD further improves robustness, adaptability, and real-time efficiency.

Abstract:
Learning manipulation skills from human demonstration videos presents a promising yet challenging problem, primarily due to the significant embodiment gap between the human body and robot manipulators. Existing methods rely on paired datasets or hand-crafted rewards, which limit scalability and generalization. We propose TrajSkill, a framework for Trajectory Conditioned Cross-embodiment Skill Transfer, enabling robots to acquire manipulation skills directly from human demonstration videos. Our key insight is to represent human motions as sparse optical flow trajectories, which serve as embodiment-agnostic motion cues by removing morphological variations while preserving essential dynamics. Conditioned on these trajectories together with visual and textual inputs, TrajSkill jointly synthesizes temporally consistent robot manipulation videos and translates them into executable actions, thereby achieving cross-embodiment skill transfer. Extensive experiments on simulation data (MetaWorld) show that TrajSkill reduces FVD by 39.6% and KVD by 36.6% compared with the state-of-the-art, and improves cross-embodiment success rate by up to 16.7%. Real-robot experiments in kitchen manipulation tasks further validate the effectiveness of our approach, demonstrating practical human-to-robot skill transfer across embodiments.

Abstract:
Recent text-to-4D generation methods have achieved remarkable progress thanks to advances in text-to-video models. Existing approaches typically reconstruct 4D scenes from generated videos or distill them from pre-trained text-to-video models. However, these methods often restrict the scene to a local region or lack spatial controllability. TC4D pioneered trajectory-controllable 4D asset generation by decomposing motion into global transformation and local deformation. While it achieves high visual quality, TC4D suffers from extremely low generation efficiency due to its NeRF-based framework. To overcome this limitation, we propose Efficient TC4DGS, which replaces NeRF with 4D Gaussian Splatting (4DGS) to significantly improve efficiency. Nevertheless, the discrete representation of 4DGS makes optimization challenging, leading to noticeable degradation in visual and motion quality. Thus, we propose a HexPlane-based 4D representation combined with a key-node control scheme. By computing the deformation only for the control nodes and obtaining the overall deformation through interpolation, we greatly improve generation efficiency while maintaining quality. Compared with TC4D, the previous SOTA, we improve generation efficiency by 13× (reducing the generation time from 26 hours to 2 hours), while also achieving superior performance in terms of the dynamic quality of the generated objects.

Abstract:
This paper introduces MipSLAM, a frequency-aware 3D Gaussian Splatting (3DGS) SLAM framework capable of high-fidelity anti-aliased novel view synthesis and robust pose estimation under varying camera configurations. Existing 3DGS-based SLAM systems often suffer from aliasing artifacts and trajectory drift due to inadequate filtering and purely spatial optimization. To overcome these limitations, we propose an Elliptical Adaptive Anti-aliasing (EAA) algorithm that approximates Gaussian contributions via geometry-aware numerical integration, avoiding costly analytic computation. Furthermore, we present a Spectral-Aware Pose Graph Optimization (SA-PGO) module that reformulates trajectory estimation in the frequency domain, effectively suppressing high-frequency noise and drift through graph Laplacian analysis. Extensive evaluations on Replica and TUM datasets demonstrate that MipSLAM achieves state-of-the-art rendering quality and localization accuracy across multiple resolutions.

Abstract:
Autonomous inspection tasks require path-planning algorithms to efficiently gather observations from points of interest (POIs). However, localization errors in urban environments introduce execution uncertainty, posing challenges to successfully completing such tasks. Existing inspection-planning algorithms do not explicitly address this uncertainty, which can hinder their performance. To overcome this, we introduce IRIS-Under-Uncertainty (IRIS-U²), an inspection-planning algorithm that provides statistical assurances regarding coverage, path length, and collision probability. Our approach builds upon IRIS, our deterministic, highly efficient, and provably asymptotically optimal inspection-planning framework. This extension adapts IRIS to uncertain settings using a refined search procedure that estimates POI coverage probabilities through Monte Carlo (MC) sampling. We demonstrate IRIS-U² through a case study on bridge inspections, achieving improved expected coverage, reduced collision probability, and increasingly precise statistical guarantees as MC samples grow. Additionally, we explore bounded suboptimal solutions to reduce computation time while preserving statistical assurances.
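
Estimating a POI coverage probability through Monte Carlo sampling can be illustrated with a small Python sketch. It assumes an isotropic Gaussian position error around a nominal 2D waypoint and a circular sensor footprint; both assumptions, and all names and parameters, are illustrative and not taken from the paper.

```python
import math
import random


def estimate_coverage_probability(nominal_xy, poi_xy, sensor_range,
                                  sigma, n_samples, seed=0):
    """Fraction of sampled executions in which the POI is within sensor range.

    Each sample perturbs the nominal position with zero-mean Gaussian noise
    of standard deviation `sigma` per axis (assumed error model).
    """
    rng = random.Random(seed)  # seeded for reproducible estimates
    hits = 0
    for _ in range(n_samples):
        x = nominal_xy[0] + rng.gauss(0.0, sigma)
        y = nominal_xy[1] + rng.gauss(0.0, sigma)
        if math.hypot(poi_xy[0] - x, poi_xy[1] - y) <= sensor_range:
            hits += 1
    return hits / n_samples


# Same waypoint and POI, evaluated without and with localization error.
p_exact = estimate_coverage_probability((0.0, 0.0), (1.0, 0.0), 2.0, 0.0, 100)
p_noisy = estimate_coverage_probability((0.0, 0.0), (1.0, 0.0), 2.0, 1.5, 1000)
```

Because the estimate is a sample mean, its statistical error shrinks as the number of samples grows, which matches the abstract's remark that guarantees become more precise with more MC samples.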

Abstract:
Bipeds have demonstrated high agility and mobility in unstructured environments such as sand. The yielding of such granular media brings significant sinkage and slip of the bipedal feet, leading to uncertainty and instability of walking locomotion. We present a new dynamics-modeling approach to capture and predict bipedal-walking locomotion on granular media. A dynamic foot-terrain interaction model is integrated to compute the ground reaction force (GRF). The proposed granular dynamic model has three additional degrees of freedom (DoFs) to estimate foot sinkage and slip that are critical to capturing robot-walking kinematics and kinetics such as cost of transport (CoT). Using the new model, we analyze bipedal kinetics, CoT, and foot-terrain rolling and intrusion effects. Experiments are conducted using a biped robotic walker on sand to validate the proposed dynamic model with robot-gait profiles, media-intrusion prediction, and GRF estimations. This new dynamics model can further serve as an enabling tool for locomotion control and optimization of bipedal robots to efficiently walk on granular terrains.
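
For readers unfamiliar with the metric, the cost of transport is commonly defined as the energy spent divided by weight times distance travelled, which makes it dimensionless. The short Python sketch below uses that common definition; the abstract does not state the exact form used in the paper, and the function name and arguments are assumptions.

```python
STANDARD_GRAVITY = 9.80665  # m/s^2


def cost_of_transport(energy_j, mass_kg, distance_m, gravity=STANDARD_GRAVITY):
    """Dimensionless cost of transport: energy / (mass * gravity * distance)."""
    if mass_kg <= 0 or distance_m <= 0:
        raise ValueError("mass and distance must be positive")
    return energy_j / (mass_kg * gravity * distance_m)


# Hypothetical numbers: a 10 kg walker spending 500 J to travel 5 m.
cot = cost_of_transport(energy_j=500.0, mass_kg=10.0, distance_m=5.0)
```

On yielding terrain, sinkage and slip raise the energy spent per unit distance, so the same gait yields a higher CoT than on rigid ground.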

Abstract:
Collaborative perception is pivotal for the large-scale deployment of autonomous driving, yet it has long grappled with the trade-off between perception accuracy and bandwidth consumption. Existing methods fail to analyze the fine-grained characteristics of Field of View (FoV), leading to inefficient bandwidth utilization. To address this, we propose a Context-adaptive Collaborative Perception framework, termed CaCP. This method optimizes bandwidth usage by employing distinct collaboration strategies for FoV under varying contexts, thereby reducing communication overhead while maintaining perception accuracy. Additionally, CaCP introduces a novel spatial fusion of intermediate and late fusion strategies, yielding a more flexible collaborative scheme. Extensive experiments across multiple datasets encompassing both simulated (OPV2V) and real-world (V2V4Real) scenarios demonstrate that CaCP establishes a new state-of-the-art trade-off between accuracy and bandwidth. Notably, it reduces bandwidth consumption by up to 17% compared to previous works while achieving competitive or superior perception performance.

Abstract:
Bimanual grasping is essential for robots to handle large and complex objects. However, existing methods either focus solely on single-arm grasping or employ separate grasp generation and bimanual evaluation stages, leading to coordination problems including collision risks and unbalanced force distribution. To address these limitations, we propose BiGraspFormer, a unified end-to-end transformer framework that directly generates coordinated bimanual grasps from object point clouds. Our key idea is the Single-Guided Bimanual (SGB) strategy, which first generates diverse single grasp candidates using a transformer decoder, then leverages their learned features through specialized attention mechanisms to jointly predict bimanual poses and quality scores. This conditioning strategy reduces the complexity of the 12-DoF search space while ensuring coordinated bimanual manipulation. Comprehensive simulation experiments and real-world validation demonstrate that BiGraspFormer consistently outperforms existing methods while maintaining efficient inference speed (<0.05s), confirming the effectiveness of our framework. Code and supplementary materials are available at https://sites.google.com/bigraspformer

Authors: Dejanira Araiza Illan, Kevin Baum, Helen Beebee, Raja Chatila, Sarah Moth-Lund Christensen, Simon Coghlan, Emily Charlotte Collins, Susannah Kate Devitt, Alcino Cunha, Anna Dobrosovestnova, Duijf Hein, Vanessa Evers, Michael Fisher, Nadin Kökciyan, Nico Hochgeschwender, Séverin Lemaignan, Francisco Javier Rodríguez Lera, Sara Ljungblad, Martin Magnusson, Masoumeh Mansouri, Michael J Milford, AJung Moon, Thomas M. Powers, Pericle Salvini, Teresa Scantamburlo, Nick Schuster, Slavkovik Marija, Ufuk Topcu, Daniel Fernando Preciado Vanegas, Andrzej Wasowski, Yi Yang

Affiliations: Johnson & Johnson; German Research Center for Artificial Intelligence (DFKI); University of Leeds; University of Melbourne; Queensland University of Technology; INESC TEC and Universidade Do Minho; TU Wien; Utrecht University; University of Twente; University of Manchester; University of Edinburgh; University of Bremen; Universidad De León; Lots Design; Örebro University; Birmingham University; McGill University; University of Delaware; Responsible Technology Institute, University of Oxford; Ca Foscari University; Australian National University; University of Bergen; The University of Texas at Austin; Vrije Universiteit Amsterdam; IT University of Copenhagen; KU Leuven

Abstract:
This document presents the outcomes of the Dagstuhl Seminar "Roadmap for Responsible Robotics," held in September 2023 at the Leibniz Centre for Informatics, Schloss Dagstuhl, Germany. The seminar brought together researchers from Robotics, Computer Science, Social and Cognitive Sciences, and Philosophy with the aim of charting a path towards improving responsibility in robotic systems. Through intensive interdisciplinary discussions centered on the various values at stake as robotics increasingly integrates into human life, the participants identified key priorities to guide future research and regulatory efforts. The resulting roadmap outlines actionable steps to ensure that robotic systems co-evolve with human societies, promoting human agency and humane values rather than undermining them. Designed for diverse stakeholders (researchers, policymakers, industry leaders, practitioners, NGOs, and civil society groups), this roadmap provides a foundation for collaborative efforts toward responsible robotics.

Abstract:
Multi-robot systems are widely used for coverage tasks that require efficient coordination across large environments. In Multi-Robot Coverage Path Planning (MCPP), the objective is typically to minimize the makespan by generating non-overlapping paths for full-area coverage. However, most existing methods assume uniform importance across regions, limiting their effectiveness in scenarios where some zones require faster attention. We introduce the Priority-Aware MCPP (PA-MCPP) problem, where a subset of the environment is designated as prioritized zones with associated weights. The goal is to minimize, in lexicographic order, the total priority-weighted latency of zone coverage and the overall makespan. To address this, we propose a scalable two-phase framework combining (1) greedy zone assignment with local search, spanning-tree-based path planning, and (2) Steiner-tree-guided residual coverage. Experiments across diverse scenarios demonstrate that our method significantly reduces priority-weighted latency compared to standard MCPP baselines, while maintaining competitive makespan. Sensitivity analyses further show that the method scales well with the number of robots and that zone coverage behavior can be effectively controlled by adjusting priority weights.
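
The lexicographic objective described above can be made concrete with a short Python sketch. It assumes that a zone's latency is the time at which its coverage completes and that a plan is summarized by per-zone completion times and per-robot finish times; these modelling choices and all names are assumptions for illustration, not the paper's formulation.

```python
def evaluate_plan(zone_cover_times, zone_weights, robot_finish_times):
    """Return (priority-weighted latency, makespan) for one candidate plan.

    Python compares tuples lexicographically, so comparing the returned
    tuples ranks plans by weighted latency first and by makespan second.
    """
    latency = sum(zone_weights[z] * t for z, t in zone_cover_times.items())
    makespan = max(robot_finish_times)
    return (latency, makespan)


weights = {"zone_a": 3.0, "zone_b": 1.0}  # hypothetical priority weights

# Plan 1 reaches the heavily weighted zone early but finishes later overall.
plan_1 = evaluate_plan({"zone_a": 2.0, "zone_b": 8.0}, weights, [40.0, 50.0])
# Plan 2 has a shorter makespan but covers the prioritized zone late.
plan_2 = evaluate_plan({"zone_a": 6.0, "zone_b": 3.0}, weights, [30.0, 32.0])

best = min(plan_1, plan_2)
```

Here plan 1 wins despite its longer makespan, because makespan only breaks ties in priority-weighted latency.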

Abstract:
Isoperimetric robots are large-scale, untethered inflatable robots that can undergo large shape changes, but have only been demonstrated in one 3D shape, an octahedron. These robots consist of independent triangles that can change shape while maintaining their perimeter by moving the relative position of their joints. We introduce an optimization routine that determines if an arbitrary graph can be partitioned into unique triangles, and thus be constructed as an isoperimetric robotic system. We enumerate all minimally rigid graphs that can be constructed with unique triangles up to 9 nodes (7 triangles), and characterize the workspace of one node of each of these robots. We also present a method for constructing larger graphs that can be partitioned by assembling subgraphs that are already partitioned into triangles. This enables a wide variety of isoperimetric robot configurations.
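
To make the notion of partitioning a graph into unique triangles concrete, the Python sketch below verifies a given partition, i.e., that every edge of the graph belongs to exactly one triangle. It only checks a candidate partition and does not search for one, so it is not the optimization routine from the paper. The function name and the vertex labelling of the octahedron are assumptions.

```python
from itertools import combinations


def is_triangle_partition(edges, triangles):
    """True if the triangles use every edge of the graph exactly once."""
    edge_set = {frozenset(e) for e in edges}
    used = set()
    for tri in triangles:
        if len(set(tri)) != 3:
            return False  # a triangle needs three distinct vertices
        for pair in combinations(tri, 2):
            e = frozenset(pair)
            if e not in edge_set or e in used:
                return False  # edge missing from the graph or shared
            used.add(e)
    return used == edge_set


# Octahedron: 6 vertices, with opposite pairs (0,1), (2,3), (4,5) not connected.
opposite = {frozenset(p) for p in [(0, 1), (2, 3), (4, 5)]}
octahedron_edges = [p for p in combinations(range(6), 2)
                    if frozenset(p) not in opposite]

# Four alternating faces cover all 12 edges without sharing any edge.
octahedron_triangles = [(0, 2, 4), (0, 3, 5), (1, 2, 5), (1, 3, 4)]
valid = is_triangle_partition(octahedron_edges, octahedron_triangles)
```

The edge counts are consistent with the figures in the abstract: a minimally rigid 3D graph with n nodes has 3n - 6 edges, giving 12 edges (4 triangles) for the octahedron and 21 edges (7 triangles) for 9 nodes.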

Abstract:
Open-vocabulary 3D segmentation enables the exploration of 3D spaces using free-form text descriptions. Existing methods for open-vocabulary 3D instance segmentation primarily focus on identifying object-level instances in a scene. However, they face challenges when it comes to understanding more fine-grained scene entities such as object parts, or regions described by generic attributes. In this work, we introduce Search3D, an approach that builds a hierarchical open-vocabulary 3D scene representation, enabling the search for entities at varying levels of granularity: fine-grained object parts, entire objects, or regions described by attributes like materials. Our method aims to expand the capabilities of open-vocabulary instance-level 3D segmentation by shifting towards a more flexible open-vocabulary 3D search setting less anchored to explicit object-centric queries, compared to prior work. To ensure a systematic evaluation, we also contribute a scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan, along with a set of open-vocabulary fine-grained part annotations on ScanNet++. We verify the effectiveness of Search3D across several tasks, demonstrating that our approach outperforms baselines in scene-scale open-vocabulary 3D part segmentation, while maintaining strong performance in segmenting 3D objects and materials.

Abstract:
Achieving high-precision control for robotic systems is hindered by low-fidelity dynamical models and external disturbances. In particular, the intricate coupling between internal uncertainties and external disturbances further exacerbates this challenge. This study introduces an effective and convergent algorithm that enables accurate estimation of the coupled disturbance by combining control and learning philosophies. Concretely, by resorting to Chebyshev series expansion, the coupled disturbance is decomposed into an unknown parameter matrix and two known structures that depend on the system state and the external disturbance, respectively. A regularized least squares process is subsequently formalized to learn the parameter matrix using historical time-series data. Furthermore, a polynomial disturbance observer is specifically devised to achieve a high-precision estimation of the coupled disturbance by utilizing the learned structure portion. Extensive simulations and real flight tests validate the effectiveness of the proposed framework. We believe this work can offer a new pathway to integrate learning approaches into control frameworks for addressing longstanding challenges in robotic applications.
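
The regularized least squares step can be illustrated with a minimal sketch. It assumes a much simpler setting than the abstract describes: a scalar disturbance that depends on a single scalar state in [-1, 1], a parameter vector instead of a matrix, and synthetic data. The function names (`chebyshev_features`, `solve_linear`, `ridge_fit`) and the regularization weight `lam` are hypothetical and are not taken from the paper.

```python
def chebyshev_features(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via T_{k+1} = 2 x T_k - T_{k-1}."""
    feats = [1.0, x]
    for _ in range(2, degree + 1):
        feats.append(2.0 * x * feats[-1] - feats[-2])
    return feats[: degree + 1]


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(xs, ds, degree, lam):
    """Regularized least squares: theta = (Phi^T Phi + lam I)^-1 Phi^T d."""
    phi = [chebyshev_features(x, degree) for x in xs]
    k = degree + 1
    gram = [[sum(p[i] * p[j] for p in phi) + (lam if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(p[i] * d for p, d in zip(phi, ds)) for i in range(k)]
    return solve_linear(gram, rhs)


# Synthetic "historical data" (assumption): d(x) = 0.5 T_0 - 1.0 T_1 + 0.25 T_3
xs = [-1.0 + 2.0 * i / 49 for i in range(50)]
ds = [0.5 - 1.0 * x + 0.25 * (4 * x**3 - 3 * x) for x in xs]
theta = ridge_fit(xs, ds, degree=3, lam=1e-8)
```

With noise-free data and a small `lam`, `theta` should be close to the generating coefficients `[0.5, -1.0, 0.0, 0.25]`; a larger `lam` trades fit accuracy for robustness to noisy measurements.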

Abstract:
Video flow, depth, and panoptic segmentation are fundamental to diverse robotic perception and computer vision applications. Despite recent advances in specialized approaches, several inherent limitations remain challenging: first, training and running inference with three separate models is computationally costly; second, separate training prohibits learning underlying feature representations and knowledge from other tasks. In this work, we address these challenges by reformulating video flow estimation, depth estimation and panoptic segmentation as a sequence of feature correspondence matching, updating and tracking problems. This approach allows these tasks to be addressed by a single architecture that compares feature similarities across frames. By incorporating a shared feature representation with distinct prediction heads, our model can simultaneously predict consistent and reliable optical flow, depth maps, and object masks for videos. We further demonstrate that this universal model maintains temporal consistency across tasks while requiring no task-specific re-training. Extensive experiments on the FlyingThings, Sintel, VKITTI, KITTI, and VIPSeg benchmarks demonstrate superior performance. Furthermore, the model exhibits zero-shot performance on unseen wild scenes.

Abstract:
Event-based Gaussian splatting (GS) reconstruction has recently attracted considerable attention. Existing methods usually assume that camera poses are known a priori, or struggle to process long event streams because they lack robustness when poses are unknown. In this work, we present ED-SLAM, an Event-Depth Gaussian Splatting-based simultaneous localization and mapping (SLAM) pipeline, which is robust to long event streams and does not require ground-truth camera poses. The pipeline achieves high-accuracy pose estimation and high-fidelity 3D reconstruction thanks to the impressive 3D representation capability of Gaussian splatting. In particular, we propose a novel patch-based event-depth tracking algorithm and seamlessly integrate it into the Gaussian splatting mapping pipeline. Extensive experiments on both synthetic and real-world datasets demonstrate that our method significantly improves tracking accuracy and robustness, and also delivers improved reconstruction performance.

Abstract:
Imitation learning from human demonstrations has become a dominant approach for training autonomous robot policies. However, collecting demonstration datasets is costly: it often requires access to robots and needs sustained effort in a tedious, long process. These factors limit the scale of data available for training policies. We aim to address this scalability challenge by involving a broader audience in a gamified data collection experience that is both accessible and motivating. Specifically, we develop a gamified remote teleoperation platform, RoboCade, to engage general users in collecting data that is beneficial for downstream policy training. To do this, we embed gamification strategies into the design of the system interface and data collection tasks. In the system interface, we include components such as visual feedback, sound effects, goal visualizations, progress bars, leaderboards, and badges. We additionally propose principles for constructing gamified tasks that have overlapping structure with useful downstream target tasks. We instantiate RoboCade on three manipulation tasks, including spatial arrangement, scanning, and insertion. To illustrate the viability of gamified robot data collection, we collect a demonstration dataset through our platform, and show that co-training robot policies with this data can improve success rate on non-gamified target tasks (+16-56%). Further, we conduct a user study to validate that novice users find the gamified platform significantly more enjoyable than a standard non-gamified platform (+24%). These results highlight the promise of gamified data collection as a scalable, accessible, and engaging method for collecting demonstration data. Videos are available at robocade.github.io.

Abstract:
Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear). However, to unlock such multimodal potential, robots need to learn the correlations between an object's visual appearance and the sound it generates when they interact with it. Such an active sensorimotor experience requires new interaction capabilities, representations, and exploration methods to guide the robot in efficiently building increasingly rich audiovisual knowledge. In this work, we present CAVER, a novel robot that builds and utilizes rich audiovisual representations of objects. CAVER includes three novel contributions: 1) a novel 3D printed end-effector, attachable to parallel grippers, that excites objects' audio responses, 2) an audiovisual representation that combines local and global appearance information with sound features, and 3) an exploration algorithm that uses and builds the audiovisual representation in a curiosity-driven manner that prioritizes interacting with high uncertainty objects to obtain good coverage of surprising audio with fewer interactions. We demonstrate that CAVER builds rich representations in different scenarios more efficiently than several exploration baselines, and that the learned audiovisual representation leads to significant improvements in material classification and the imitation of audio-only human demonstrations.

Abstract:
Modeling and control of physical systems remain challenging for purely data-driven methods, which often lack interpretability and fail to leverage prior knowledge. Model-structured neural networks (MSNNs) embed physical laws into neural architectures; however, their design and implementation can be nontrivial. We present nnodely, an open-source framework that simplifies MSNN development through a modular workflow, improving interpretability, data efficiency, and deployment on resource-constrained platforms. The paper highlights the framework's features, positions it within the landscape of existing tools, and demonstrates its effectiveness in two case studies. nnodely is released under the MIT license and is available at https://github.com/tonegas/nnodely

Abstract:
Multirotors are often required to enter confined narrow tunnels that are barely accessible to humans in applications such as inspection and search and rescue. This task is extremely challenging: the lack of geometric features and illumination, together with the limited field of view, causes problems in perception, while the restricted space and significant ego airflow disturbances induce control issues. This paper introduces an autonomous aerial system designed for navigation through tunnels as narrow as 0.5 m in diameter. The real-time and online system includes a virtual omni-directional perception module tailored for the mission and a novel motion planner that incorporates perception and ego airflow disturbance factors modeled using camera projections and computational fluid dynamics analyses, respectively. Extensive flight experiments on a custom-designed quadrotor are conducted in multiple realistic narrow tunnels to validate the superior performance of the system, even over human pilots, proving its potential for real applications. Additionally, a deployment pipeline on other multirotor platforms is outlined and open-source packages are provided for future developments.

Abstract:
We present a sampling-based model predictive control (MPC) framework that enables emergent locomotion without relying on handcrafted gait patterns or predefined contact sequences. Our method discovers diverse motion patterns, ranging from trotting to galloping, robust standing policies, jumping, and handstand balancing, purely through the optimization of high-level objectives. Building on model predictive path integral (MPPI), we propose a cubic Hermite spline parameterization that operates on position and velocity control points. Our approach enables contact-making and contact-breaking strategies that adapt automatically to task requirements, requiring only a limited number of sampled trajectories. This sample efficiency enables real-time control on standard CPU hardware, eliminating the GPU acceleration typically required by other state-of-the-art MPPI methods. We validate our approach on the Go2 quadrupedal robot, demonstrating a range of emergent gaits and basic jumping capabilities. In simulation, we further showcase more complex behaviors, such as backflips, dynamic handstand balancing and locomotion on a Humanoid, all without requiring reference tracking or offline pre-training.
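
A cubic Hermite segment defined by position and velocity control points can be evaluated as in the sketch below. This is the textbook Hermite basis for a single scalar segment, not the authors' implementation; the function name `hermite_segment` and the example values are hypothetical.

```python
def hermite_segment(p0, v0, p1, v1, duration, t):
    """Evaluate a cubic Hermite segment at time t in [0, duration].

    p0, v0: position and velocity control point at the segment start.
    p1, v1: position and velocity control point at the segment end.
    Returns (position, velocity) at time t.
    """
    s = t / duration  # normalized time in [0, 1]

    # Standard cubic Hermite basis functions.
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    pos = h00 * p0 + h10 * duration * v0 + h01 * p1 + h11 * duration * v1

    # Derivatives of the basis with respect to s; the chain rule adds 1/duration.
    dh00 = 6 * s**2 - 6 * s
    dh10 = 3 * s**2 - 4 * s + 1
    dh01 = -6 * s**2 + 6 * s
    dh11 = 3 * s**2 - 2 * s
    vel = (dh00 * p0 + dh01 * p1) / duration + dh10 * v0 + dh11 * v1
    return pos, vel


# Example: one joint moves from 0.0 rad to 2.0 rad in 0.5 s.
start = hermite_segment(0.0, 1.0, 2.0, -1.0, duration=0.5, t=0.0)
end = hermite_segment(0.0, 1.0, 2.0, -1.0, duration=0.5, t=0.5)
```

By construction the segment reproduces the control points at its ends, so `start` matches `(p0, v0)` and `end` matches `(p1, v1)`. In a sampling-based scheme, the control points would be the quantities being perturbed, which keeps every sampled trajectory smooth.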

Abstract:
Reliable state estimation hinges on accurate specification of sensor noise covariances, which weigh heterogeneous measurements. In practice, these covariances are difficult to identify due to environmental variability, front-end preprocessing, and other reasons. We address this by formulating noise covariance estimation as a bilevel optimization that, from a Bayesian perspective, factorizes the joint likelihood of so-called odometry and supervisory measurements, thereby balancing information utilization with computational efficiency. The factorization converts the nested Bayesian dependency into a chain structure, enabling efficient parallel computation: at the lower level, an invariant extended Kalman filter with state augmentation estimates trajectories, while a derivative filter computes analytical gradients in parallel for upper-level gradient updates. The upper level refines the covariance to guide the lower-level estimation. Experiments on synthetic and real-world datasets show that our method achieves higher efficiency than existing baselines.

Abstract:
Inspired by the human ability to selectively focus on relevant information, this paper introduces relevance, a novel dimensionality reduction process that enables robots to identify relevant elements in a scene and generate responses that are seamless, fast, and accurate. To accurately and efficiently quantify relevance, we developed an event-based framework that maintains a continuous perception of the scene, evaluates cue sufficiency within the scene, and selectively triggers relevance determination. Within this framework, we developed a probabilistic methodology that considers various factors and is built on a novel structured scene representation. Both simulations and experimental results demonstrate the effectiveness of our relevance concept, as well as the proposed framework and methods for relevance quantification. Simulation results demonstrate that the relevance framework and methodology accurately predict the relevance of a general Human Robot Collaboration (HRC) setup, achieving a precision of 0.99, a recall of 0.94, an F1 score of 0.96, and an object ratio of 0.94. Relevance demonstrates broad benefits across multiple aspects of HRC, yielding a 79.56% reduction in task planning time compared with a state-of-the-art (SOTA) task planner for a cereal task, a 26.53% decrease in perception latency for object detection, an improvement of up to 13.50% in HRC safety, and an 80.84% reduction in the number of inquiries required during collaboration. A real-world demonstration highlights the effectiveness of the relevance framework, together with its modules, in providing intelligent and seamless assistance to humans during everyday tasks.
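
The reported F1 score can be cross-checked against the reported precision and recall using the standard definition of F1 as their harmonic mean. The helper name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(0.99, 0.94)
print(round(f1, 2))  # prints 0.96
```

The value agrees with the F1 score of 0.96 stated in the abstract.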

Abstract:
Robust robot localization is an important prerequisite for navigation, but it becomes challenging when the map and robot measurements are obtained from different sensors. Prior methods are often tailored to specific environments, relying on closed-set semantics or fine-tuned features. In this work, we extend Monte Carlo Localization with vision-language features, allowing OMCL to robustly compute the likelihood of visual observations given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. These open-vocabulary features enable us to associate observations and map elements from different modalities, and to natively initialize global localization through natural language descriptions of nearby objects. We evaluate our approach using Matterport3D and Replica for indoor scenes and demonstrate generalization on SemanticKITTI for outdoor scenes.
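
The measurement update of Monte Carlo Localization with feature-based likelihoods can be sketched as follows. The likelihood model here, an exponential of the cosine similarity scaled by a `temperature`, is an assumption chosen for illustration and is not the observation model from the paper. The function names and the toy 2D feature vectors are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def update_weights(weights, expected_features, observed_feature, temperature=0.1):
    """Reweight particles by an assumed likelihood, then normalize.

    expected_features[i] is the map feature particle i expects to observe
    from its pose; observed_feature is the feature actually observed.
    """
    scores = [
        w * math.exp(cosine_similarity(f, observed_feature) / temperature)
        for w, f in zip(weights, expected_features)
    ]
    total = sum(scores)
    return [s / total for s in scores]


# Three particles with a uniform prior and toy 2D features.
weights = [1 / 3, 1 / 3, 1 / 3]
expected = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
observed = [0.9, 0.1]
posterior = update_weights(weights, expected, observed)
```

The particle whose expected feature best matches the observation receives the largest posterior weight. A full filter would follow this step with resampling and a motion update.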

Abstract:
Recent interest in mobile manipulation has resulted in a wide range of new robot designs. A large family of these designs focuses on modular platforms that combine existing mobile bases with static manipulator arms. They combine these modules by mounting the arm in a tabletop configuration. However, the operating workspaces and heights for common mobile manipulation tasks, such as opening articulated objects, significantly differ from tabletop manipulation tasks. As a result, these standard arm mounting configurations can result in kinematics with restricted joint ranges and motions. To address these problems, we present the first Concurrent Design approach for mobile manipulators to optimize key arm-mounting parameters. Our approach directly targets task performance across representative household tasks by training a powerful multitask-capable reinforcement learning policy in an inner loop while optimizing over a distribution of design configurations guided by Bayesian Optimization and HyperBand (BOHB) in an outer loop. This results in novel designs that significantly improve performance across both seen and unseen test tasks, and outperform designs generated by heuristic-based performance indices that are cheaper to evaluate but only weakly correlated with the motions of interest. We evaluate the physical feasibility of the resulting designs and show that they are practical and remain modular, affordable, and compatible with existing commercial components. We open-source the approach and generated designs to facilitate further improvements of these platforms.

Abstract:
Recent progress in large-scale imitation learning for robot manipulation has been driven by leveraging datasets across a wide range of robot embodiments. However, achieving significant cross-embodiment transfer is often still challenging. In this work, we study the role of using behavior-aligned representations (e.g., object bounding boxes, language motions, end-effector traces of robot motion) in vision-language-action (VLA) models to promote cross-embodiment transfer. We hypothesize that by possessing invariances across embodiments while being predictive of robot actions, these representations can help unify large-scale cross-embodiment data to enhance transfer. To assess our hypothesis, we develop a simulation-based benchmark designed to assess transfer with diverse cross-embodiment data to new embodiments. Using this benchmark, we compare different representations and ways of incorporating them. We identify that end-effector traces can be particularly beneficial for transfer, representations are generally more useful with larger prior datasets, and can be used to benefit from action-free data. We also demonstrate that they can enhance sim-to-real cross-embodiment transfer, improving task completion progress of real robot policies pre-trained on simulation data by 28%. We provide videos of our evaluations at our website https://ajaysridhar.com/barx.github.io/.

Abstract:
Robust state estimation is central to robotic autonomy, yet classical Kalman filters struggle with frequency-dependent disturbances and model mismatch such as sensor vibrations, electromagnetic interference, and periodic noise. Although Deep Kalman Filter (DKF) variants extend the Extended Kalman Filtering (EKF) framework by learning latent transitions, they lack explicit mechanisms to suppress band-limited noise components that typically corrupt sensor measurements in real-world scenarios. We introduce the Frequency-Weighted Neural Kalman Filter (FW-NKF), a unified hybrid approach that embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation and transition networks. By adapting both the filter spectrum and the latent state representation, FW-NKF attenuates the noise-dominated frequency bands while capturing complex residual structures. We conduct extensive experiments on four heterogeneous benchmarks, including chaotic systems such as multi-dimensional Lorenz systems and full-body inertial pose estimation, and find a reduction in localization error of up to 10% as well as marked improvements in orientation accuracy. Our ablation studies confirm that frequency weighting and deep latent-state modeling contribute to overall performance.

Abstract:
This paper presents Rummaging Using Mutual Information (RUMI), a method for online generation of robot action sequences to gather information about the pose of a known movable object in visually-occluded environments. Focusing on contact-rich rummaging, our approach leverages mutual information between the object pose distribution and robot trajectory for interactive perception. From an observed partial point cloud, RUMI deduces the compatible object pose distribution and approximates its mutual information with workspace occupancy in real time. Based on this, we develop an information gain cost function and a reachability cost function to keep the object within the robot's reach. These are integrated into a model predictive control (MPC) framework with a stochastic dynamics model, updating the pose distribution in a closed loop. Key contributions include a new belief framework for object pose estimation, an efficient information gain computation strategy, and a robust MPC-based control scheme. RUMI demonstrates superior performance in both simulated and real tasks compared to baseline methods.

Abstract:
This letter presents empirical research on the non-reproducibility of light detection and ranging sensor (LiDAR)-inertial odometry (LIO) systems. Although the LIO community has made commendable efforts toward reproducible localization accuracy, noteworthy non-reproducibility remains, thus hindering a fair evaluation of method effectiveness. To better understand such non-reproducibility, we first define non-reproducibility and introduce a quantitative criterion to identify noteworthy non-reproducibility. We then propose five significant non-deterministic implementations that are included in state-of-the-art LIO systems and present solutions for modifying these non-deterministic implementations into deterministic ones. A general procedure is also introduced to identify and pinpoint non-deterministic implementations, regardless of whether they are covered in this letter. Extensive experiments demonstrate that the non-deterministic implementations are the major or potentially sole causes of non-reproducibility under constant experimental conditions. Additionally, the non-reproducibility is noteworthy in datasets obtained from low-vertical-resolution LiDARs or recorded in geometrically degenerate scenes.

Abstract:
Off-road autonomous driving poses significant challenges such as navigating unmapped, variable terrain with uncertain and diverse dynamics. Addressing these challenges requires effective long-horizon planning and adaptable control. Reinforcement Learning (RL) offers a promising solution by learning control policies directly from interaction. However, because off-road driving is a long-horizon task with low-signal rewards, standard RL methods are challenging to apply in this setting. We introduce TADPO, a novel policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. Building on this, we develop a vision-based, end-to-end RL system for high-speed off-road driving, capable of navigating extreme slopes and obstacle-rich terrain. We demonstrate our performance in simulation and, importantly, zero-shot sim-to-real transfer on a full-scale off-road vehicle. To our knowledge, this work represents the first deployment of RL-based policies on a full-scale off-road platform.

Abstract:
Multi-robot systems in automated warehouses must manage continuous streams of pickup-and-delivery tasks while ensuring efficiency and safety. Prior work on Multi-Agent Pickup-and-Delivery (MAPD) has largely focused on the one-to-one variant, where each task has a fixed pickup and delivery location. In contrast, real warehouses often present many-to-many MAPD scenarios, where items, tracked by stock keeping unit (SKU) identifiers, can be retrieved from or stored at multiple locations, resulting in an NP-hard four-dimensional assignment problem. To solve the many-to-many MAPD problem, we contribute our algorithm: Many-to-Many Multi-Agent Pickup and Delivery (M2M). We experiment with two variants of our algorithm: one that minimizes estimated task durations (M2M), and one which incorporates SKU distribution into the objective function (M2M-wSKU). Simulation results over 8-hour warehouse operations show that our method consistently matches or outperforms prior state of the art, with M2M completing up to 22,000 more tasks on average across different environments and warehouse inventory densities.

Abstract:
Unlike a standard camera, which relies on exposure to output frame by frame, an event camera outputs an event only when the change in brightness at a pixel exceeds a threshold, and the outputs of different pixels are independent of each other. Thanks to this bio-inspired design, event cameras offer low latency and high dynamic range. So far, there has been little research on multi-sensor fusion with event cameras. In this paper, we propose FAST-LIEO, a framework for fast, real-time LiDAR-inertial-event odometry. The framework tightly fuses LiDAR and event camera measurements without any feature extraction or matching. In addition, our system supports both LIEO and LIEVO (extended with RGB camera fusion). We design a novel EIO subsystem for LiDAR-event fusion. The EIO subsystem maintains a semi-dense event map and estimates the state by aligning the event representation to the map. The semi-dense event map is built from LiDAR points by utilizing the edge information and temporal information provided by event representations. Besides testing our method on a public benchmark dataset, we also collected real-world data with our sensor suite and conducted experiments on this self-captured dataset. The experimental results show the high robustness and accuracy of our method in challenging conditions while running in real time. To the best of our knowledge, FAST-LIEO is the first system that can tightly fuse LiDAR, IMU, event camera, and standard camera measurements in simultaneous localization and mapping. The source code of FAST-LIEO and our dataset are available at: https://github.com/wsjpla/FAST-LIEO.

Abstract:
In point-line Simultaneous Localization and Mapping (SLAM) systems, the utilization of line structural information and the optimization of lines are two significant problems. The former is usually addressed through structural regularities, while the latter typically involves using minimal parameter representations of lines in optimization. However, separating these two steps loses the constraint information that each could provide to the other. To solve both problems, we anchor lines with similar directions to one principal axis. Specifically, our method models the line-axis probabilistic data association using the Expectation Maximization (EM) algorithm and provides the pipelines for axis creation, updating, and optimization, enhancing the system's robustness and avoiding mismatches. Our system can optimize n co-directional lines with only n+2 parameters, significantly reducing the number of line parameters to be optimized and enabling rapid mapping and tracking. Additionally, considering that most real-world scenes conform to the Atlanta World (AW) hypothesis, we provide an AW constraint by detecting structural lines based on vertical priors and vanishing points. Experimental results and ablation studies on various indoor and outdoor datasets demonstrate the effectiveness of our system.

Abstract:
Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental incorporation of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. We also conduct perturbation-based importance analysis, which reveals adaptive shifts between modalities.

Abstract:
Accounting for heterogeneity among robots and tasks adds complexity to multi-robot task allocation. While existing task allocation methods effectively handle heterogeneity among robots and tasks, they do not scale well in the number of different robots and tasks. To address this gap, we formulate the Team Formation Markov Decision Process (TF-MDP) for training Teamformer: a scalable, decentralized transformer policy for dynamically forming heterogeneous teams of robots to complete diverse tasks. Combining the TF-MDP with the autoregressive capability of transformers enables Teamformer to scale linearly in the number of robots, tasks, and combinations of different heterogeneous robots. Simulations demonstrate Teamformer generalizing to combinations of 100 different types of robots and tasks. Hardware experiments using Georgia Tech's Robotarium show Teamformer decentrally coordinating up to 20 heterogeneous robots for task completion.

Abstract:
This paper tackles the challenge of enabling real-world humanoid robots to perform expressive and dynamic whole-body motions while maintaining stability. We propose ExBody2, a whole-body tracking framework trained in simulation with Reinforcement Learning and then transferred to the real world. The framework decouples keypoint tracking from velocity control and leverages a privileged teacher policy to distill precise mimic skills into the student policy, enabling robust, high-fidelity reproduction of complex motions such as walking, crouching, and dancing. A significant contribution is the identification of an empirical trade-off between feasibility and diversity in motion datasets, which guides the development of an automatic dataset curation method. This principle facilitates pretraining a versatile model that generalizes well across diverse motions and can be fine-tuned for specific tasks to achieve superior tracking accuracy. Extensive experiments show that ExBody2 achieves consistently better performance than strong baselines and provides insights that may inform future work on whole-body humanoid control.

Abstract:
Point cloud single object tracking is critical in autonomous driving. However, current methods heavily rely on frame-by-frame human annotations, which do not scale well with the growing amount of unlabeled LiDAR data. In this paper, we propose the first self-supervised point cloud single object tracking framework, eliminating the need for any manual labels. Our method integrates motion, geometry, and semantic cues to generate plausible object proposals and tracks the target using a predictive filter. Specifically, we generate pseudo labels by clustering local motion patterns from scene flow, while pre-training a proposal network using point cloud forecasting as a proxy task to learn global motion patterns and geometric shape priors. Then, we train the proposal network using the initial pseudo labels and iteratively refine them by treating semantic features as evolving prototypes in each training round. Finally, a simple motion filter is employed to predict the target's current state based on its past dynamics. Evaluated on KITTI, nuScenes, and Waymo, our self-supervised point cloud single object tracking approach is on par with, and in some cases outperforms, fully supervised trackers, demonstrating that self-supervision is a scalable path forward for 3D single object tracking.

Abstract:
Whole-body control (WBC) of humanoid robots has witnessed remarkable progress in skill versatility, enabling a wide range of applications such as locomotion, teleoperation, and motion tracking. Despite these achievements, existing WBC frameworks remain largely task-specific, relying heavily on labor-intensive reward engineering and demonstrating limited generalization across tasks and skills. These limitations hinder their response to arbitrary control modes and restrict their deployment in complex, real-world scenarios. To address these challenges, we revisit existing WBC systems and identify a shared objective across diverse tasks: the generation of appropriate behaviors that guide the robot toward desired goal states. Building on this insight, we propose the Behavior Foundation Model (BFM), a generative model pretrained on large-scale behavioral datasets to capture broad, reusable behavioral knowledge for humanoid robots. BFM integrates a masked online distillation framework with a Conditional Variational Autoencoder (CVAE) to model behavioral distributions, thereby enabling flexible operation across diverse control modes and efficient acquisition of novel behaviors without retraining from scratch. Extensive experiments in both simulation and on a physical humanoid platform demonstrate that BFM generalizes robustly across diverse WBC tasks while rapidly adapting to new behaviors. These results establish BFM as a promising step toward a foundation model for general-purpose humanoid control.

Abstract:
The development of bio-inspired jellyfish robots holds significant benefits for autonomous aquatic systems due to jellyfish's efficient water jet propulsion. However, the current design of jellyfish robots still faces challenges in balancing high biological fidelity with the demands of lightweight, compact design and high-speed locomotion. This study presents a bio-inspired jellyfish robot that emulates the shape and efficient water jet propulsion of natural jellyfish. The origami-based bell and supporting structure are designed to form an efficient water-jet cavity. As the primary components of the robot, they endow the robot with the ability to self-recover to a stable state during locomotion, thereby reducing energy consumption. Additionally, they replace traditional transmission mechanisms, thereby reducing the weight and complexity of the robot. An optimization model is established to determine the optimal parameters of the robot. Furthermore, a near-field magnetic actuation system is designed to drive the robot, enabling contactless and silent underwater driving without waterproofing requirements. The robot features a diameter of 101.6 mm, a height of 63.8 mm, and a weight of 12.5 g. Experimental results demonstrate a maximum locomotion speed of up to 96.2 mm/s.

Abstract:
Reliable localization in prior maps is essential for autonomous navigation, particularly under adverse weather, where optical sensors may fail. We present CFEAR-TR, a teach-and-repeat localization pipeline using a single spinning radar, which is designed for easily deployable, lightweight, and robust navigation in adverse conditions. Our method localizes by jointly aligning live scans to both stored scans from the teach mapping pass, and to a sliding window of recent live keyframes. This ensures accurate and robust pose estimation across different seasons and weather phenomena. Radar scans are represented using a sparse set of oriented surface points, computed from Doppler-compensated measurements. The map is stored in a pose graph that is traversed during localization. Experiments on the held-out test sequences from the Boreas dataset show that CFEAR-TR can localize with errors as low as 0.117 m and 0.096°, corresponding to improvements of up to 63% over the previous state of the art, while running efficiently at 29 Hz. These results substantially narrow the gap to lidar-level localization, particularly in heading estimation. We make the C++ implementation of our work available to the community.

Abstract:
Egocentric videos capture how humans manipulate objects and tools, providing diverse motion cues for learning object manipulation. Unlike the costly, expert-driven manual teleoperation commonly used in training Vision-Language-Action models (VLAs), egocentric videos offer a scalable alternative. However, prior studies that leverage such videos for training robot policies typically rely on auxiliary annotations, such as detailed hand-pose recordings. Consequently, it remains unclear whether VLAs can be trained directly from raw egocentric videos. In this work, we address this challenge by leveraging EgoScaler, a framework that extracts 6DoF object manipulation trajectories from egocentric videos without requiring auxiliary recordings. We apply EgoScaler to four large-scale egocentric video datasets and automatically refine noisy or incomplete trajectories, thereby constructing a new large-scale dataset for VLA pre-training. Our experiments with a state-of-the-art pi_0 architecture in both simulated and real-robot environments yield three key findings: (i) pre-training on our dataset improves task success rates by over 20% compared to training from scratch, (ii) the performance is competitive with that achieved using real-robot datasets, and (iii) combining our dataset with real-robot data yields further improvements. These results demonstrate that egocentric videos constitute a promising and scalable resource for advancing VLA research.

Abstract:
Generating a coherent 3D scene representation from multi-view images is a fundamental yet challenging task. Existing methods often struggle with multi-view fusion, leading to fragmented 3D representations and sub-optimal performance. To address this, we introduce VG3T, a novel multi-view feed-forward network that predicts a 3D semantic occupancy via a 3D Gaussian representation. Unlike prior methods that infer Gaussians from single-view images, our model directly predicts a set of semantically attributed Gaussians in a joint, multi-view fashion. This novel approach overcomes the fragmentation and inconsistency inherent in view-by-view processing, offering a unified paradigm to represent both geometry and semantics. We also introduce two key components, Grid-Based Sampling and Positional Refinement, to mitigate the distance-dependent density bias common in pixel-aligned Gaussian initialization methods. Our VG3T shows a notable 1.7%p improvement in mIoU while using 46% fewer primitives than the previous state-of-the-art on the nuScenes benchmark, highlighting its superior efficiency and performance.

Abstract:
Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons through minimally invasive approaches. To fully realize its potential, however, a precise understanding of the visual data generated during surgical procedures is essential. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions that limit comprehensive understanding. Moreover, the effective application of multi-task learning (MTL) requires sufficient pixel-level segmentation data, which are difficult to obtain due to the high cost and expertise required for annotation. In particular, long-term annotations such as phases and steps are available for every frame, whereas short-term annotations such as surgical instrument segmentation and action detection are provided only for key frames, resulting in a significant temporal-spatial imbalance. To address these challenges, we propose a novel framework that combines optical flow-based segmentation label interpolation with multi-task learning. Optical flow estimated from annotated key frames is used to propagate labels to adjacent unlabeled frames, thereby enriching sparse spatial supervision and balancing temporal and spatial information for training. This integration improves both the accuracy and efficiency of surgical scene understanding and, in turn, enhances the utility of RAS.
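
The label propagation step described in this abstract can be illustrated with a minimal sketch. It assumes a dense backward flow field (from each pixel of the unlabeled frame to its position in the annotated key frame) is already available from some flow estimator; the function name `propagate_labels`, the `-1` "no label" value, and the nearest-neighbour sampling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def propagate_labels(key_labels, backward_flow):
    """Warp a key-frame label map onto a neighbouring unlabeled frame.

    key_labels:    (H, W) integer class ids of the annotated key frame.
    backward_flow: (H, W, 2) flow (dx, dy) from each pixel of the unlabeled
                   frame to its location in the key frame (assumed given).
    Returns an (H, W) label map; pixels whose source falls outside the
    key frame get -1 (hypothetical "no label" marker).
    """
    h, w = key_labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour sampling keeps class ids discrete (no blending).
    src_x = np.rint(xs + backward_flow[..., 0]).astype(int)
    src_y = np.rint(ys + backward_flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.full((h, w), -1, dtype=np.int64)
    out[valid] = key_labels[src_y[valid], src_x[valid]]
    return out
```

Pixels marked `-1` would typically be excluded from the segmentation loss, so that propagated supervision is only used where the flow points to a valid key-frame location.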

Abstract:
We present KoopCast, a lightweight yet efficient model for trajectory forecasting in general dynamic environments. Our approach leverages Koopman operator theory, which enables a linear representation of nonlinear dynamics by lifting trajectories into a higher-dimensional space. The framework follows a two-stage design: first, a probabilistic neural goal estimator predicts plausible long-term targets, specifying where to go; second, a Koopman operator-based refinement module incorporates intention and history into a nonlinear feature space, enabling linear prediction that dictates how to go. This dual structure not only ensures strong predictive accuracy but also inherits the favorable properties of linear operators while faithfully capturing nonlinear dynamics. As a result, our model offers three key advantages: (i) competitive accuracy, (ii) interpretability grounded in Koopman spectral theory, and (iii) low-latency deployment. We validate these benefits on ETH/UCY, the Waymo Open Motion Dataset, and nuScenes, which feature rich multi-agent interactions and map-constrained nonlinear motion. Across benchmarks, KoopCast consistently delivers high predictive accuracy together with mode-level interpretability and practical efficiency.
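
The core idea of linear prediction in a lifted space can be sketched as follows. This is a minimal least-squares (EDMD-style) fit under stated assumptions: the hand-picked lifting `lift` and the function names are hypothetical, and the goal conditioning and learned features described in the abstract are not modeled.

```python
import numpy as np

def lift(x):
    """Hypothetical lifting: state, element-wise squares, and a constant."""
    x = np.atleast_2d(x)
    return np.hstack([x, x ** 2, np.ones((x.shape[0], 1))])

def fit_koopman(X, Y):
    """Least-squares operator K with lift(Y) ~= lift(X) @ K.

    X, Y: (N, d) arrays of states and their one-step successors.
    """
    K, *_ = np.linalg.lstsq(lift(X), lift(Y), rcond=None)
    return K

def predict(K, x0, steps):
    """Roll out linearly in the lifted space and read back the state part."""
    d = np.atleast_2d(x0).shape[1]
    z = lift(x0)
    out = []
    for _ in range(steps):
        z = z @ K               # linear update of the lifted state
        out.append(z[0, :d].copy())
    return np.array(out)
```

For dynamics such as x1' = a*x1, x2' = b*x2 + c*x1^2, the state coordinates evolve exactly linearly in these lifted features, which is why a single matrix `K` can reproduce the nonlinear rollout.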

Abstract:
Robotic pushing is a fundamental manipulation task that requires tactile feedback to capture subtle contact forces and dynamics between the end-effector and the object. However, real tactile sensors often face hardware limitations and deployment challenges, while vision-only policies struggle to achieve satisfactory performance. Inspired by humans' ability to infer tactile states from vision, we propose ViTacGen, a novel robot manipulation framework designed for visual robotic pushing with vision-to-touch generation in reinforcement learning to eliminate the reliance on high-resolution real tactile sensors, enabling effective zero-shot deployment on visual-only robotic systems. Specifically, ViTacGen consists of an encoder-decoder vision-to-touch generation network that generates contact depth images, a standardized tactile representation, directly from visual image sequences, followed by a reinforcement learning policy that fuses visual-tactile data with contrastive learning based on visual and generated tactile observations. We validate the effectiveness of our approach in both simulation and real-world experiments, demonstrating its superior performance and achieving a success rate of up to 86%. Code and data will be open-sourced once the paper is accepted.

Abstract:
Teaching motion skills to robots through demonstrations has become widely popular. However, precise execution of start-, via-, and end-poses at given times is often not guaranteed, limiting the technology transfer to industrial applications. To address this issue, we propose the novel Constrained Expectation Maximization (CEM) algorithm, which enforces time-sensitive constraints (TSC) when learning Gaussian Mixture Models (GMM). Our approach applies to data on Riemannian manifolds and extends to task-parameterized scenarios. We validate CEM against state-of-the-art methods on handwritten data and real robot applications utilizing the KUKA LBR iiwa. By enforcing constraints within the learning process, CEM achieves improved and more efficient reproduction of the demonstration data.

Abstract:
We want a multi-robot team to complete complex tasks in minimum time where the locations of task-relevant objects are not known. Effective task completion requires reasoning over long horizons about the likely locations of task-relevant objects, how individual actions contribute to overall progress, and how to coordinate team efforts. Planning in this setting is extremely challenging: even when task-relevant information is partially known, coordinating which robot performs which action and when is difficult, and uncertainty introduces a multiplicity of possible outcomes for each action, which further complicates long-horizon decision-making and coordination. To address this, we propose a multi-robot planning abstraction that integrates learning to estimate uncertain aspects of the environment with model-based planning for long-horizon coordination. We demonstrate the efficient multi-stage task planning of our approach for 1, 2, and 3 robot teams over competitive baselines in large ProcTHOR household environments. Additionally, we demonstrate the effectiveness of our approach with a team of two LoCoBot mobile robots in real household settings.

Abstract:
The Crazyflie quadcopter is widely recognized as a leading platform for nano-quadcopter research. In early 2025, the Crazyflie Brushless was introduced, featuring brushless motors that provide around 50% more thrust compared to the brushed motors of its predecessor, the Crazyflie 2.1. This advancement has opened new opportunities for research in agile nano-quadcopter control. To support researchers utilizing this new platform, this work presents a dynamics model of the Crazyflie Brushless and identifies its key parameters. Through simulations and hardware analyses, we assess the accuracy of our model. We furthermore demonstrate its suitability for reinforcement learning applications by training an end-to-end neural network position controller and learning a backflip controller capable of executing two complete rotations with a vertical movement of just 1.8 meters. This showcases the model's ability to facilitate the learning of controllers and acrobatic maneuvers that successfully transfer from simulation to hardware. Utilizing this application, we investigate the impact of domain randomization on control performance, offering valuable insights into bridging the sim-to-real gap with the presented model. We have open-sourced the entire project, enabling users of the Crazyflie Brushless to swiftly implement and test their own controllers on an accurate simulation platform.

Abstract:
Autonomous robot navigation systems often rely on hierarchical planning, where global planners compute collision-free paths without considering dynamics, and local planners enforce dynamics constraints to produce executable commands. This discontinuity in dynamics often leads to trajectory tracking failure in highly constrained environments. Recent approaches integrate dynamics within the entire planning process by gradually decreasing its fidelity, e.g., increasing integration steps and reducing collision checking resolution, for real-time planning efficiency. However, they assume that the fidelity of the dynamics should decrease according to a manually designed scheme. Such static settings fail to adapt to environmental complexity variations, resulting in computational overhead in simple environments or insufficient dynamics consideration in obstacle-rich scenarios. To overcome this limitation, we propose Adaptive Dynamics Planning (ADP), a learning-augmented paradigm that uses reinforcement learning to dynamically adjust robot dynamics properties, enabling planners to adapt across diverse environments. We integrate ADP into three different planners and further design a standalone ADP-based navigation system, benchmarking them against other baselines. Experiments in both simulation and real-world tests show that ADP consistently improves navigation success, safety, and efficiency.

Abstract:
Applying micro-patterns to surfaces has been shown to impart useful physical properties such as drag reduction and hydrophobicity. However, current manufacturing techniques cannot produce micro-patterned surfaces at scale due to high-cost machinery and inefficient coverage techniques such as raster-scanning. In this work, we use multiple robots, each equipped with a patterning tool, to manufacture these surfaces. To allow these robots to coordinate during the patterning task, we use the ergodic control algorithm, which specifies coverage objectives using distributions. We demonstrate that robots can divide complicated coverage objectives by communicating compressed representations of their trajectory history both in simulations and experimental trials. Further, we show that robot-produced patterning can lower the coefficient of friction of metallic surfaces. This work demonstrates that distributed multi-robot systems can coordinate to manufacture products that were previously unrealizable at scale.

Abstract:
This paper introduces Knowledge-Guided Massively Multi-task Model-based Policy Optimization (KG-M3PO), a framework for multi-task robotic manipulation in partially observable settings that unifies Perception, Knowledge, and Policy. KG-M3PO couples a model-based policy optimization control backbone with an online 3D scene graph that grounds open-vocabulary detections into a metric, relational representation. A dynamic-relation mechanism updates spatial, containment, and affordance edges at every step, and a graph neural encoder is trained end-to-end through the RL objective so that relational features are shaped directly by control performance. Multiple observation modalities (visual, proprioceptive, linguistic, and graph-based) are encoded into a shared latent space, upon which the RL agent operates to drive the control loop. The policy conditions on lightweight graph queries alongside visual and proprioceptive inputs, yielding a compact, semantically informed state for decision making. Experiments on a suite of manipulation tasks with occlusions, distractors, and layout shifts demonstrate consistent gains over strong baselines: the knowledge-conditioned agent achieves higher success rates, improved sample efficiency, and stronger generalization to novel objects and unseen scene configurations. These results support the premise that structured, continuously maintained world knowledge is a powerful inductive bias for scalable, generalizable manipulation: when the knowledge module participates in the RL computation graph, relational representations align with control, enabling robust long-horizon behavior under partial observability.

Abstract:
Designing reinforcement learning curricula for agile robots traditionally requires extensive manual tuning of reward functions, environment randomizations, and training configurations. We introduce AURA (Autonomous Upskilling with Retrieval-Augmented Agents), a schema-centric curriculum RL framework that leverages Large Language Models (LLMs) as autonomous designers of multi-stage curricula. AURA transforms user prompts into YAML workflows that encode full reward functions, domain randomization strategies, and training configurations. All files are statically validated before any GPU runtime, ensuring reliable and efficient execution with minimal human intervention. A retrieval-augmented feedback loop allows specialized LLM agents to design, execute, and refine curriculum stages based on prior training results stored in a vector database, enabling continual improvement over time. Quantitative experiments show that AURA consistently outperforms LLM-guided baselines in generation success rate, humanoid locomotion, and manipulation tasks. Ablation studies highlight the importance of retrieval for curriculum quality and convergence stability. AURA successfully trains end-to-end policies directly from user prompts and deploys them zero-shot on a custom humanoid robot across multiple environments, enabling robust locomotion on varied terrain and recovery from strong perturbations, capabilities that did not exist previously with manually designed controllers. By abstracting the complexity of curriculum design, AURA enables scalable and adaptive policy learning pipelines that would be complex to construct by hand.

Abstract:
Affordance grounding is a challenging task that aims to locate functional regions in object images enabling potential human-object interactions. One-shot open affordance grounding leverages the generalization capability of visual foundation models to overcome limitations of training data scale. However, existing methods often fail to locate functional regions in complex scenarios due to the lack of fine-grained perception, function-appearance heterogeneity, and the overfitting of affordance prompts to known categories. To improve generalization to unseen categories, we introduce category-conditioned affordance prompt learning, which constructs a complete semantic category-affordance prompt from instance-level visual features. To further improve the accuracy of affordance localization for objects with complex structures, we propose a coarse-to-fine semantic-guided Transformer decoder. This design enhances the decoder's ability to understand the semantic mapping between the affordance words and corresponding object part-level regions. On multiple standard benchmarks, our method achieves competitive performance compared to related methods with less than 1% of the training cost. Notably, our approach shows more robust generalization to unseen objects and novel affordances than the recent SOTA baseline methods.

Abstract:
Micro aerial vehicles (MAVs) have great potential to assist humans in complex tasks, with applications ranging from logistics to emergency response. Their agility makes them ideal for operations in complex and dynamic environments. However, achieving precise control in agile flights remains a significant challenge, particularly due to the underactuated nature of quadrotors and the strong coupling between their translational and rotational dynamics. In this work, we propose a novel nonlinear model predictive control (NMPC) framework based on dual quaternions (DQ-NMPC) for quadrotor flight. By representing both quadrotor dynamics and the pose error directly on the dual-quaternion manifold, our approach enables a compact and globally non-singular formulation that captures the quadrotor's coupled dynamics. We validate our approach through simulations and real-world experiments, demonstrating better numerical conditioning and significantly improved tracking performance, with reductions in position and orientation errors of up to 56.11% and 56.77%, respectively, compared to a conventional baseline NMPC method. Furthermore, our controller successfully handles aggressive trajectories, reaching maximum speeds of up to 13.66 m/s and maximum accelerations of up to 4.2 g, under which the baseline controller fails.
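
To make the dual-quaternion pose representation concrete, the sketch below encodes a rotation and translation as a unit dual quaternion and composes two poses. It uses the standard construction q_d = 0.5 * t * q_r with quaternions stored as (w, x, y, z); the function names are hypothetical, and this illustrates only the representation, not the DQ-NMPC controller itself.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def qconj(q):
    """Quaternion conjugate."""
    return np.asarray(q, dtype=float) * np.array([1.0, -1.0, -1.0, -1.0])

def pose_to_dq(q_rot, t):
    """Pose (unit quaternion, world-frame translation) -> (real, dual) parts."""
    q_rot = np.asarray(q_rot, dtype=float)
    t_quat = np.concatenate([[0.0], np.asarray(t, dtype=float)])
    return q_rot, 0.5 * qmul(t_quat, q_rot)

def dq_to_translation(q_rot, q_dual):
    """Recover the translation: t = 2 * q_dual * conj(q_rot)."""
    return 2.0 * qmul(q_dual, qconj(q_rot))[1:]

def dq_mul(a, b):
    """Compose poses (apply b first, then a): (ar + e*ad)(br + e*bd)."""
    ar, ad = a
    br, bd = b
    return qmul(ar, br), qmul(ar, bd) + qmul(ad, br)
```

A single product `dq_mul(a, b)` updates rotation and translation together, which is the coupling between the two that the abstract refers to.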

Abstract:
Given the recent significant advancements in the video understanding capabilities of Large Language Models (LLMs), there is growing interest in research that automatically generates executable robot task plans from human demonstration videos. Existing LLM-based symbolic planning approaches often rely on manually defined Planning Domain Definition Language (PDDL) domains or fixed action primitives. This paper proposes GPT-PDDL, a framework that infers step-by-step task procedures from demonstration videos and converts them into robot plans based on PDDL.

Abstract:
While in the past industrial robots were strictly separated from humans, today robots serve humans in a variety of industrial applications that also involve close or even physical human-robot interaction. Here, safety is of utmost importance, and thus the design of the control system needs to ensure stable and safe operation. In this context, safety has been mainly addressed for single interaction points. In this article, we present an energy shaping controller that is capable of ensuring safety even in the case of multiple human contact points that may occur when co-manipulating an object. The presented approach is tested and validated in experiments. Results indicate that for the studied co-manipulation task involving time-invariant multiple human contacts, a safe interaction can be achieved.

Abstract:
We present CUBE-LIO, a LiDAR-inertial odometry framework that leverages direct photometric constraints from LiDAR intensity to improve robustness in geometrically degenerate environments. At its core is an efficient cubemap projection that maps LiDAR intensity onto six cube faces, eliminating pole singularities and severe polar distortion. This yields a more uniform overall distortion while avoiding the costly trigonometric operations typical of equirectangular mappings. Building on this representation, we introduce a semi-dense feature selection and direct optimization strategy based on intensity gradient magnitude. This strategy improves resilience to intensity noise and variations induced by range and incidence angle. Photometric constraints are jointly optimized with geometric measurements in a tightly coupled LIO pipeline. CUBE-LIO is sensor-agnostic and supports both spinning and solid-state LiDARs. Experiments on multiple public benchmarks demonstrate state-of-the-art accuracy and real-time performance, with particularly pronounced gains in scenes where the geometric structure is sparse or weak.
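
The cubemap projection mentioned above can be sketched without any trigonometric calls: pick the dominant axis of the point, then divide the other two coordinates by it. The face indexing (0..5 as +X, -X, +Y, -Y, +Z, -Z), the (u, v) orientation, and the function name are illustrative assumptions rather than the layout used by CUBE-LIO.

```python
def cubemap_project(p):
    """Map a 3D point or direction to (face, u, v) with u, v in [0, 1].

    Faces 0..5 stand for +X, -X, +Y, -Y, +Z, -Z (hypothetical convention).
    Only comparisons and divisions are used, so the poles (+Z / -Z) are
    handled like any other direction.
    """
    x, y, z = p
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:
        face, major, a, b = (0 if x > 0 else 1), ax, y, z
    elif ay >= az:
        face, major, a, b = (2 if y > 0 else 3), ay, x, z
    else:
        face, major, a, b = (4 if z > 0 else 5), az, x, y
    if major == 0:
        raise ValueError("the zero vector has no direction")
    # Perspective division onto the unit cube face, then shift to [0, 1].
    u = 0.5 * (a / major + 1.0)
    v = 0.5 * (b / major + 1.0)
    return face, u, v
```

Multiplying `u` and `v` by the face resolution would give the pixel at which a LiDAR return's intensity is stored.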

Abstract:
Generating safe, kinodynamically feasible, and optimal trajectories for complex robotic systems is a central challenge in robotics. This paper presents Safe Model Predictive Diffusion (Safe MPD), a training-free diffusion planner that unifies a model-based diffusion framework with a safety shield to generate trajectories that are both kinodynamically feasible and safe by construction. By enforcing feasibility and safety on all samples during the denoising process, our method avoids the common pitfalls of post-processing corrections, such as computational intractability and loss of feasibility. We validate our approach on challenging non-convex planning problems, including kinematic and acceleration-controlled tractor-trailer systems. The results show that it substantially outperforms existing safety strategies in success rate and safety, while achieving sub-second computation times.

Abstract:
Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We propose MICoBot, a system that enables the human and robot, both using natural language, to take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog, and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot's capabilities (measured by a simulation-pretrained affordance model) and the estimated human's willingness to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. In physical robot trials with 18 unique human participants, MICoBot significantly improves task success and user experience over a pure LLM baseline and standard agent allocation models.

Abstract:
Choosing a locomotion learning paradigm for high-DOF systems such as humanoid robots poses several challenges. Free exploration creates complex reward surfaces that resist efficient exploration, while human motion priors cannot be directly copied due to different mechanical constraints. We present Adaptive Motion Priors with Constrained Optimization (AMPCO), a novel framework that transitions from human reference motions to task-focused optimization within learned behavioral bounds. AMPCO employs a two-phase optimization strategy: (1) Adaptive Imitation Guidance that prioritizes human motion, and (2) Adaptive Reward Weighting for Constrained Optimization that optimizes task objectives while maintaining motion quality within statistically-guaranteed bounds from Phase I. The transition between phases is automatically detected through percentile-based breakout detection from discriminator convergence. AMPCO introduces adaptive weighting mechanisms that smoothly adjust the importance of human imitation based on learning progress. Our experiments on the Unitree G1 humanoid robot in simulation demonstrate that AMPCO reduces energy consumption variance by 67-90% relative to all baseline methods and achieves 70% lower energy consumption than the task-focused baseline, while maintaining velocity tracking accuracy comparable to the best-performing methods, with minimal computational overhead (<0.012% per training cycle).

Abstract:
We present SIPHON, a Salp-Inspired robot designed to utilize Passive Hydrodynamics, and equipped with soft robotic Origami bellows and soft Nozzles. We reveal the construction, including a novel use of an interlocking origami Kresling pattern, along with duckbill and mammal-heart-inspired valves. We derive a physical model for the coupled dynamics of body displacement and body contraction. We show experimental results of pool free swimming trials, and we compare these results to the model. Compared to other power-autonomous, bioinspired pulsed jet swimmers, SIPHON swims with high speed and efficiency, achieving a mean swimming speed of 16.5 cm/s (0.59 Bl/s) and a cost of transport of 4.9 J/m (1.8 W s/(N m)).

Abstract:
Dynamic game theory is an increasingly popular tool for modeling multi-agent, e.g. human-robot, interactions. Game-theoretic models presume that each agent wishes to minimize a private cost function that depends on others' actions. These games typically evolve over a fixed time horizon, specifying how far into the future each agent plans. In practical settings, however, decision-makers may vary in foresightedness, or how much they care about their current cost in relation to their past and future costs. We conjecture that quantifying and estimating each agent's foresightedness from online data will enable safer and more efficient interactions with other agents. To this end, we frame this inference problem as an inverse dynamic game. We consider a specific objective function parametrization that smoothly interpolates myopic and farsighted planning. Games of this form are readily transformed into parametric mixed complementarity problems; we exploit the directional differentiability of solutions to these problems with respect to their hidden parameters to solve for agents' foresightedness. We conduct three experiments: one with synthetically generated delivery robot motion, one with real-world data involving people walking, biking, and driving vehicles, and one using high-fidelity simulators. The results of these experiments demonstrate that explicitly inferring agents' foresightedness enables game-theoretic models to produce 33% more accurate models of agents' behavior.

Abstract:
The fragility of ocular tissues combined with the limited surgical workspace demands precise instrument control and focus, making sensor-integrated robotic systems a promising solution. In this paper, we introduce a surgical system for telemanipulated endolaser photocoagulation that leverages instrument-integrated optical coherence tomography (iiOCT) for accurate distance measurement. We have developed a controller that maintains a specified instrument-to-retina distance, complemented by haptic shared control to assist the ophthalmic surgeon throughout the procedure. We conducted a pilot study involving 12 participants, including an expert vitreoretinal surgeon, to evaluate the system's performance across three levels of user assistance. The distance-based controller demonstrated a significant improvement in axial precision compared to telemanipulated trials, achieving a mean error of 4 micrometers and a standard deviation of 69 micrometers across all subjects. Preliminary experiments conducted on porcine eyes confirmed the feasibility of our approach on ex vivo tissues.

Abstract:
Teaching and learning physical skills often require one-on-one interaction, making it difficult to scale up, as there are not enough human teachers. Robots offer an attractive alternative. This paper presents TeachingBot, an adaptive robotic system that teaches handwriting to human learners through physical interaction. Robot teaching poses two major challenges: (i) adapting to the individual handwriting style of the learner and (ii) maintaining an engaging learning experience. For the first challenge, TeachingBot uses a probabilistic model to capture the learner's writing style from their writing samples. Drawing on the insight that effective teaching balances standardization with individuality, the system generates a personalized teaching trajectory that aligns with the learner's natural writing. For the second challenge, TeachingBot employs variable impedance control to guide the learner, dynamically adjusting the strength of physical guidance based on the learner's performance. Human-subject experiments with 15 participants demonstrate the effectiveness of TeachingBot, showing clear improvement in learners' handwriting and engagement over baseline methods.

Abstract:
Vision-Language-Action (VLA) models often struggle with generalization due to their tendency to memorize training data rather than understanding task semantics. This paper proposes a hierarchical framework that integrates Large Language Models (LLMs) with VLA models to overcome these limitations. By leveraging GPT-4o as a high-level planner, our system decomposes complex instructions into atomic sub-tasks executable by a low-level VLA. We introduce a Home Pose Controller between sub-tasks to ensure physical stability. Experimental results on the LIBERO-10 benchmark demonstrate that our approach achieves a 90% success rate on decomposable tasks, significantly outperforming the 9% baseline of the standalone VLA model.

Abstract:
Monitoring construction sites requires comparing the as-planned design with the as-built state in real time. Visual SLAM offers a lightweight solution but is prone to trajectory drift in construction environments due to repetitive layouts, textureless surfaces, and occlusions. We augment an existing visual SLAM system with structural priors from the Building Information Model (BIM), associating detected walls with their BIM counterparts and including these correspondences as geometric constraints in the back-end optimization. The system operates in real time and is validated on multiple real construction sites, achieving 25.23% average trajectory error reduction and 7.14% map accuracy improvement over state-of-the-art baselines, with demonstrated resilience to incomplete BIM data and as-planned/as-built discrepancies.

Abstract:
Robots that follow natural-language instructions typically rely on either high-level planners with hand-designed interfaces or large end-to-end models that are difficult to deploy for real-time control. We propose TeNet (Text-to-Network), a framework that instantiates compact, task-specific policies directly from natural language. TeNet conditions a hypernetwork on embeddings from a pretrained language model to generate a fully executable policy, which operates solely on low-dimensional state inputs at high control frequencies. By using language only once at policy instantiation, TeNet combines the expressiveness of large language models with efficient execution. To improve generalization, we optionally ground language in behavior during training, without requiring demonstrations at inference. Experiments on MuJoCo and Meta-World show that TeNet produces policies that are orders of magnitude smaller than sequence-based baselines, while achieving strong performance in both multi-task and meta-learning settings and enabling high-frequency control. These results demonstrate that text-conditioned hypernetworks provide a practical approach for compact, language-driven robot control.
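
The "language only once at policy instantiation" idea can be illustrated with a toy hypernetwork. In this sketch a fixed random linear map stands in for a trained hypernetwork, and the embedding is assumed to come from some pretrained language model; the class name `HyperPolicy`, the layer sizes, and the tanh MLP are all hypothetical choices, not TeNet's architecture.

```python
import numpy as np

class HyperPolicy:
    """Toy hypernetwork: maps a text embedding to the weights of a small MLP."""

    def __init__(self, embed_dim, state_dim, hidden_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Parameter shapes of the generated policy: W1, b1, W2, b2.
        self.shapes = [(state_dim, hidden_dim), (hidden_dim,),
                       (hidden_dim, action_dim), (action_dim,)]
        n_params = sum(int(np.prod(s)) for s in self.shapes)
        # Untrained linear hypernetwork (stand-in for a learned one).
        self.H = rng.normal(scale=0.1, size=(embed_dim, n_params))

    def instantiate(self, embedding):
        """Generate a policy once; the returned function needs no language."""
        flat = np.asarray(embedding, dtype=float) @ self.H
        params, i = [], 0
        for shape in self.shapes:
            n = int(np.prod(shape))
            params.append(flat[i:i + n].reshape(shape))
            i += n
        W1, b1, W2, b2 = params

        def policy(state):
            h = np.tanh(np.asarray(state, dtype=float) @ W1 + b1)
            return np.tanh(h @ W2 + b2)   # actions bounded in [-1, 1]

        return policy
```

After `instantiate`, each control step costs only two small matrix products on the low-dimensional state, which is what makes high-frequency execution plausible for such compact policies.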

Abstract:
This work presents a motion retargeting approach for legged robots, aimed at transferring dynamic and agile movements from source motions to robots. In particular, we guide the imitation learning procedures by transferring motions from source to target, effectively bridging the morphological disparities while ensuring the physical feasibility of the target system. In the first stage, we focus on motion retargeting at the kinematic level by generating kinematically feasible whole-body motions from keypoint trajectories. Following this, we refine the motion at the dynamic level by adjusting it in the temporal domain while adhering to physical constraints. This process facilitates policy training via reinforcement learning, enabling precise and robust motion tracking. We demonstrate that our approach successfully transforms noisy motion sources, such as hand-held camera videos, into robot-specific motions that align with the morphology and physical properties of the target robots. Moreover, we demonstrate terrain-aware motion retargeting to perform a backflip on top of a box. We successfully deployed these skills to four robots with different dimensions and physical properties in the real world through hardware experiments.

Abstract:
Planning over unstructured terrain presents a significant challenge in the field of legged robotics. Although recent works in reinforcement learning have yielded various locomotion strategies, planning over multiple experts remains a complex issue. Existing approaches encounter several constraints: traditional planners are unable to integrate skill-specific policies, whereas hierarchical learning frameworks often lose interpretability and require retraining whenever new policies are added. In this paper, we propose a feasibility-guided planning framework that successfully incorporates multiple terrain-specific policies. Each policy is paired with a Feasibility-Net, which learns to predict feasibility tensors based on the local elevation maps and task vectors. This integration allows classical planning algorithms to derive optimal paths. Through both simulated and real-world experiments, we demonstrate that our method efficiently generates reliable plans across diverse and challenging terrains, while consistently aligning with the capabilities of the underlying policies.

Abstract:
Accurate motion forecasting is critical for autonomous driving, yet most predictors rely on multi-object tracking (MOT) with identity association, assuming that objects are correctly and continuously tracked. When tracking fails due to, e.g., occlusion, identity switches, or missed detections, prediction quality degrades and safety risks increase. We present HiMAP, a tracking-free trajectory prediction framework that remains reliable under MOT failures. HiMAP converts past detections into spatiotemporally invariant historical occupancy maps and introduces a historical query module that conditions on the current agent state to iteratively retrieve agent-specific history from unlabeled occupancy representations. The retrieved history is summarized by a temporal map embedding and, together with the final query and map context, drives a DETR-style decoder to produce multi-modal future trajectories. This design lifts identity reliance, supports streaming inference via reusable encodings, and serves as a robust fallback when tracking is unavailable. On Argoverse 2, HiMAP achieves performance comparable to tracking-based methods while operating without IDs, and it substantially outperforms strong baselines in the no-tracking setting, yielding relative gains of 11% in FDE, 12% in ADE, and a 4% reduction in MR over a fine-tuned QCNet. Beyond aggregate metrics, HiMAP delivers stable forecasts for all agents simultaneously without waiting for tracking to recover, highlighting its practical value for safety-critical autonomy. The code is available under: https://github.com/XuYiMing83/HiMAP.

Abstract:
Vision-only grasping systems are fundamentally constrained by calibration errors, sensor noise, and grasp pose prediction inaccuracies, leading to unavoidable contact uncertainty in the final stage of grasping. High-bandwidth tactile feedback, when paired with a well-designed tactile-reactive controller, can significantly improve robustness in the presence of perception errors. This paper contributes to controller design by proposing a purely tactile-feedback grasp-adjustment algorithm. The proposed controller requires neither prior knowledge of the object's geometry nor an accurate grasp pose, and is capable of refining a grasp even when starting from a crude, imprecise initial configuration and uncertain contact points. Through simulation studies and real-world experiments on a 15-DoF arm-hand system (featuring an 8-DoF hand) equipped with fingertip tactile sensors operating at 200 Hz, we demonstrate that our tactile-reactive grasping framework effectively improves grasp stability.

Abstract:
Recent advances in Vision-and-Language Navigation in Continuous Environments (VLN-CE) have leveraged multimodal large language models (MLLMs) to achieve zero-shot navigation. However, existing methods often rely on panoramic observations and two-stage pipelines involving waypoint predictors, which introduce significant latency and limit real-world applicability. In this work, we propose Fast-SmartWay, an end-to-end zero-shot VLN-CE framework that eliminates the need for panoramic views and waypoint predictors. Our approach uses only three frontal RGB-D images combined with natural language instructions, enabling MLLMs to directly predict actions. To enhance decision robustness, we introduce an Uncertainty-Aware Reasoning module that integrates (i) a Disambiguation Module for avoiding local optima, and (ii) a Future-Past Bidirectional Reasoning mechanism for globally coherent planning. Experiments on both simulated and real-robot environments demonstrate that our method significantly reduces per-step latency while achieving competitive or superior performance compared to panoramic-view baselines. These results demonstrate the practicality and effectiveness of Fast-SmartWay for real-world zero-shot embodied navigation.

Abstract:
Post-stroke rehabilitation is often necessary for patients to regain proper walking gait. However, the typical therapy process can be exhausting and physically demanding for therapists, potentially reducing therapy intensity, duration, and consistency over time. We propose a Patient-Therapist Force Field (PTFF) to visualize therapist responses to patient kinematics and a Synthetic Therapist (ST) machine learning model to support the therapist in dyadic robot-mediated physical interaction therapy. The first encodes patient and therapist stride kinematics into a shared low-dimensional latent manifold using a Variational Autoencoder (VAE) and models their interaction through a Gaussian Mixture Model (GMM), which learns a probabilistic vector field mapping patient latent states to therapist responses. This representation visualizes patient-therapist interaction dynamics to inform therapy strategies and robot controller design. The latter is implemented as a Long Short-Term Memory (LSTM) network trained on patient-therapist interaction data to predict therapist-applied joint torques from patient kinematics. Trained and validated using leave-one-out cross-validation across eight post-stroke patients, the model was integrated into a ROS-based exoskeleton controller to generate real-time torque assistance based on predicted therapist responses. Offline results and preliminary testing indicate the potential of their use as an alternative approach to post-stroke exoskeleton therapy. The PTFF provides understanding of the therapist's actions while the ST frees the human therapist from the exoskeleton, allowing them to continuously monitor the patient's nuanced condition.

Abstract:
Biological neural networks continuously adapt and modify themselves in response to experiences throughout their lifetime - a capability largely absent in artificial neural networks. Hebbian plasticity offers a promising path toward rapid adaptation in changing environments. Here, we introduce Hebbian Attractor Networks (HAN), a class of plastic neural networks in which local weight update normalization induces emergent attractor dynamics. Unlike prior approaches, HANs employ dual-timescale plasticity and temporal averaging of pre- and postsynaptic activations to induce either co-dynamic limit cycles or fixed-point weight attractors. Using simulated locomotion benchmarks, we gain insight into how Hebbian update frequency and activation averaging influence weight dynamics and control performance. Our results show that slower updates, combined with averaged pre- and postsynaptic activations, promote convergence to stable weight configurations, while faster updates yield oscillatory co-dynamic systems. We further demonstrate that these findings generalize to high-dimensional quadrupedal locomotion with a simulated Unitree Go1 robot. These results highlight how the timing of plasticity shapes neural dynamics in embodied systems, providing a principled characterization of the attractor regimes that emerge in self-modifying networks.
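
The interplay of update frequency, activation averaging, and weight normalization described above can be illustrated with a minimal sketch. It is not the HAN implementation: the tanh layer, the row-wise L2 normalization, the learning rate, and the names `hebbian_step` and `run` are all assumptions made for illustration.

```python
import math


def hebbian_step(weights, pre_avg, post_avg, lr=0.1):
    """Plain Hebbian update (post * pre) followed by row-wise L2 normalization."""
    new_weights = []
    for row, post in zip(weights, post_avg):
        updated = [w + lr * post * pre for w, pre in zip(row, pre_avg)]
        norm = math.sqrt(sum(w * w for w in updated)) or 1.0
        new_weights.append([w / norm for w in updated])
    return new_weights


def run(weights, inputs, update_every):
    """Average pre/post activations over `update_every` steps, then update."""
    n_out, n_in = len(weights), len(weights[0])
    pre_sum, post_sum = [0.0] * n_in, [0.0] * n_out
    for t, x in enumerate(inputs, start=1):
        # Forward pass of a single tanh layer (assumed architecture).
        y = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]
        pre_sum = [s + xi for s, xi in zip(pre_sum, x)]
        post_sum = [s + yi for s, yi in zip(post_sum, y)]
        if t % update_every == 0:
            pre_avg = [s / update_every for s in pre_sum]
            post_avg = [s / update_every for s in post_sum]
            weights = hebbian_step(weights, pre_avg, post_avg)
            pre_sum, post_sum = [0.0] * n_in, [0.0] * n_out
    return weights


# Alternating inputs; weights are updated every second step from averaged activity.
final = run([[0.6, 0.8], [1.0, 0.0]], [(1.0, 0.0), (0.0, 1.0)] * 10, update_every=2)
```

Setting `update_every=1` updates the weights at every step from instantaneous activations, while larger values apply slower updates from averaged activations, which is the knob the abstract relates to fixed-point versus oscillatory weight dynamics.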

Abstract:
Confidence estimation for stereo matching is crucial for enhancing the reliability and accuracy of depth perception in real-world applications. Despite effectively capturing aleatoric uncertainty through probabilistic modeling and statistical aggregation, current regression-based confidence estimation methods neglect uncertainty arising from unstable training dynamics, resulting in over-confident predictions near occlusion boundaries, textureless regions, and reflective surfaces where errors are most severe. We propose a novel epoch-wise consistency accumulation algorithm that explicitly incorporates training dynamics into confidence estimation. Specifically, we design a full-image cross-epoch alignment mechanism to dynamically quantify pixel-wise training consistency between consecutive epochs, thereby significantly enhancing the estimation of confidence. We further propose a consistency-ranked evidential discrepancy loss, which aligns evidential uncertainty estimates with consistency-derived ordinal supervision, aiming to improve the correlation between confidence scores and actual prediction errors. Our approach is incorporated into MonSter, an advanced stereo baseline, achieving SOTA performance in confidence estimation across KITTI 2012, KITTI 2015 and Middlebury benchmarks.

Abstract:
One simplifying assumption in existing and well-performing task allocation methods is that the robots are single-tasking: each robot operates on a single task at any given time. While this assumption is harmless to make in some situations, it can be inefficient or even infeasible in others. In this paper, we consider assigning multi-robot tasks to multitasking robots. The key contribution is a novel task allocation framework that incorporates the consideration of physical constraints introduced by multitasking. This is in contrast to the existing work where such constraints are largely ignored. After formulating the problem, we propose a compilation to weighted MAX-SAT, which allows us to leverage existing solvers for a solution. A more efficient greedy heuristic is then introduced. For evaluation, we first compare our methods with a modern baseline that is efficient for single-tasking robots to validate the benefits of multitasking in synthetic domains. Then, using a site-clearing scenario in simulation, we further illustrate the complex task interaction considered by the multitasking robots in our approach to demonstrate its performance. Finally, we present a higher-complexity simulation to demonstrate the scalability and applicability of our approach.

Abstract:
Teaching robots novel behaviors typically requires motion demonstrations via teleoperation or kinaesthetic teaching, that is, physically guiding the robot. While recent work has explored using human sketches to specify desired behaviors, data collection remains cumbersome, and demonstration datasets are difficult to scale. In this paper, we introduce an alternative paradigm, Learning from Cross-Modal Instructions, where robots are shaped by demonstrations in the form of rough annotations, which can contain free-form text labels, and are used in lieu of physical motion. We introduce the CrossInstruct framework, which integrates cross-modal instructions as examples into the context input to a foundational vision-language model (VLM). The VLM then iteratively queries a smaller, fine-tuned model, and synthesizes the desired motion over multiple 2D views. These are then fused into a coherent distribution over 3D motion trajectories in the robot's workspace. By incorporating the reasoning of the large VLM with a fine-grained pointing model, CrossInstruct produces executable robot behaviors that generalize beyond the environment in the limited set of instruction examples. We then introduce a downstream reinforcement learning pipeline that leverages CrossInstruct outputs to efficiently learn policies to complete fine-grained tasks. We rigorously evaluate CrossInstruct on benchmark simulation tasks and real hardware, demonstrating effectiveness without additional fine-tuning and providing a strong initialization for policies subsequently refined via reinforcement learning.

Abstract:
Exterior International Space Station (ISS) visual inspection currently requires astronaut extravehicular activity (EVA), a safety risk. Free-flying space robots can perform visual inspection but risk station collision and high astronaut overhead for teleoperation. Existing inspection planners do not effectively co-optimize inspection coverage and energy consumption with consideration of both orbital dynamics and human supervisor situation awareness. This paper presents an inspection trajectory generation pipeline that integrates orbital dynamics with robot coverage path planning methods to assure collision avoidance and investigate situation awareness. Inspection trajectories meet thrust and space robot dynamics constraints while achieving 98% coverage with 17 grams of fuel on a space station model scaled to the ISS. Pareto front analysis balances fuel consumption with coverage directly. Presented solutions show that paths vary as a function of coverage versus energy prioritization. Methods in this paper contribute towards reducing risk posed to astronaut safety during space station operation and maintenance by providing trajectory generation algorithms towards external semi-autonomous in-orbit inspection of complex space structures.

Abstract:
This paper presents a task-oriented computational framework to enhance Visual-Inertial Navigation (VIN) in robots, addressing challenges such as limited time and energy resources. The framework strategically selects visual features using a Mean Square Error (MSE)-based, non-submodular objective function and a simplified dynamic anticipation model. To address the NP‐hardness of this problem, we introduce four polynomial‐time approximation algorithms: a classic greedy method with constant‐factor guarantees; a low‐rank greedy variant that significantly reduces computational complexity; a randomized greedy sampler that balances efficiency and solution quality; and a linearization‐based selector based on a first‐order Taylor expansion for near‐constant‐time execution. We establish rigorous performance bounds by leveraging submodularity ratios, curvature, and element‐wise curvature analyses. Extensive experiments on both standardized benchmarks and a custom control‐aware platform validate our theoretical results, demonstrating that these methods achieve strong approximation guarantees while enabling real‐time deployment.
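
As a rough illustration of the classic greedy method mentioned above, the sketch below picks features one at a time so that each pick gives the largest drop in estimation MSE, taken here as the trace of the inverse information matrix. It is a toy under stated assumptions, not the paper's algorithm: the state is two-dimensional, each feature contributes a rank-1 information term from a hypothetical measurement row `h`, the prior is the identity, and the names `add_feature`, `mse`, and `greedy_select` are invented for this example.

```python
from typing import List, Sequence, Tuple

# Symmetric 2x2 information matrix [[a, b], [b, c]] stored as (a, b, c).
Info = Tuple[float, float, float]


def add_feature(info: Info, h: Sequence[float]) -> Info:
    """Add the rank-1 information h h^T of one scalar measurement."""
    a, b, c = info
    return (a + h[0] * h[0], b + h[0] * h[1], c + h[1] * h[1])


def mse(info: Info) -> float:
    """Trace of the inverse of the 2x2 information matrix."""
    a, b, c = info
    return (a + c) / (a * c - b * b)


def greedy_select(
    features: List[Sequence[float]], k: int, prior: Info = (1.0, 0.0, 1.0)
) -> Tuple[List[int], float]:
    """Pick up to k features, each time taking the one that minimizes the MSE."""
    info = prior
    chosen: List[int] = []
    remaining = list(range(len(features)))
    for _ in range(min(k, len(features))):
        best = min(remaining, key=lambda i: mse(add_feature(info, features[i])))
        info = add_feature(info, features[best])
        chosen.append(best)
        remaining.remove(best)
    return chosen, mse(info)


# Feature 3 is nearly parallel to feature 0, so it adds little once 0 is chosen.
features = [(2.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.9, 0.1)]
chosen, final_mse = greedy_select(features, k=2)
```

With these inputs the greedy pass first takes the strongest feature and then the one covering the orthogonal direction, skipping the near-duplicate, which is the behavior a selection budget is meant to exploit.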

Abstract:
Recent work has demonstrated how one can write high-level specifications for swarm behaviors and automatically create controllers for the individual robots to achieve the overall swarm task. In this paper, we address the question of how to modify, during execution, the desired behavior while maintaining guarantees on the behavior, if possible; we define three types of modification: changing the maximum number of robots in a region of the workspace, changing the connectivity of the workspace, and redistributing robots. During execution, if the specification is modified, we update the controller by creating local patches. Given the starting and ending state of the patch, we jointly use a symbolic synthesis tool and a constraint programming solver to synthesize robot control. We demonstrate our approach in simulation.

Abstract:
Safe, precise teleoperation demands a third-person 3D view that reveals collision clearances and task-critical geometry in full detail. Yet most systems still rely on live camera streams that offer tunnel-vision perspectives and weak depth cues, hiding hazards and denying operators the spatial context for precise manipulation. 3D Gaussian Splatting (GS) enables real-time photorealistic streaming, but acquiring the required multi-view imagery safely and efficiently remains a critical bottleneck in cluttered teleoperation environments. We propose Human-in-the-Loop Gaussian Splatting (HIL-GS) that delivers safe, robust, and efficient 3D scene reconstruction for challenging teleoperation environments. HIL-GS combines three modules in a tightly-coupled loop: (1) motion-aware GS reconstruction that fuses RGB-D and proprioceptive sensors for drift-free and robust mapping under aggressive motions; (2) VR-based informative display that renders the GS map with contextual overlays/feedback in real time to ensure situational awareness and reconstruction completeness; and (3) finger-based control interface to guide the robot toward informative viewpoints through safe, non-redundant motions. Through simulation, real-world experiments, and a user study, we demonstrate that HIL-GS outperforms traditional approaches in reconstruction quality, usability, and efficiency.

Abstract:
To efficiently deploy robotic systems in society, mobile robots need to autonomously and safely move through complex environments. Nonlinear model predictive control (MPC) methods provide a natural way to find a dynamically feasible trajectory through the environment without colliding with nearby obstacles. However, the limited computation power available on typical embedded robotic systems, such as quadrotors, poses a challenge to running MPC in real-time, including its most expensive tasks: constraints generation and optimization. To address this problem, we propose a novel hierarchical MPC scheme that consists of a planning and a tracking layer. The planner constructs a trajectory with a long prediction horizon at a slow rate, while the tracker ensures trajectory tracking at a relatively fast rate. We prove that the proposed framework avoids collisions and is recursively feasible. Furthermore, we demonstrate its effectiveness in simulations and lab experiments with a quadrotor that needs to reach a goal position in a complex static environment. The code is efficiently implemented on the quadrotor's embedded computer to ensure real-time feasibility. Compared to a state-of-the-art single-layer MPC formulation, this allows us to increase the planning horizon by a factor of 5, which results in significantly better performance.

Abstract:
Terrestrial-aerial bimodal vehicles, which integrate the high mobility of aerial robots with the long endurance of ground robots, offer significant potential for autonomous exploration. Given the inherent energy and time constraints in practical exploration tasks, we present a hierarchical framework for the bimodal vehicle to utilize its flexible locomotion modalities for exploration. Beginning with extracting environmental information to identify informative regions, we generate a set of potential bimodal viewpoints. To adaptively manage energy and time constraints, we introduce an extended Monte Carlo Tree Search approach that strategically optimizes both modality selection and viewpoint sequencing. Combined with an improved bimodal vehicle motion planner, we present a complete bimodal energy- and time-aware exploration system. Extensive simulations and deployment on a customized real-world platform demonstrate the effectiveness of our system.

Abstract:
Universal jamming grippers excel at grasping unknown objects due to their compliant bodies. Traditional tactile sensors can compromise this compliance, reducing grasping performance. We present acoustic sensing as a form of morphological sensing, where the gripper's soft body itself becomes the sensor. A speaker and microphone are placed inside the gripper cavity, away from the deformable membrane, fully preserving compliance. Sound propagates through the gripper and object, encoding object properties, which are then reconstructed via machine learning. Our sensor achieves high spatial resolution in sensing object size (2.6 mm error) and orientation (0.6 deg error), remains robust to external noise levels of 80 dBA, and discriminates object materials (up to 100% accuracy) and 16 everyday objects (85.6% accuracy). We validate the sensor in a realistic tactile object sorting task, achieving 53 minutes of uninterrupted grasping and sensing, confirming the preserved grasping performance. Finally, we demonstrate that disentangled acoustic representations can be learned, improving robustness to irrelevant acoustic variations.

Abstract:
We propose the first framework that leverages physically-based inverse rendering for novel lighting generation on existing real-world human demonstrations of robotic manipulation tasks. Specifically, inverse rendering decomposes the first frame in each demonstration into geometric (surface normal, depth) and material (albedo, roughness, metallic) properties, which are then used to render appearance changes under different lighting sources. To improve efficiency and maintain consistency across each generated sequence, we fine-tune Stable Video Diffusion on robot execution videos for temporal lighting propagation. We evaluate our framework by measuring the visual quality of the generated sequences, assessing its effectiveness in improving the imitation learning policy performance (38.75%) under six unseen real-world lighting conditions, and conducting ablation studies on individual modules of the proposed framework. We further showcase three downstream applications enabled by the proposed framework: background generation, object texture generation and distractor positioning.

Abstract:
Healthcare robotics has been gaining traction as a key area of research focused on enhancing human wellness. This paper explores the use of intelligent robots in the beauty industry, specifically within the context of photorejuvenation-based cosmetic dermatology, aimed at improving facial skin aesthetics. The beauty industry, traditionally labor-intensive, is experiencing a critical shortage of skilled beauticians, highlighting the opportunity for robotic technologies to meet this demand. However, integrating robots into cosmetic procedures presents unique challenges, particularly in tasks requiring high precision, such as laser pulse delivery and thermal dose management. This study addresses these challenges by introducing a deep learning approach for trajectory generation in laser path planning and a model-based control strategy for thermal dose regulation. Our empirical results demonstrate that the presented healthcare robots can deliver effective photorejuvenation treatments, suggesting a promising future for increased automation in cosmetic services.

Abstract:
Autonomous racing has become increasingly popular in both academia and industry as a testbed for pushing general autonomous driving modules, such as perception, planning, and control, to their limits. Although traditional control approaches can generate optimal control sequences at the edge of the racing vehicle's physical controllability, they are highly sensitive to the accuracy of modeling parameters, such as tire model coefficients. Meanwhile, end-to-end learning methods are susceptible to distributional shifts, leading to unpredictable and irreversible failures. To address these challenges, this work introduces a physics-constrained imitation learning (PCIL) framework that effectively leverages the advantages of deep learning techniques and knowledge-driven strategies. Specifically, a fallback strategy would be automatically triggered when the vehicle states exceed predefined physical constraints. Meanwhile, the data from the knowledge-driven strategy will be augmented into the original dataset, and repeated re-training using an aggregated dataset could progressively improve PCIL. A series of simulations and real-world shadow testing are conducted at the Yas Marina circuit, and experimental results demonstrate superior performance compared to state-of-the-art methods, which suggests that it provides a promising solution for real-world autonomous racing.

Abstract:
A foundational humanoid motion tracker is expected to be able to track diverse, highly dynamic, and contact-rich motions. More importantly, it needs to operate stably in real-world scenarios against various dynamics disturbances including terrains, external forces, and physical property changes for general practical usage. To achieve this goal, we propose Any2Track (Track Any motions under Any disturbances), a two-stage RL framework to track various motions under multiple disturbances in the real world. Any2Track reformulates dynamics adaptivity as an additional capability on top of basic action execution and consists of two key components: AnyTracker and AnyAdapter. AnyTracker is a general motion tracker with a series of careful designs to track various motions within a single policy. AnyAdapter is a history-informed adaptation module that endows the tracker with the online dynamics adaptivity to overcome the sim2real gap and multiple real-world disturbances. We deploy Any2Track on Unitree G1 hardware and achieve successful sim2real transfer in a zero-shot manner. Any2Track performs remarkably well in tracking various motions under multiple real-world disturbances.

Abstract:
Radio-based methods such as Ultra-Wideband (UWB) and RAdio Detection And Ranging (radar), which have traditionally seen limited adoption in robotics, are experiencing a boost in popularity thanks to their robustness to harsh environmental conditions and cluttered environments. This work proposes a multi-robot UGV-UAV localization system that leverages the two technologies with inexpensive and readily-available sensors, such as Inertial Measurement Units (IMUs) and wheel encoders, to estimate the relative position of an aerial robot with respect to a ground robot. The first stage of the system pipeline includes a nonlinear optimization framework to trilaterate the location of the aerial platform based on UWB range data, and a radar pre-processing module with loosely coupled ego-motion estimation which has been adapted for a multi-robot scenario. Then, the pre-processed radar data as well as the relative transformation are fed to a pose-graph optimization framework with odometry and inter-robot constraints. The system, implemented for the Robot Operating System (ROS 2) with the Ceres optimizer, has been validated in Software-in-the-Loop (SITL) simulations and in a real-world dataset. The proposed relative localization module outperforms state-of-the-art closed-form methods which are less robust to noise. Our SITL environment includes a custom Gazebo plugin for generating realistic UWB measurements modeled after real data. Conveniently, the proposed factor graph formulation makes the system readily extensible to full Simultaneous Localization And Mapping (SLAM). Finally, all the code and experimental data are publicly available to support reproducibility and to serve as a common open dataset for benchmarking.
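
The trilateration step can be pictured as a small nonlinear least-squares problem over range residuals. The sketch below is a planar Gauss-Newton solver in pure Python, offered only as an illustration: the described system works in 3D and uses Ceres, and the function name `trilaterate`, the anchor layout, and the noise-free ranges are assumptions made for this example.

```python
import math


def trilaterate(anchors, ranges, guess=(0.0, 0.0), iters=20):
    """Gauss-Newton on residuals r_i = ||p - a_i|| - range_i (2D sketch)."""
    x, y = guess
    for _ in range(iters):
        # Accumulate the normal equations (J^T J) d = J^T r.
        a11 = a12 = a22 = g1 = g2 = 0.0
        for (ax, ay), rng in zip(anchors, ranges):
            dx, dy = x - ax, y - ay
            dist = math.hypot(dx, dy) or 1e-9  # guard against division by zero
            jx, jy = dx / dist, dy / dist      # Jacobian row of the residual
            res = dist - rng
            a11 += jx * jx
            a12 += jx * jy
            a22 += jy * jy
            g1 += jx * res
            g2 += jy * res
        det = a11 * a22 - a12 * a12
        if abs(det) < 1e-12:
            break  # degenerate anchor geometry; stop iterating
        # Closed-form 2x2 solve, then step p <- p - d.
        x -= (a22 * g1 - a12 * g2) / det
        y -= (a11 * g2 - a12 * g1) / det
    return x, y


# Hypothetical anchors on the ground robot and exact ranges to a target at (1, 2).
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (4.0, 3.0)]
target = (1.0, 2.0)
ranges = [math.hypot(target[0] - ax, target[1] - ay) for ax, ay in anchors]
estimate = trilaterate(anchors, ranges, guess=(2.0, 1.5))
```

An iterative solver of this kind can weigh all range measurements jointly, which is one reason optimization-based trilateration tends to tolerate noise better than closed-form solutions.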

Abstract:
In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models enables robots to reason, plan, and navigate in complex human-centered environments. However, current approaches for constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM's context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements, where nodes are augmented with multi-modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage VLMs to extract scene information, alleviating the need to explicitly model relationship edges between objects, enabling more general, task-agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs by utilizing a hierarchical multi-modal retrieval-augmented generation (RAG) pipeline to extract relevant context from the graph. Evaluated across three distinct benchmarks (3D object semantic segmentation, functional element segmentation, and complex query retrieval), KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency.

Abstract:
Kinematic and hand-eye calibration of robotic arms is a critical research area in robotics, essential to ensuring the accuracy of manipulation tasks. The widely adopted methods for robotic arm calibration typically rely on specialized markers or external sensors to achieve precise measurements. However, these approaches are often expensive and require additional effort, such as the installation and maintenance of auxiliary equipment. Furthermore, many downstream tasks require separate hand-eye calibration steps because of differences between the sensors used for calibration and those used for task execution. Comprehensive calibration of both the robot arm and sensors plays a vital role in optimizing system performance. However, the robot's posture could be constrained due to either the sensor's limited range or textureless scenes when a camera is used. To address these limitations, this study proposes a cost-effective self-calibration method that simultaneously calibrates the robot arm and determines the spatial relationship between the robot and an RGB-D camera, allowing for data collection at multiple locations. The proposed approach leverages recent advancements in machine learning to identify correspondences between images captured at different robot postures, facilitating automatic data selection. Furthermore, the removal of location constraints increases flexibility, enabling the collection of sufficient data as the robot's location changes. The method is evaluated using a Franka Emika Panda robotic arm, and the experimental results demonstrate its effectiveness in achieving accurate calibration without the need for external devices or markers.

Abstract:
Rapid urbanization has increased demand for customized urban mobility, making on-demand services and robo-taxis central to future transportation. The efficiency of these systems hinges on real-time fleet coordination algorithms. This work accelerates the state-of-the-art high-capacity ridepooling framework by identifying its computational bottlenecks and introducing two complementary strategies: (i) a data-driven feasibility predictor that filters low-potential trips, and (ii) a graph-partitioning scheme that enables parallelizable trip generation. Using real-world Manhattan demand data, we show that the acceleration algorithms reduce the optimality gap by up to 27% under real-time constraints and cut empty travel time by up to 5%. These improvements translate into tangible economic and environmental benefits, advancing the scalability of high-capacity robo-taxi operations in dense urban settings.

Abstract:
This paper proposes a new birth model including semantic information derived from deep learning to create an occlusion-aware Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter. Unlike prior approaches that rely on simplistic or uniform assumptions, the proposed Semantic-Occlusion Aware (S-OA) birth model defines initialization terms by explicitly considering regions of occlusion and by leveraging semantic information about the environment. This enables the filter to accurately represent where new objects are more likely to appear, thereby improving tracking performance in complex and high-density driving scenarios. The method is evaluated through Monte Carlo simulations and experiments on the KITTI dataset. Performance is assessed by measuring the latency between first detection and track initiation, along with the mean absolute cardinality error and the Optimal Subpattern Assignment (OSPA) metric. Results demonstrate that the S-OA birth model reduces initialization delay in occlusion-heavy settings, matching or outperforming the strongest baseline in approximately 70% of cases. A sensitivity analysis of birth model weights is also provided. Overall, the findings underscore the benefits of integrating occlusion reasoning and semantic priors into Bayesian tracking frameworks for autonomous driving.
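
For readers unfamiliar with the OSPA metric mentioned above, the following is a minimal sketch of its standard definition for two finite point sets. It is not the paper's evaluation code: the function name `ospa`, the cutoff `c`, the order `p`, and the brute-force assignment search (only practical for small sets) are all illustrative assumptions.

```python
from itertools import permutations
import math

def ospa(X, Y, c=10.0, p=1):
    """OSPA distance between two lists of points (tuples), cutoff c, order p.

    Brute-force search over assignments; suitable for small sets only.
    """
    if len(X) > len(Y):
        X, Y = Y, X  # make X the smaller set
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0  # both sets empty
    # Best assignment of the m points in X to distinct points in Y,
    # with each pairwise distance saturated at the cutoff c.
    best = min(
        sum(min(c, math.dist(x, Y[j])) ** p for x, j in zip(X, perm))
        for perm in permutations(range(n), m)
    )
    # Each unassigned point in Y contributes the full cutoff penalty c**p.
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)

# One correct track plus one spurious track far away.
example = ospa([(0.0, 0.0)], [(0.0, 0.0), (100.0, 100.0)], c=10.0, p=1)
```

The cutoff `c` trades off localization error against cardinality error: a missed or spurious object costs exactly `c`, however far away it is.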

Abstract:
Achieving robust and reliable control in robotic systems is crucial, especially in the presence of model uncertainties. In recent years, the notion of closed-loop sensitivity has emerged as an effective tool for analyzing and quantifying how uncertainties in the model parameters affect the closed-loop system behavior. In particular, several previous works have shown how the sensitivity matrices can be used to map uncertainty ellipsoids in the parameter space to the corresponding state/input ellipsoids (the so-called sensitivity tubes) that can be leveraged to robustify any system constraint. This paper extends these previous works based on an ellipsoidal modeling of the parametric uncertainty by proposing two new approaches that significantly improve the computation of the sensitivity tubes. The first method replaces the ellipsoids with hyperboxes for constructing the tubes: this solution avoids any approximation of the parameter set but yields a non-differentiable formulation of the resulting tube. The second method, instead, leverages superquadrics that can approximate a hyperbox with a tunable precision while retaining differentiability of the resulting tubes (as in the ellipsoid case). Both methods have been validated via a simulation campaign and compared with the previous approaches based on an ellipsoidal modeling of the uncertainties. The results confirm the effectiveness of the proposed techniques in producing state/input tubes that accurately envelop the perturbed system behavior, which is an important asset for providing a robustness layer to any online/offline trajectory generation algorithm.
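
As a rough illustration of the three parameter-set models compared above, the sketch below computes a tube radius for one row of a sensitivity matrix. It assumes a unified p-norm model of the uncertainty set, sum_j |dp_j / d_j|^p <= 1 with half-widths d_j, which is an illustrative simplification and not necessarily the paper's formulation: p = 2 gives an axis-aligned ellipsoid, p = infinity a hyperbox, and a large finite p a superquadric. The names `tube_radius`, `Pi_row`, and `half_widths` are hypothetical.

```python
import numpy as np

def tube_radius(Pi_row, half_widths, p):
    """Maximum of |Pi_row @ dp| over the set sum(|dp_j / d_j|**p) <= 1.

    By Hoelder duality this equals the q-norm of Pi_row * d, with 1/p + 1/q = 1.
    p = 2 -> ellipsoid, p = np.inf -> hyperbox, large finite p -> superquadric.
    """
    scaled = np.abs(np.asarray(Pi_row, float) * np.asarray(half_widths, float))
    if np.isinf(p):
        # Dual of the inf-norm is the 1-norm: a sum of absolute values,
        # which is not differentiable where an entry crosses zero.
        return float(scaled.sum())
    q = p / (p - 1.0)
    return float((scaled ** q).sum() ** (1.0 / q))

Pi_row = [0.5, -2.0, 1.0]   # one row of a (hypothetical) sensitivity matrix
d = [0.1, 0.05, 0.2]        # parameter half-widths

r_ellipsoid = tube_radius(Pi_row, d, 2)
r_superquadric = tube_radius(Pi_row, d, 8)
r_box = tube_radius(Pi_row, d, np.inf)
```

In this sketch all three sets share the same half-widths, so the ellipsoid is inscribed in the box and the superquadric radius moves from the ellipsoid value toward the hyperbox value as `p` grows.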

Abstract:
Sign language is a natural and visual form of language that uses movements and expressions to convey meaning, serving as a crucial means of communication for individuals who are deaf or hard-of-hearing (DHH). However, the number of people proficient in sign language remains limited, highlighting the need for technological advancements to bridge communication gaps and foster interactions with minorities. Based on recent advancements in embodied humanoid robots, we propose SignBot, a novel framework for human-robot sign language interaction. SignBot integrates a cerebellum-inspired motion control component and a cerebral-oriented module for comprehension and interaction. Specifically, SignBot consists of: 1) Motion Retargeting, which converts human sign language datasets into robot-compatible kinematics; 2) Motion Control, which leverages a learning-based paradigm to develop a robust humanoid control policy for tracking sign language gestures; and 3) Generative Interaction, which incorporates a sign language translator, responder, and generator, thereby enabling natural and effective communication between robots and humans. Simulation and real-world experimental results demonstrate that SignBot can effectively facilitate human-robot interaction and perform sign language motions with diverse robots and datasets. SignBot represents a significant advancement in automatic sign language interaction on embodied humanoid robot platforms, providing a promising solution to improve communication accessibility for the DHH community.

Abstract:
Turning garments right-side out is a challenging manipulation task: it is highly dynamic, entails rapid contact changes, and is subject to severe visual occlusion. We introduce Right-Side-Out, a zero-shot sim-to-real framework that effectively solves this challenge by exploiting task structures. We decompose the task into Drag/Fling to create and stabilize an access opening, followed by Insert&Pull to invert the garment. Each step uses a depth-inferred, keypoint-parameterized bimanual primitive that sharply reduces the action space while preserving robustness. Efficient data generation is enabled by our custom-built, high-fidelity, GPU-parallel Material Point Method (MPM) simulator that models thin-shell deformation and provides robust and efficient contact handling for batched rollouts. Built on the simulator, our fully automated pipeline scales data generation by randomizing garment geometry, material parameters, and viewpoints, producing depth, masks, and per-primitive keypoint labels without any human annotations. With a single depth camera, policies trained entirely in simulation deploy zero-shot on real hardware, achieving up to 81.3% success rate. By employing task decomposition and high-fidelity simulation, our framework enables tackling highly dynamic, severely occluded tasks without laborious human demonstrations.

Abstract:
We present a novel method for optimizing the posture of kinematically redundant torque-controlled robots to improve robustness during impacts. A rigid impact model is used as the basis for a configuration-dependent metric that quantifies the variation between pre- and post-impact velocities. By finding configurations (postures) that minimize this metric, spikes in the robot's state and input commands can be significantly reduced during impacts, improving safety and robustness. The problem of identifying impact-robust postures is posed as a min-max optimization of the metric. To overcome the real-time intractability of the problem, we reformulate it as a gradient-based motion task that iteratively guides the robot towards configurations that minimize the proposed metric. This task is embedded within a task-space inverse dynamics (TSID) whole-body controller, enabling seamless integration with other control objectives. The method is applied to a kinematically redundant aerial manipulator performing repeated point contact tasks. We test our method inside a realistic physics simulator and compare it with the nominal TSID. Our method leads to a reduction (up to 51% w.r.t. standard TSID) of post-impact spikes in the robot's configuration and successfully avoids actuator saturation. Moreover, we demonstrate the importance of kinematic redundancy for impact-robustness using additional numerical simulations on a quadruped and a humanoid robot, resulting in up to 45% reduction of post-impact spikes in the robot's state w.r.t. nominal TSID.

Abstract:
Hand tracking plays a key role in capturing and transferring dexterous human manipulation skills to robots. However, achieving reliable tracking across diverse conditions and during complex interactions (e.g., object manipulation) remains challenging. A promising solution is to combine wearable sensors such as IMUs with vision, where previous studies have handled the vision input by attaching markers to wearables or by relying on depth data to avoid the domain gap in color images. In this work, we present a hand tracking framework that fuses inertial measurements with state-of-the-art vision methods, eliminating the need for markers while fully exploiting visual cues. For this, we introduce a dataset generation scheme that produces synthetic and real data for the target glove using a compact setup, without manual annotation. Using the dataset, we train a keypoint detection network, based on a lightweight vision transformer (ViT) for real-time use, that predicts keypoint likelihoods from an image. Based on the network prediction, the IMU-propagated pose is used as a prior in probabilistic inference to estimate the keypoint positions and uncertainties. Tracking primarily relies on high-rate IMU updates for fast motion estimation, while the pose is corrected through factor graph optimization. The framework is validated in challenging scenarios, demonstrating its robustness and accuracy, and can be used for high-quality demonstration data acquisition and teleoperation for dexterous manipulation.

Abstract:
Reach-Avoid-Stay (RAS) tasks are essential in applications where systems must safely reach a target set and remain within it under all bounded disturbances. Existing approaches either struggle to compute the maximal robust RAS set (the set of all states from which the RAS task is achievable) or are limited in handling general dynamic systems. To address these challenges, this paper proposes a two-step deep reinforcement learning framework that jointly learns the maximal robust RAS set and the corresponding control policy. The first step identifies the maximal robust control-invariant set within the target set and derives a policy that ensures the system remains within it. The second step computes the maximal robust reach-avoid (RA) set using this invariant set as the target, and it is proven that this RA set is equivalent to the maximal robust RAS set. Leveraging this result, a switching policy is constructed from the two step-wise policies, which constitutes a valid policy guaranteeing completion of the RAS task. Simulation results demonstrate that the proposed framework (1) computes the exact maximal robust RAS set in the absence of training errors, yielding the least restrictive RAS policy, and (2) identifies the RAS set with high accuracy while outperforming baseline methods on RAS tasks.
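
The switching policy described above can be pictured with the toy sketch below. It is only an assumption-laden illustration: the membership test `in_invariant_set` and the two controllers are hand-written stand-ins for the learned invariant set and step-wise policies, the 1-D system x_{k+1} = x_k + u has no disturbances or obstacles, and all names are hypothetical.

```python
def make_switching_policy(in_invariant_set, stay_policy, reach_avoid_policy):
    """Use the stay policy inside the invariant set, the reach-avoid policy outside."""
    def policy(state):
        if in_invariant_set(state):
            return stay_policy(state)
        return reach_avoid_policy(state)
    return policy

# Toy 1-D system x_{k+1} = x_k + u with invariant set |x| <= 0.5 (assumed).
in_set = lambda x: abs(x) <= 0.5
stay = lambda x: -x                        # step 1 stand-in: hold the state at the origin
reach = lambda x: -0.4 if x > 0 else 0.4   # step 2 stand-in: move toward the set

policy = make_switching_policy(in_set, stay, reach)

x = 2.0
trajectory = [x]
for _ in range(10):
    x = x + policy(x)
    trajectory.append(x)
```

Once the state enters the invariant set, the policy switches to the stay controller and the rest of the trajectory remains inside the set.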

Abstract:
Autonomous navigation in complex dynamic environments remains a fundamental challenge in robotics, and many reinforcement learning (RL) algorithms have demonstrated promising results, especially the on-policy ones. However, their inherent sample inefficiency remains a fundamental problem. Methods integrating off-policy approaches into on-policy frameworks have been proposed to improve the sample efficiency by focusing on imitating the agent's past exemplary experiences while discarding less optimal ones. However, these methods overlook the valuable insights embedded within failures. Although some research has begun to explore learning from failures, it is usually done at a point-by-point level, ignoring the rich sequence context inherent in the trajectory. In this paper, we introduce DFPS-Nav, a training framework that utilizes Failure-Aversion Learning (FAL) to perform segmented, trend-based credit assignment, identifying both failure-inducing actions and valuable recovery behaviors within failed trajectories. We further improve successful imitation by adopting Prioritized Self-Imitation Learning (PSIL), which scores trajectories and prioritizes high-quality behaviors so that successful behaviors are reliably reproduced. Extensive simulation and real-world experiments demonstrate that, by using both FAL and PSIL to extract and refine information from the sequential context within trajectories, DFPS-Nav achieves up to 29.5% and 27% higher success rates in static and dynamic environments, respectively, compared to the strong baseline method, and is successfully applied in the real world. This work underscores how systematically deconstructing failures while prioritizing successes leads to more efficient and robust autonomous navigation.

Abstract:
In this work, we propose Efficient Collision-Aware Hierarchical Diffusion Navigation (ECAHD), a hierarchical diffusion-based framework designed for both safety and computational efficiency. ECAHD generates a sparse trajectory for global path planning and a dense trajectory for local path refinement. The robot follows a rapidly sampled sparse global trajectory, and when a potential collision is detected, a collision-aware guidance diffusion mechanism, which accounts for the robot's shape, adjusts the local trajectory accordingly. Conventional full-sequence diffusion planners suffer from slow sampling speeds and performance degradation when collision-aware guidance is applied across the entire trajectory. ECAHD addresses these issues by significantly reducing the number of waypoints predicted by the global diffusion planner, while delegating robot-shape-aware collision guidance to the local diffusion planner. This separation not only accelerates planning but also preserves global trajectory quality, as goal-conditioned sampling is no longer disrupted by collision-related constraints. Furthermore, ECAHD allows for increasing the number of global trajectory samples to enhance performance, without incurring substantial computational overhead. In maze2d-large planning tests, ECAHD improved success rates by approximately 1.3% while reducing collision rates by more than 50%, all while cutting inference time by nearly half.

Abstract:
3D Gaussian Splatting (3DGS) has garnered significant attention in Simultaneous Localization and Mapping (SLAM) due to its advances in capturing fine-grained geometry features and synthesizing novel views. For SLAM in large-scale scenes, such as autonomous driving, 3DGS-SLAM faces a critical limitation: memory consumption increases continuously over time as Gaussian points accumulate, leading to poor memory efficiency and limiting its applicability. In this work, we propose a rendering-area-aware pruning strategy that selectively removes Gaussians based on their contribution to the effective rendering area, rather than solely relying on Gaussian-level heuristics (e.g., opacity or gradient magnitude). This perspective directly targets the sources of memory redundancy, effectively reducing the peak memory footprint of 3DGS-SLAM during runtime. Evaluations on the EuRoC and KITTI datasets demonstrate that our method consistently outperforms existing pruning approaches in large-scale outdoor scenes, achieving over 60% memory reduction and more than 2× FPS improvement while preserving localization and mapping accuracy. These results highlight rendering-area-aware pruning as a promising direction for scaling 3DGS-SLAM to real-world autonomous driving scenarios. Our code is publicly available at https://github.com/UMN-ZhaoLab/Pocket-SLAM.git.

Abstract:
This paper presents an Open-Vocabulary Online 3D semantic mapping pipeline, which we denote by its acronym, OVO. Given a sequence of posed RGB-D frames, we detect and track 3D segments, which we describe using CLIP vectors. These are computed from the viewpoints where the segments are observed, using a novel CLIP merging method. Notably, OVO has a significantly lower computational and memory footprint than offline baselines, while also showing better segmentation metrics than both offline and online ones. Along with superior segmentation performance, we also show experimental results of our mapping contributions integrated with two different full SLAM backbones (Gaussian-SLAM and ORB-SLAM2), being the first to use a neural network to merge CLIP descriptors and to demonstrate end-to-end open-vocabulary online 3D mapping with loop closure.

Abstract:
LiDAR-camera extrinsic calibration (LCEC) is crucial for multi-modal data fusion in autonomous robotic systems. Existing methods, whether target-based or target-free, typically rely on customized calibration targets or fixed scene types, which limit their applicability in real-world scenarios. To address these challenges, we present EdO-LCEC, the first environment-driven online calibration approach. Unlike traditional target-free methods, EdO-LCEC employs a generalizable scene discriminator to estimate the feature density of the application environment. Guided by this feature density, EdO-LCEC extracts LiDAR intensity and depth features from varying perspectives to achieve higher calibration accuracy. To overcome the challenges of cross-modal feature matching between LiDAR and camera, we introduce dual-path correspondence matching (DPCM), which leverages both structural and textural consistency for reliable 3D-2D correspondences. Furthermore, we formulate the calibration process as a joint optimization problem that integrates global constraints across multiple views and scenes, thereby enhancing overall accuracy. Extensive experiments on real-world datasets demonstrate that EdO-LCEC outperforms state-of-the-art methods, particularly in scenarios involving sparse point clouds or partially overlapping sensor views.

Abstract:
Visual-inertial SLAM systems often fail in feature-poor environments such as corridors and textureless walls, leading to catastrophic tracking loss. Existing methods detect degradation reactively after failure occurs, leaving little opportunity for corrective action. We propose a proactive framework that predicts feature degradation 1-2 seconds in advance and adapts sensor fusion weights through uncertainty-guided decisions. Through a systematic comparison of eight temporal architectures across 15,233 sequences, including real robot data, we identify LSTM as the most robust predictor (26.77 MAE). We incorporate uncertainty estimation using Monte Carlo Dropout to enable confidence-aware adaptation thresholds that prevent false adjustments. Our approach provides a foundation for proactive SLAM failure prevention through principled sensor fusion and real-time system adaptation.

Abstract:
Hyperspectral imaging is an advanced technique for precisely identifying and analyzing materials or objects. However, its integration with robotic grasping systems has so far been rarely explored due to deployment complexities and prohibitive costs. In this paper, we introduce a novel hyperspectral imaging-guided robotic grasping system. The system consists of PRISM (Polyhedral Reflective Imaging Scanning Mechanism) and the SpectralGrasp framework. PRISM is designed to enable high-precision, distortion-free hyperspectral imaging while simplifying system integration and reducing costs. SpectralGrasp generates robotic grasping strategies by effectively leveraging both the spatial and spectral information from hyperspectral images. The proposed system demonstrates substantial improvements both in textile recognition compared to human performance and in sorting success rate compared to RGB-based methods. Additionally, a series of comparative experiments further validates the effectiveness of our system. The study highlights the potential benefits of integrating hyperspectral imaging with robotic grasping systems, showcasing enhanced recognition and grasping capabilities in complex and dynamic environments. The project is available at: https://zainzh.github.io/PRISM.

Abstract:
This paper proposes a simple yet effective markerless hand-eye calibration method that achieves low cost, high accuracy, and strong generalization across different types of robots. The method utilizes a circular flange, a standardized structure in industrial robots, for calibration via the perspective-n-point (PnP) algorithm, achieving superior performance with a simpler pipeline. The entire system is built using mature, off-the-shelf components, avoiding complex architectures. By combining a lightweight object detection network (e.g., Faster R-CNN) with classical geometric techniques, we construct a flange detector that is both accurate and robust. The training process requires no manual annotations, and the resulting model generalizes well across various robot platforms. Experiments demonstrate that our method achieves higher calibration accuracy than more complex existing approaches. Notably, the method maintains consistent precision even when applied to previously unseen robots. Code and pre-trained models will be made available.

Abstract:
Understanding the surrounding scene geometrically and semantically is a key requirement for autonomously navigating systems. Vision-based 3D panoptic occupancy prediction aims to provide a 3D representation of the surroundings, including semantic meaning and the identification of individual objects, such as traffic participants in the context of urban navigation. The majority of vision-based approaches to occupancy prediction require 3D voxel labels or segmented LiDAR scans as a supervision signal. While other vision-based approaches use only a few consecutive images for supervision, these approaches typically do not provide instance-level information, which is crucial for achieving a holistic understanding of the scene. In this paper, we propose a novel method for 3D panoptic occupancy prediction that relies solely on image data for both training and inference. We use bundle adjustment to align all available images in the training set to obtain depth information. We further use a pre-trained open-vocabulary image model to obtain panoptic segmentation of the RGB images and generate occupancy pseudo labels to directly optimize for the 3D panoptic occupancy prediction task. Furthermore, we use a 3D foundation model to obtain depth predictions for individual images to add dynamic objects into the pseudo labels. Without any manual or LiDAR-based annotations, our approach outputs occupancy, semantic class, and instance ID for each 3D voxel in the full voxel grid. We achieve state-of-the-art results on 3D semantic occupancy prediction among label-free methods, and we propose the first method for 3D panoptic occupancy without any LiDAR supervision.

Abstract:
Event cameras offer high-temporal-resolution sensing that remains reliable under high-speed motion and challenging lighting, making them promising for localization from LiDAR point clouds in GPS-denied and visually degraded environments. However, aligning sparse, asynchronous events with dense LiDAR maps is fundamentally ill-posed, as direct correspondence estimation suffers from modality gaps. We propose LEAR, a dual-task learning framework that jointly estimates edge structures and dense event-depth flow fields to bridge the sensing-modality divide. Instead of treating edges as a post-hoc aid, LEAR couples them with flow estimation through a cross-modal fusion mechanism that injects modality-invariant geometric cues into the motion representation, and an iterative refinement strategy that enforces mutual consistency between the two tasks over multiple update steps. This synergy produces edge-aware, depth-aligned flow fields that enable more robust and accurate pose recovery via Perspective-n-Point (PnP) solvers. On several popular and challenging datasets, LEAR achieves superior performance over the best prior method. The source code, trained models, and demo videos are made publicly available online.

Abstract:
Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting the applicability for closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe efficiently produces robust, high-fidelity representations that generalize reliably to challenging unseen viewpoints (e.g., lane change). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.

Abstract:
Human-like motion generation for robots often draws inspiration from biomechanical studies, which categorize complex human motions into hierarchical taxonomies. While these taxonomies provide rich structural information about how movements relate to one another, this information is frequently overlooked in motion generation models, leading to a disconnect between the generated motions and their underlying hierarchical structure. This paper introduces the Gaussian Process Hyperbolic Dynamical Model (GPHDM), a novel approach that learns latent representations preserving both the hierarchical structure of motions and their temporal dynamics to ensure physical consistency. Our model achieves this by extending the dynamics prior of the Gaussian Process Dynamical Model (GPDM) to the hyperbolic manifold and integrating it with taxonomy-aware inductive biases. Building on this geometry- and taxonomy-aware framework, we propose three novel mechanisms for generating motions that are both taxonomically structured and physically consistent: two probabilistic recursive approaches and a method based on pullback-metric geodesics. Experiments on generating realistic motion sequences on the hand grasping taxonomy show that the proposed GPHDM faithfully encodes the underlying taxonomy and temporal dynamics, and that it generates novel, physically consistent trajectories.

Abstract:
Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years. However, the efficient training of collective robots using multi-agent RL and the transfer of learned policies to real-world applications remain open research questions. In this work, we first develop a comprehensive robotic system comprising simulation, a distributed learning framework, and physical robot components. We then propose and evaluate reinforcement learning techniques designed for efficient training of cooperative and competitive policies on this platform. To address the challenges of multi-agent sim-to-real transfer, we introduce Out-of-Distribution State Initialization (OODSI) to mitigate the impact of the sim-to-real gap. In the experiments, OODSI improves the Sim2Real performance by 20%. We demonstrate the effectiveness of our approach through experiments with a multi-robot car competitive game and a cooperative task in real-world settings.

Abstract:
Crowd-sourced mapping offers a scalable alternative to creating maps using traditional survey vehicles. Yet, existing methods either rely on prior high-definition (HD) maps or neglect uncertainties in the map fusion. In this work, we present a complete pipeline for HD map generation using production vehicles equipped only with a monocular camera, consumer-grade GNSS, and IMU. Our approach includes on-cloud localization using lightweight standard-definition maps, on-vehicle mapping via an extended object trajectory (EOT) Poisson multi-Bernoulli (PMB) filter with Gibbs sampling, and on-cloud multi-drive optimization and Bayesian map fusion. We represent the lane lines using B-splines, where each B-spline is parameterized by a sequence of Gaussian distributed control points, and propose a novel Bayesian fusion framework for B-spline trajectories with differing density representation, enabling principled handling of uncertainties. We evaluate our proposed approach, B^2F-Map, on large-scale real-world datasets collected across diverse driving conditions and demonstrate that our method is able to produce geometrically consistent lane-level maps.

Abstract:
Lower extremity exoskeletons designed for multi-joint assistance are increasingly explored for rehabilitation and human augmentation. However, conventional monoarticular designs often suffer from joint misalignment and actuator redundancy, limiting their efficiency and user comfort. This study presents a biarticular rigid powered lower extremity exoskeleton that simultaneously assists the knee and ankle joints through a single actuator, enabling coordinated torque generation across adjacent joints. A hierarchical control framework combining gait segmentation, impedance-based torque generation, and gravity/friction compensation is implemented to provide phase-specific assistance. Experimental results show that the proposed exoskeleton reduces gastrocnemius activation by up to 63.4% and metabolic cost by up to 11.6% during stair ascent, with corresponding reductions of 28.3% and 8.2% during level walking. These findings demonstrate the effectiveness of the biarticular and underactuated structure in enhancing locomotor efficiency, highlighting its potential as a compact and practical solution for dynamic and diverse mobility scenarios.

Abstract:
Superpoint-based pipelines provide an efficient alternative to point- or voxel-based 3D semantic segmentation, but are often bottlenecked by their CPU-bound partition step. We propose a learnable, fully GPU-based partitioning algorithm that generates geometrically and semantically coherent superpoints 13× faster than prior methods. Our module is compact (under 60k parameters), trains in under 20 minutes with a differentiable surrogate loss, and requires no handcrafted features. Combined with a lightweight superpoint classifier, the full pipeline fits in < 2 MB of VRAM, scales to multi-million-point scenes, and supports real-time inference. With 72× faster inference and 120× fewer parameters, EZ-SP matches the accuracy of point-based SOTA models across three domains: indoor scans (S3DIS), autonomous driving (KITTI-360), and aerial LiDAR (DALES). Code and pretrained models are accessible at github.com/drprojects/superpoint_transformer.

Abstract:
Vision-and-Language Navigation (VLN) tasks agents with locating specific objects in unseen environments using natural language instructions and visual cues. Many existing VLN approaches follow an 'observe-and-reason' schema, that is, agents observe the environment and decide on the next action to take based on the visual observations of their surroundings. They often face challenges in long-horizon scenarios due to limitations in immediate observation and vision-language modality gaps. To overcome this, we present VISTA, a novel framework that employs an 'imagine-and-align' navigation strategy. Specifically, we leverage the generative prior of pre-trained diffusion models for dynamic visual imagination conditioned on both local observations and high-level language instructions. A Perceptual Alignment Filter module then grounds these goal imaginations against current observations, guiding an interpretable and structured reasoning process for action selection. Experiments show that VISTA sets new state-of-the-art results on Room-to-Room (R2R) and RoboTHOR benchmarks, e.g., +3.6% increase in Success Rate on R2R. Extensive ablation analysis underscores the value of integrating forward-looking imagination, perceptual alignment, and structured reasoning for robust navigation in long-horizon environments.

Abstract:
This paper introduces AROSpect, a framework for timing introspection and controlled experimentation for ROS 2-based applications. AROSpect enables developers to model system components using standardized templates, inject synthetic delays, and measure end-to-end latencies across message paths. Through instrumentation, the framework supports iterative refinement of timing parameters and identification of misconfigurations. A case study using a multi-agent turtlesim system demonstrates how AROSpect can guide developers to understand the effects of adapting timing parameters, contributing toward more predictable robotic systems.

Abstract:
In this paper, we use asymmetric vibrations to demonstrate two degree-of-freedom (DoF) in-hand manipulation of grasped parts. The asymmetric vibrations are achieved through closed-loop position control of a moving surface, which applies a periodic stick-slip waveform to the part to be manipulated. We show analytically how two vibratory waveform parameters, the sticking acceleration and the slipping acceleration, affect average part velocity when moving against gravity. The theoretical trends are then validated using an experimental setup where the squeeze force is controlled and part motion is recorded by a high-resolution encoder. We also develop a 2-DoF vibratory surface capable of translation in one direction and rotation about the surface normal. Using two of these 2-DoF surfaces in a parallel jaw gripper configuration, we bidirectionally translate and rotate a variety of grasped parts, as well as demonstrate that the same waveform trends for translation also persist for in-plane rotation.

Abstract:
Robotic affordance estimation is challenging due to visual, geometric, and semantic ambiguities in sensory input. We propose a method that disambiguates these signals using two coupled recursive estimators for sub-aspects of affordances: graspable and movable regions. Each estimator encodes property-specific regularities to reduce uncertainty, while their coupling enables bidirectional information exchange that focuses attention on regions where both agree, i.e., affordances. Evaluated on a real-world dataset, our method outperforms three recent affordance estimators (Where2Act, Hands-as-Probes, and HRP) by 308%, 245%, and 257% in precision, and remains robust under challenging conditions such as low light or cluttered environments. Furthermore, our method achieves a 70% success rate in our real-world evaluation. These results demonstrate that coupling complementary estimators yields precise, robust, and embodiment-appropriate affordance predictions.

Abstract:
Contact-rich manipulation depends on applying the correct grasp forces throughout the manipulation task, especially when handling fragile or deformable objects. Most existing imitation learning approaches treat visuotactile feedback only as an additional observation, leaving applied forces as an uncontrolled consequence of gripper commands. In this work, we present Force-Aware Robotic Manipulation (FARM), an imitation learning framework that integrates high-dimensional tactile data to infer tactile-conditioned force signals, which in turn define a matching force-based action space. We collect human demonstrations using a modified version of the hand-held Universal Manipulation Interface (UMI) gripper that integrates a GelSight Mini visual tactile sensor. For deploying the learned policies, we developed an actuated variant of the UMI gripper with geometry matching our hand-held version. During policy rollouts, the proposed FARM diffusion policy jointly predicts robot pose, grip width, and grip force. FARM outperforms several baselines across three tasks with distinct force requirements (high-force, low-force, and dynamic force adaptation), demonstrating the advantages of its two key components: leveraging force-grounded, high-dimensional tactile observations and a force-based control space. The codebase and design files are open-sourced and available at https://tactile-farm.github.io.

Abstract:
We present a rolling and jumping underactuated monopedal robot designed to explore multimodal locomotion on low-gravity bodies. It uses only two reaction wheels to control its spatial orientation with two controllers: a balancing controller which can aim the robot's jump direction on the ground, and an aerial reorientation controller which can aim the robot's leg for landing after flight. We demonstrate rolling, targeted jumping and landing, and self-righting using only three actuators total, keeping system size to 0.33 m and 1.25 kg. Simple switching between locomotion modes enables the system to deal with differing landscapes and environmental conditions.

Abstract:
Autonomous driving systems demand trajectory planners that not only model the inherent uncertainty of future motions but also respect complex temporal dependencies and underlying physical laws. While diffusion-based generative models excel at capturing multi-modal distributions, they often fail to incorporate long-term sequential contexts and domain-specific physical priors. In this work, we bridge these gaps with two key innovations. First, we introduce a Diffusion Mamba Transformer architecture that embeds Mamba and attention into the diffusion process, enabling more effective aggregation of sequential input contexts from sensor streams and past motion histories. Second, we design a Port-Hamiltonian Neural Network module that seamlessly integrates energy-based physical constraints into the diffusion model, thereby enhancing trajectory predictions with both consistency and interpretability. Extensive evaluations on standard autonomous driving benchmarks demonstrate that our unified framework significantly outperforms state-of-the-art baselines in predictive accuracy, physical plausibility, and robustness, thereby advancing safe and reliable motion planning.

Abstract:
Accurate and adaptive dynamic models are critical for underwater vehicle-manipulator systems where hydrodynamic effects induce time-varying parameters. This paper introduces a novel uncertainty-aware adaptive dynamics model framework that remains linear in lumped vehicle and manipulator parameters, and embeds convex physical consistency constraints during online estimation. Moving horizon estimation is used to stack horizon regressors, enforce realizable inertia, damping, friction, and hydrostatics, and quantify uncertainty from parameter evolution. Experiments on a BlueROV2 Heavy with a 4-DOF manipulator demonstrate rapid convergence and calibrated predictions. Manipulator fits achieve R^2 = 0.88 to 0.98 with slopes near unity, while vehicle surge, heave, and roll are reproduced with good fidelity under stronger coupling and noise. Median solver time is approximately 0.023 s per update, confirming online feasibility. A comparison against a fixed parameter model shows consistent reductions in MAE and RMSE across degrees of freedom. Results indicate physically plausible parameters and confidence intervals with near 100% coverage, enabling reliable feedforward control and simulation in underwater environments.

Abstract:
Closed-loop planning in complex, real-world driving scenarios presents a critical challenge for autonomous driving systems. While traditional rule-based methods are interpretable, their predefined heuristics lack the adaptability for dynamic traffic environments. Conversely, learning-based approaches, despite their promise, struggle to balance the modeling of diverse and multimodal driving behaviors with real-time planning, often leading to indecisive or unsafe actions. To address this limitation, we propose ConsistencyPlanner, a real-time planning framework with fast-sampling consistency models. Our approach is built upon two key technical contributions. Efficient Multimodal Sampling: We employ fast-sampling consistency models to generate a diverse set of plausible future trajectories. This enables efficient, real-time exploration of multimodal actions, overcoming the computational bottlenecks of previous iterative generative methods. Heterogeneous Feature Fusion: We introduce an attention-enhanced decoder that dynamically integrates heterogeneous input features, including scene features and action tokens, into a cohesive representation for robust planning. Extensive evaluation in the Waymax simulator demonstrates superior performance in safety metrics compared to existing methods, with particularly strong results in challenging dynamic scenarios.

Abstract:
The promise of model-predictive control (MPC) in robotics has led to extensive development of efficient numerical optimal control solvers in line with differential dynamic programming because it exploits the sparsity induced by time. In this work, we argue that this effervescence has hidden the fact that sparsity can be equally exploited by standard nonlinear optimization. In particular, we show how a tailored implementation of sequential quadratic programming (SQP) achieves state-of-the-art MPC. Then, we clarify the connections between popular algorithms from the robotics community and well-established optimization techniques. Further, the sequential quadratic program formulation naturally encompasses the constrained case, a notoriously difficult problem in the robotics community. Specifically, we show that it only requires a sparsity-exploiting implementation of a state-of-the-art quadratic programming (QP) solver. We illustrate the validity of this approach in a comparative study and experiments on a torque-controlled manipulator. To the best of our knowledge, this is the first demonstration of closed-loop nonlinear MPC with constraints on a real robot.
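
As a rough illustration of the sequential quadratic programming idea named in this abstract, the sketch below runs a textbook equality-constrained SQP iteration on a toy problem (project a point onto the unit circle). The function name `sqp_equality`, the toy cost, and the dense KKT solve are assumptions made for illustration; the abstract's sparsity-exploiting, MPC-specific implementation is not reproduced here.

```python
import numpy as np

def sqp_equality(p, x0, iters=20):
    """Toy SQP: minimize ||x - p||^2 subject to ||x||^2 = 1 (hypothetical example)."""
    p = np.asarray(p, dtype=float)
    x = np.asarray(x0, dtype=float)
    lam = 0.0
    n = x.size
    for _ in range(iters):
        g = 2.0 * (x - p)                    # gradient of the cost
        c = x @ x - 1.0                      # constraint residual
        A = 2.0 * x                          # constraint Jacobian (single row)
        H = 2.0 * (1.0 + lam) * np.eye(n)    # Hessian of the Lagrangian
        # KKT system of the QP subproblem:
        #   H dx + A^T lam_new = -g,   A dx = -c
        K = np.zeros((n + 1, n + 1))
        K[:n, :n] = H
        K[:n, n] = A
        K[n, :n] = A
        rhs = np.concatenate([-g, [-c]])
        sol = np.linalg.solve(K, rhs)
        x = x + sol[:n]                      # full step, no line search
        lam = sol[n]
    return x, lam

p = np.array([1.0, 2.0])
x_star, lam_star = sqp_equality(p, x0=[1.0, 0.0])
```

For this toy problem the iterates should approach `p / ||p||`, the closest point on the unit circle. A practical solver would add a globalization strategy (line search or trust region) and inequality handling, which this sketch omits.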

Abstract:
In cluttered, unknown, and partially observable environments, Unmanned Aerial Vehicle (UAV) navigation encounters formidable challenges. To address these challenges, we propose an innovative spatio-temporal attention fusion navigation framework called STAF-Navi. The framework integrates spatio-temporal attention mechanisms to model sequential dependencies. It captures spatial and temporal correlations from historical observations and actions to improve navigation and obstacle avoidance. STAF-Navi employs deep collision encoding to compress high-dimensional depth images into informative low-dimensional latent states, and a single-site Transformer to model historical sensor inputs and states, enhancing the utility of current observations. By exploiting temporal dependencies, this integration enables early braking and stable hovering. Extensive simulation experiments show that the framework increases the navigation success rate by 10% and improves path efficiency by 7%. Finally, the successful deployment of the proposed strategy in real-world scenarios validates its effectiveness.

Abstract:
This paper presents a task-space admittance controller applicable to redundant manipulators equipped with torque sensors. It extends Kikuuwe's (2019) torque-bounded admittance controller (TBAC), which allows for imposing explicit limits on the joint actuator torques without causing unsafe behaviors such as oscillation and overshoots. The proposed controller enforces the end-effector to follow predefined task-space dynamics as long as the joint torques are unsaturated and the configuration is away from singularities. The behavior in the nullspace, which arises from the redundant degrees of freedom and singular configurations, is governed by predefined joint-space dynamics. The task-space and joint-space dynamics are combined through a newly proposed continualized pseudoinverse, which employs the singular value decomposition. Results of experiments using a seven-degree-of-freedom Kinova Gen3 robot illustrate the validity of the proposed admittance controller in various scenarios, including the case where the robot is fully stretched.

Abstract:
Robotic-assisted minimally invasive surgery (RAMIS) requires accurate enforcement of the remote center of motion (RCM) constraint to ensure safe tool motion through a trocar. Existing virtual RCM controllers are commonly formulated either at the kinematic level or as task-space objectives, which makes torque-level enforcement under trocar motion and physical interaction difficult to formulate consistently. This paper models the RCM as a rheonomic holonomic constraint and incorporates it into a projection-based inverse-dynamics controller with explicit constrained/free-motion torque decomposition. The resulting formulation unifies kinematic RCM enforcement and task-space tracking at the torque level, while preserving a constraint-consistent structure for residual regulation and null-space compliance. The proposed controller is validated in simulation and on a RAMIS training platform against representative projection-based and constrained-dynamics baselines. Across spiral tracking, varying insertion depth, moving trocar conditions, and human interaction, the method achieves lower RCM residuals and smoother torque profiles while maintaining accurate tool-tip tracking. These results support the use of constraint-consistent torque control for reliable virtual RCM enforcement in surgical robotics. The project page is available at https://rcmpc-cube.github.io.

Abstract:
Dexterous manipulation requires precise geometric reasoning, yet existing visuo-tactile learning methods struggle with sub-millimeter precision tasks that are routine for traditional model-based approaches. We identify a key limitation: while tactile sensors provide rich contact information, current learning frameworks fail to effectively leverage both the perceptual richness of tactile signals and their spatial relationship with hand kinematics. We believe an ideal tactile representation should explicitly ground contact measurements in a stable reference frame while preserving detailed sensory information, enabling policies to not only detect contact occurrence but also precisely infer object geometry in the hand's coordinate system. We introduce SaTA (Spatially-anchored Tactile Awareness for dexterous manipulation), an end-to-end policy framework that explicitly anchors tactile features to the hand's kinematic frame through forward kinematics, enabling accurate geometric reasoning without requiring object models or explicit pose estimation. Our key insight is that spatially-grounded tactile representations allow policies to not only detect contact occurrence but also precisely infer object geometry in the hand's coordinate system. We validate SaTA on challenging dexterous manipulation tasks, including bimanual USB-C mating in free space (a task demanding sub-millimeter alignment precision), light bulb installation requiring precise thread engagement and rotational control, and card sliding that demands delicate force modulation and angular precision. These tasks represent significant challenges for learning-based methods due to their stringent precision requirements. Across multiple benchmarks, SaTA significantly outperforms strong visuo-tactile baselines, improving success rates by up to 30% while reducing task completion times by 27%.
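
To make the idea of anchoring a tactile measurement to the hand frame through forward kinematics concrete, here is a minimal sketch for a planar finger. The two-link geometry, the function names, and the 2D simplification are all assumptions for illustration and are not taken from SaTA itself.

```python
import numpy as np

def link_transform(theta, length):
    """Homogeneous 2D transform: rotate by theta, then move along the rotated link."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, length * c],
                     [s,  c, length * s],
                     [0.0, 0.0, 1.0]])

def contact_in_hand_frame(joint_angles, link_lengths, contact_in_tip):
    """Map a contact point from the fingertip sensor frame to the hand base frame."""
    T = np.eye(3)
    for theta, length in zip(joint_angles, link_lengths):
        T = T @ link_transform(theta, length)    # chain the joint transforms
    point = np.array([contact_in_tip[0], contact_in_tip[1], 1.0])
    return (T @ point)[:2]

# Hypothetical finger: two unit-length links, second joint bent by 90 degrees,
# contact detected 0.5 units ahead of the fingertip frame origin.
anchored = contact_in_hand_frame([0.0, np.pi / 2], [1.0, 1.0], (0.5, 0.0))
```

The same contact reading thus maps to different hand-frame positions as the joints move, which is the spatial information a policy would lose if it received the raw tactile signal alone.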

Abstract:
Vine robots are thin-walled, tubular, pneumatic soft robots that lengthen at their tips to navigate constrained and complex environments. Previous studies have already explored the mechanics of vine robot bodies and investigated applications for which the device is well-suited. However, these studies almost exclusively focus on eversion rates in the quasi-static regime, overlooking other potential applications of high-speed vine robots in medical devices, projectile launchers, or for informing biology. To better understand this rapid behavior, we present a dynamic growth model for high-velocity vine robot body extension with a payload mass and verify the model experimentally. To the best of the authors' knowledge, this is the first instance of vine robots utilized for projectile launching. We find three key results: i) vine robot bodies experience rate-dependent damping that is scale-dependent and monotonically increases with increasing wall thickness; ii) steady-state velocity, or the upper limit of speed in terms of growth velocity, monotonically increases with isometric scaling; and iii) efficiency increases non-linearly with decreasing wall thickness. These findings are used to inform the preliminary design of a large-scale, drug delivery device proof-of-concept, as well as design the fastest-on-record vine, capable of 60 m/s eversion. Our work provides a basic understanding of the dynamic movement of vine robots and opens the door to new areas of application.

Abstract:
Proprioception is essential for coordinating human movements and enhancing the performance of assistive robotic devices. Skin stretch feedback, which closely aligns with natural proprioception mechanisms, presents a promising method for conveying proprioceptive information. To better understand the impact of interference on skin stretch perception, we conducted a user study with 30 participants that evaluated the effect of two simultaneous skin stretches on user perception. We observed that when participants experience simultaneous skin stretch stimuli, a masking effect occurs which deteriorates perception performance in the collocated skin stretch configurations. However, the perceived workload stays the same. These findings show that interference can affect the perception of skin stretch such that multi-channel skin stretch feedback designs should avoid locating modules in close proximity.

Abstract:
Visual Place Recognition (VPR) is fundamental for the global re-localization of robots and devices, enabling them to recognize previously visited locations based on visual inputs. This capability is crucial for maintaining accurate mapping and localization over large areas. Given that VPR methods need to operate in real-time on embedded systems, it is critical to optimize these systems for minimal resource consumption. While the most efficient VPR approaches employ standard convolutional backbones with fixed descriptor dimensions, these often lead to redundancy in the embedding space as well as in the network architecture. Our work introduces a novel structured pruning method that not only streamlines common VPR architectures but also strategically removes redundancies within the feature embedding space. This dual focus significantly enhances the efficiency of the system, reducing both map and model memory requirements and decreasing feature extraction and retrieval latencies. Our approach reduces memory usage and latency by 21% and 16%, respectively, across models, while lowering recall@1 accuracy by less than 1%. This significant improvement enhances real-time applications on edge devices with negligible accuracy loss.

Abstract:
Accurate time synchronization between heterogeneous sensors is crucial for ensuring robust state estimation in multi-sensor fusion systems. Sensor delays often cause discrepancies between the actual time when the event was captured and the time of sensor measurement, leading to temporal misalignment (time offset) between sensor measurement streams. In this paper, we propose an extended Kalman filter (EKF)-based radar-inertial odometry (RIO) framework that estimates the time offset online. The radar ego-velocity measurement model, derived from a single radar scan, is formulated to incorporate the time offset into the update. By leveraging temporal calibration, the proposed RIO enables accurate propagation and measurement updates based on a common time stream. Experiments on both simulated and real-world datasets demonstrate the accurate time offset estimation of the proposed method and its impact on RIO performance, validating the importance of sensor time synchronization. Our implementation of the EKF-RIO with online temporal calibration is available at https://github.com/spearwin/EKF-RIO-TC.
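
The radar ego-velocity measurement mentioned in this abstract is commonly obtained from a single scan by least squares over the Doppler returns of static targets. The sketch below shows that step only, under the assumed sign convention that a static target along unit direction `d` returns Doppler velocity `-d · v`; the function name is hypothetical, and the time-offset state and EKF update from the abstract are not modeled.

```python
import numpy as np

def ego_velocity_from_scan(directions, doppler):
    """Least-squares radar velocity from one scan of static targets.

    directions: (N, 3) vectors from the radar to each target (any length).
    doppler:    (N,) measured radial velocities, assumed equal to -d_i . v.
    """
    D = np.asarray(directions, dtype=float)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)   # normalize to unit directions
    v, *_ = np.linalg.lstsq(D, -np.asarray(doppler, dtype=float), rcond=None)
    return v

# Synthetic, noise-free check with a made-up velocity.
v_true = np.array([1.0, 0.5, -0.2])
dirs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 1.0, 1.0]])
unit_dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
doppler_meas = -(unit_dirs @ v_true)
v_est = ego_velocity_from_scan(dirs, doppler_meas)
```

With real data, moving targets act as outliers, so a robust wrapper such as RANSAC is typically placed around this solve.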

Abstract:
Grasping is a fundamental skill for interacting with and manipulating objects in the environment. However, this ability can be challenging for individuals with hand impairments. Soft hand exoskeletons designed to assist grasping can enhance or restore essential hand functions, yet controlling these soft exoskeletons to support users effectively remains difficult due to the complexity of understanding the environment. This study presents a vision-based predictive control framework that leverages contextual awareness from depth perception to predict the grasping target and determine the next control state for activation. Unlike data-driven approaches that require extensive labelled datasets and struggle with generalizability, our method is grounded in geometric modelling, enabling robust adaptation across diverse grasping scenarios. The Grasping Ability Score (GAS) was used to evaluate performance, with our system achieving a state-of-the-art GAS of 91 ± 2% across 15 objects and healthy participants, demonstrating its effectiveness across different object types. The proposed approach maintained reconstruction success for unseen objects, underscoring its enhanced generalizability compared to learning-based models.

Abstract:
The use of semantic features can improve the efficiency of target search in unknown environments for robotic search and rescue missions. Current target search methods rely on training with large datasets of similar domains, which limits the adaptability to diverse environments. However, human experts possess high-level knowledge about semantic relationships necessary to effectively guide a robot during target search missions in diverse and previously unseen environments. In this paper, we propose a target search method that leverages expert input to train a model of semantic priorities. By employing the learned priorities in a frontier exploration planner using combinatorial optimization, our approach achieves efficient target search driven by semantic features while ensuring robustness and complete coverage. The proposed semantic priority model is trained with several synthetic datasets of simulated expert guidance for target search. Simulation tests in previously unseen environments show that our method consistently achieves faster target recovery than a coverage-driven exploration planner.

Abstract:
Quadruped locomotion provides a natural setting for understanding when model-free learning can outperform model-based control design, by exploiting data patterns to bypass the difficulty of optimizing over discrete contacts and the combinatorial explosion of mode changes. We give a principled analysis of why imitation learning with quadrupeds can be inherently effective in a small data regime, based on the structure of its limit cycles, Poincaré return maps, and local numerical properties of neural networks. The understanding motivates a new imitation learning method that regulates the alignment between variations in a latent space and those over the output actions. Hardware experiments confirm that a few seconds of demonstration is sufficient to train various locomotion policies from scratch entirely offline with reasonable robustness.

Abstract:
Contact-rich manipulation requires reliable estimation of extrinsic contacts (the interactions between a grasped object and its environment), which provide essential contextual information for planning, control, and policy learning. However, existing approaches often rely on restrictive assumptions, such as predefined contact types, fixed grasp configurations, or camera calibration, that hinder generalization to novel objects and deployment in unstructured environments. In this paper, we present UNIC, a unified multimodal framework for extrinsic contact estimation that operates without any prior knowledge or camera calibration. UNIC directly encodes visual observations in the camera frame and integrates them with proprioceptive and tactile modalities in a fully data-driven manner. It introduces a unified contact representation based on scene affordance maps that captures diverse contact formations and employs a multimodal fusion mechanism with random masking, enabling robust multimodal representation learning. Extensive experiments demonstrate that UNIC performs reliably. It achieves a 9.6 mm average Chamfer distance error on unseen contact locations, performs well on unseen objects, remains robust under missing modalities, and adapts to dynamic camera viewpoints. These results establish extrinsic contact estimation as a practical and versatile capability for contact-rich manipulation.
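
For readers unfamiliar with the Chamfer distance quoted above, the following sketch computes one common variant: the average of the two directed mean nearest-neighbor distances between point sets. Definitions differ across papers (squared versus unsquared distances, sum versus mean), so this is an assumed convention and may not match the one used to obtain the 9.6 mm figure.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, D) and b (M, D)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = d.min(axis=1).mean()    # each point in a to its nearest point in b
    b_to_a = d.min(axis=0).mean()    # each point in b to its nearest point in a
    return 0.5 * (a_to_b + b_to_a)

predicted = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
ground_truth = [[0.0, 0.0, 0.0]]
error = chamfer_distance(predicted, ground_truth)
```

The brute-force pairwise matrix is fine for small contact sets; larger clouds would normally use a KD-tree for the nearest-neighbor queries.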

Abstract:
Many manipulation tasks require memory beyond the current observation, yet most visuomotor policies rely on the Markov assumption and thus struggle with repeated states or long-horizon dependencies. Existing methods attempt to extend observation horizons but remain insufficient for diverse memory requirements. To this end, we propose an object-centric history representation based on point tracking, which abstracts past observations into a compact and structured form that retains only essential task-relevant information. Tracked points are encoded and aggregated at the object level, yielding a compact history representation that can be seamlessly integrated into various visuomotor policies. Our design provides full history-awareness with high computational efficiency, leading to improved overall task performance and decision accuracy. Through extensive evaluations on diverse manipulation tasks, we show that our method addresses multiple facets of memory requirements - such as task stage identification, spatial memorization, and action counting, as well as longer-term demands like continuous and pre-loaded memory - and consistently outperforms both Markovian baselines and prior history-based approaches. Project website: https://tonyfang.net/history/.

Abstract:
Cross-embodiment manipulation is crucial for enhancing the scalability of robot manipulation and reducing the high cost of data collection. However, the significant differences between embodiments, such as variations in action spaces and structural disparities, pose challenges for joint training across multiple sources of data. To address this, we propose One-Policy-Fits-All (OPFA), a framework that enables learning a single, versatile policy across multiple embodiments. We first learn a Geometry-Aware Latent Representation (GaLR), which leverages 3D convolution networks and transformers to build a shared latent action space across different embodiments. Then we design a unified latent retargeting decoder that extracts embodiment-specific actions from the latent representations, without any embodiment-specific decoder tuning. OPFA enables end-to-end co-training of data from any embodiment, including various grippers and dexterous hands with arbitrary degrees of freedom, significantly improving data efficiency and reducing the cost of skill transfer. We conduct extensive experiments across 11 different end-effectors. The results demonstrate that OPFA significantly improves policy performance in diverse settings by leveraging heterogeneous embodiment data. For instance, cross-embodiment co-training can improve success rates by more than 50% compared to single-source training. Moreover, by adding only a few demonstrations from a new embodiment (e.g., eight), OPFA can achieve performance comparable to that of a well-trained model with 72 demonstrations.

Abstract:
Audio serves as an important bridge connecting humans to their surroundings, providing a unique modality for perceiving the world. For embodied AI systems, such as robots and autonomous vehicles, enabling them to understand the world through sound is a promising and significant research direction. In this paper, we explore the underexplored domain of audio-driven 3D shape generation and propose a novel architecture for audio-conditioned 3D shape synthesis. Specifically, our framework comprises three key modules: cross-modal alignment, a latent diffusion model for generation, and a 3D Gaussian Splatting (3DGS) based optimization module. We first align audio and 3D shape representations within a unified embedding space using a contrastive learning strategy, which conditions a latent diffusion model to generate an initial coarse 3D structure. Subsequently, we introduce a refinement stage utilizing 3D Gaussian Splatting to produce high-fidelity 3D shapes. Extensive qualitative and quantitative experiments validate the effectiveness of our proposed method, demonstrating its capability to generate semantically coherent 3D shapes from audio input.

Abstract:
Unmanned Aerial Vehicles (UAVs) depend on onboard sensors for perception, navigation, and control. However, these sensors are susceptible to physical attacks, such as GPS spoofing, that can corrupt state estimates and lead to unsafe behavior. While reinforcement learning (RL) offers adaptive control capabilities, existing safe RL methods are ineffective against such attacks. We present ARMOR (Adaptive Robust Manipulation-Optimized State Representations), an attack-resilient, model-free RL controller that enables robust UAV operation under adversarial sensor manipulation. Instead of relying on raw sensor observations, ARMOR learns a robust latent representation of the UAV's physical state via a two-stage training framework. In the first stage, a teacher encoder, trained with privileged attack information, generates attack-aware latent states for RL policy training. In the second stage, a student encoder is trained via supervised learning to approximate the teacher's latent states using only historical sensor data, enabling real-world deployment without privileged information. Our experiments show that ARMOR outperforms conventional methods, ensuring UAV safety. Additionally, ARMOR improves generalization to unseen attacks and reduces training cost by eliminating the need for iterative adversarial training.

Abstract:
This paper presents Spiking-Refined 3D Object Detection through YOLO-SNN Fusion, a real-time pipeline that leverages both convolutional and spiking neural representations for enhanced scene perception. Our system integrates YOLOv11 for robust 2D detection, Depth Anything v2 for monocular depth inference, and geometry-based reasoning for 3D bounding box construction, while a Bird's-Eye View visualizer provides spatial context. To further improve recognition consistency, we fuse the predictions of a trained Spiking Neural Network (SNN) with YOLO outputs, enabling class refinement that is more resilient to temporal noise and ambiguous appearances. Kalman filtering is employed to stabilize trajectories over time, ensuring coherent 3D tracking. Unlike sensor-heavy setups, our approach runs on a single RGB camera and lightweight models, making it suitable for robotic perception, AR/VR applications, and low-cost embedded platforms. Experiments on real-world video sequences demonstrate improved 3D detection accuracy, temporal stability, and cross-class discrimination compared to conventional monocular pipelines.

Abstract:
Cloth manipulation is a difficult problem mainly because of the non-rigid nature of cloth, which makes a good representation of deformation essential. We present a new representation for the deformation-state of clothes. First, we propose the dGLI disk representation based on topological indices computed for edge segments of the cloth border that are arranged on a circular grid. The heat-map of the dGLI disk uncovers patterns that correspond to features of the cloth state that are consistent for different shapes, sizes or orientation of the cloth. We then abstract these important features from the dGLI disk into a circle, calling it the Cloth StatE representation (CloSE). This representation is compact, continuous, and general for different shapes. We show that this representation is able to accurately predict the fold locations for several simulation clothing datasets. Finally, we also show the strengths of this representation in two relevant applications: semantic labeling and high- and low-level planning. The code and the dataset can be accessed from: https://close-representation.github.io/

Abstract:
Adverse weather conditions, such as rain, snow, and fog, severely degrade LiDAR semantic segmentation by introducing refraction, scattering, and point dropouts that compromise geometric integrity. While prior approaches ranging from weather simulation and mixing-based augmentation to domain randomization and regularization enhance robustness, they frequently overlook structural vulnerabilities inherent to object boundaries, corners, and highly sparse regions. To address this limitation, we propose a Light Geometry-Aware Adapter. This module aligns azimuths and applies horizontal circular padding to preserve neighbor continuity across the 0°/360° wrap-around boundary. Using a local-window K-Nearest Neighbors (KNN) search, it aggregates nearby points and computes lightweight local statistics, compressing them into compact geometry-aware cues. During training, these cues facilitate region-aware regularization, which effectively stabilizes predictions in structurally fragile areas. The proposed adapter is designed to be plug-and-play, complements existing augmentation techniques, and operates exclusively during training, incurring negligible inference overhead. Operating under a rigorous source-only cross-weather paradigm wherein models are trained on SemanticKITTI and evaluated on SemanticSTF without target-domain labels or fine-tuning, our adapter achieves a +3.4 mIoU improvement over strong data-centric augmentation baselines. Furthermore, it demonstrates performance comparable to advanced class-centric regularization methods. These findings highlight that geometry-driven regularization constitutes a critical pathway toward achieving highly robust, all-weather LiDAR segmentation.
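
The horizontal circular padding step described above can be sketched on a range image whose columns span the full azimuth sweep. The array layout (rows as elevation, columns as azimuth) and the function name are assumptions for illustration; the adapter's actual data layout is not given in the abstract.

```python
import numpy as np

def pad_azimuth_circular(range_image, pad):
    """Wrap-pad the azimuth (column) axis so the 0 and 360 degree columns become neighbors."""
    # No padding along elevation (rows); wrap-around padding along azimuth (columns).
    return np.pad(range_image, ((0, 0), (pad, pad)), mode="wrap")

# Tiny 2 x 4 example: 2 elevation rings, 4 azimuth bins.
image = np.arange(8).reshape(2, 4)
padded = pad_azimuth_circular(image, pad=1)
# padded rows: [3, 0, 1, 2, 3, 0] and [7, 4, 5, 6, 7, 4]
```

After padding, a local window centered on the first or last azimuth bin sees its true angular neighbors instead of an artificial image border.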

Abstract:
This work presents amortized path-integral policies that enable efficient and real-time active search for robotic systems. We model search as an active sensing problem where agents select actions to maximize information about target locations. Unlike previous approaches that only consider information gain at final waypoints, our method accounts for observations along entire paths. To address the computational expense of path-integral policies, we amortize costs through Graph Neural Network (GNN) policies trained via behavior cloning. GNNs provide equivariance to spatial transformations and generalize across diverse maps. We validate our approach through field experiments in a 75,000 m² forested environment using an autonomous ground vehicle, along with simulated testing. Our experiments demonstrate successful policy amortization, cross-map transfer, and improved search efficiency.

Abstract:
Imitation learning is a promising approach for training autonomous vehicles (AVs) to navigate complex traffic environments by mimicking expert driver behaviors. While existing imitation learning frameworks focus on leveraging expert demonstrations, they often overlook the potential of additional complex driving data from surrounding traffic participants. In this paper, we study a data augmentation strategy that leverages the observed trajectories of nearby vehicles, captured by the AV's sensors, as additional demonstrations. We introduce a simple vehicle-selection sampling and filtering strategy that prioritizes informative and diverse driving behaviors, contributing to a richer dataset for training. We evaluate this idea with a representative learning-based planner on a large real-world dataset and demonstrate improved performance in complex driving scenarios. Specifically, the approach reduces collision rates and improves safety metrics compared to the baseline. Notably, even when using only 10 percent of the original dataset, the method matches or exceeds the performance of the full dataset. Through ablations, we analyze selection criteria and show that naive random selection can degrade performance. Our findings highlight the value of leveraging diverse real-world trajectory data in imitation learning and provide insights into data augmentation strategies for autonomous driving.

Abstract:
Developing broadly generalizable visual navigation policies for robots is a significant challenge, primarily constrained by the availability of large-scale, diverse training data. While curated datasets collected by researchers offer high quality, their limited size restricts policy generalization. To overcome this, we explore leveraging abundant, passively collected data sources, including large volumes of crowd-sourced teleoperation data and unlabeled YouTube videos, despite their potential for lower quality or missing action labels. We propose Model-Based ReAnnotation (MBRA), a framework that utilizes a learned short-horizon, model-based expert model to relabel or generate high-quality actions for these passive datasets. This relabeled data is then distilled into LogoNav, a long-horizon navigation policy conditioned on visual goals or GPS waypoints. We demonstrate that LogoNav, trained using MBRA-processed data, achieves state-of-the-art performance, enabling robust navigation over distances exceeding 300 meters in previously unseen indoor and outdoor environments. Our extensive real-world evaluations, conducted across a fleet of robots (including quadrupeds) in six cities on three continents, validate the policy's ability to generalize and navigate effectively even amidst pedestrians in crowded settings.

Abstract:
This study introduces a novel Bowden cable (BC) system for hand-assistive exoskeletons employing superelastic (SE) shape memory alloy wires to address key limitations in efficiency and safety. The unique properties of SE wires enable a single-wire transmission, offering enhanced performance, plus inherent self-sensing and self-limiting capabilities that provide tendon-like overload protection. Experimental results obtained with a setup simulating use conditions demonstrate the superior efficiency of SE wires, with 1/4 the friction of conventional steel cables. In addition, a force-sensing capability based on monitoring electrical resistance is validated and detects overload within 1% force error. This, along with the inherent passive force self-limiting behaviour during simulated collisions, demonstrates the ability of the SE BC to effectively mimic the protective function of biological tendons. Therefore, this biomimetic innovation in soft robotic transmission significantly improves safety and efficiency, presenting a promising advancement for human-robot interaction in assistive and rehabilitative robotics.

Abstract:
Soft optical sensors hold potential for enhancing minimally invasive procedures like colonoscopy, yet their complex, multi-modal responses pose significant challenges. This work introduces a machine learning (ML) framework for real-time estimation of 3D shape and contact force in a soft robotic sleeve for colonoscopy. To overcome limitations of manual calibration and collect large datasets for ML, we developed an automated platform for collecting data across a range of orientations, curvatures, and contact forces. A cascaded ML architecture was implemented for sequential estimation of contact force and 3D shape, achieving errors of 4.7% for curvature, 2.37% for orientation, and 5.5% for force tracking. We also explored the potential of ML for contact localization by training a model to estimate contact intensity and location across 16 indenters distributed along the sleeve. The force intensity was estimated with an error ranging from 0.06 N to 0.31 N across the indenters. Despite the proximity of the contact points, the system achieved high localization performance, with 8 indenters reaching over 80% accuracy, demonstrating promising spatial resolution.

Abstract:
Effective human-robot collaboration depends on task-oriented handovers, where robots present objects in ways that support the partner's intended use. However, many existing approaches neglect the human's intended action after the handover, relying on assumptions that limit generalizability. To address this gap, we propose LLM-Handover, a novel framework that integrates large language model (LLM)-based reasoning with part segmentation to enable context-aware grasp selection and execution. Given an RGB-D image and a task description, our system infers relevant object parts and selects grasps that optimize post-handover usability. To support evaluation, we introduce a new dataset of 60 household objects spanning 12 categories, each annotated with detailed part labels. We first demonstrate that our approach improves the performance of the underlying state-of-the-art part segmentation method in the context of robot-human handovers. Next, we show that LLM-Handover achieves higher grasp success rates and adapts better to post-handover task constraints. During hardware experiments, we achieve a success rate of 83% in a zero-shot setting over conventional and unconventional post-handover tasks. Finally, our comparative user study underlines that our method enables more intuitive, context-aware handovers, with participants preferring it in 86% of cases.

Abstract:
As robotic systems become increasingly complex, the need for explainable decision-making becomes critical. Existing explainability approaches in robotics typically either focus on individual modules, which can be difficult to query from the perspective of high-level behaviour, or employ monolithic approaches, which do not exploit the modularity of robotic architectures. We present HEXAR (Hierarchical EXplainability Architecture for Robots), a novel framework that provides a plug-in, hierarchical approach to generate explanations about robotic systems. HEXAR consists of specialised component explainers using diverse explanation techniques (e.g., LLM-based reasoning, causal models, and feature importance) tailored to specific robot modules, orchestrated by an explainer selector that chooses the most appropriate one for a given query. We implement and evaluate HEXAR on a TIAGo robot performing assistive tasks in a home environment, comparing it against end-to-end and aggregated baseline approaches across 180 scenario-query variations. We observe that HEXAR significantly outperforms baselines in root cause identification, incorrect information exclusion, and runtime, offering a promising direction for transparent autonomous systems.

Abstract:
Soft growing robots, as highly mobile pneumatic membrane robots, are limited in control performance due to their soft structure and nonlinear mechanical properties, especially under dynamic conditions. Therefore, developing reliable control strategies for these robots is essential. This study proposes a dual-thread, goal-oriented control strategy for soft growing robots that combines planning and control. By integrating graph convolutional networks with deep reinforcement learning, the global path planning method is better suited to the self-growing behaviors of soft robots, leading to improvements in both computational efficiency and accuracy compared to inverse kinematics planning methods. Motion control reduces the adverse effects of deformation errors caused by the robot's own low stiffness or by disturbances in the external environment. This strategy effectively combines reinforcement learning-based global planning with a multiple closed-loop motion control system, addressing the issues of low precision and reliability under dynamic conditions. Experimental results demonstrate that the robot achieves a tracking accuracy of 11.83 mm within a 5-meter range and successfully tracks and approaches a non-cooperative dynamic target. These results highlight the significant potential of the proposed approach in applications such as target capture and dynamic manipulation.

Abstract:
Micro underwater robots offer scalable and low-cost access to environments that are difficult to study with conventional vehicles, but severe communication constraints, limited onboard power, and low swimming speed restrict the capability of these miniature systems. Inspired by colonial organisms such as salps and siphonophores, this work explores physically connected swarms of small underwater robots that form larger structures with improved collective performance. We present a modular platform of centimeter-scale robots capable of three-dimensional propulsion, onboard sensing, and autonomous behavior, with magnetic interfaces that enable reversible connections into prescribed morphologies. Experiments demonstrate autonomous assembly and disassembly, quantify the propulsive benefits of chain aggregates, and show that mechanically coupled robots can distribute sensing, actuation, and control across the collective. Results indicate that certain colonial architectures can greatly improve swimming speed, locomotion efficiency, and task performance compared to individual robots, suggesting a path toward more capable centimeter-scale underwater swarms.

Abstract:
Reinforcement learning (RL) has shown promise in generating robust locomotion policies for bipedal robots, but often suffers from tedious reward design and sensitivity to poorly shaped objectives. In this work, we propose a structured reward shaping framework that leverages model-based trajectory generation and control Lyapunov functions (CLFs) to guide policy learning. We explore two model-based planners for generating reference trajectories: a reduced-order linear inverted pendulum (LIP) model for velocity-conditioned motion planning, and a precomputed gait library based on hybrid zero dynamics (HZD) using full-order dynamics. These planners define desired end-effector and joint trajectories, which are used to construct CLF-based rewards that penalize tracking error and encourage rapid convergence. This formulation provides meaningful intermediate rewards, and is straightforward to implement once a reference is available. Both the reference trajectories and CLF shaping are used only during training, resulting in a lightweight policy at deployment. We validate our method both in simulation and through extensive real-world experiments on a Unitree G1 robot. CLF-RL demonstrates significantly improved robustness relative to the baseline RL policy and better performance than a classic tracking reward RL formulation.
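
A CLF-based reward of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions: the quadratic form V(e) = Σ p_i e_i² with a diagonal weight, the finite-difference estimate of the CLF derivative, and the names `lam`, `w_track`, and `w_decay` are hypothetical choices, not the paper's formulation.

```python
def clf_value(error, p_diag):
    """Quadratic CLF candidate V(e) = sum_i p_i * e_i^2 (diagonal weight assumed)."""
    return sum(p * e * e for p, e in zip(p_diag, error))


def clf_reward(error, next_error, p_diag, dt, lam=1.0, w_track=1.0, w_decay=1.0):
    """Reward that penalizes tracking error and violations of V_dot <= -lam * V."""
    v = clf_value(error, p_diag)
    v_next = clf_value(next_error, p_diag)
    v_dot = (v_next - v) / dt  # finite-difference estimate of the CLF derivative
    violation = max(0.0, v_dot + lam * v)  # positive only when convergence is too slow
    return -w_track * v_next - w_decay * violation
```

With `p_diag = [1.0, 1.0]` and `dt = 0.01`, a step that shrinks the tracking error from `[1.0, 0.0]` to `[0.5, 0.0]` incurs only the tracking penalty. A step that grows it to `[1.1, 0.0]` also pays the decay-violation term, so it scores much lower.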

Abstract:
Recent years have seen a focus on research into distributed optimization algorithms for multi-robot Collaborative Simultaneous Localization and Mapping (C-SLAM). Research in this domain, however, is made difficult by a lack of standard benchmark datasets. Such datasets have been used to great effect in the field of single-robot SLAM, and researchers focused on multi-robot problems would benefit greatly from dedicated benchmark datasets. To address this gap, we design and release the Collaborative Open-Source Multi-robot Optimization Benchmark (COSMO-Bench) -- a suite of 24 datasets derived from a baseline C-SLAM front-end and real-world LiDAR data.

Abstract:
Trajectory prediction is a fundamental yet challenging task in intelligent systems. Existing methods are mainly categorized as single-stage time-domain, two-stage time-domain, or two-stage spectrum-domain approaches, while single-stage spectrum-domain methods have been relatively underexplored. In the frequency domain, low-frequency components reflect global motion trends, while high-frequency components capture fine-grained local variations. Most existing spectrum-domain approaches process these components independently, overlooking their intrinsic complementarity. Inspired by the success of bilinear models in explicitly capturing cross-factor interactions, we propose S^3-Net, a single-stage spectrum-domain trajectory prediction network with a bilinear fusion module that integrates low- and high-frequency dynamics. This design yields richer spectral representations and enables accurate, socially compliant, and multimodal predictions. Experiments on the ETH-UCY and Stanford Drone Datasets demonstrate that S^3-Net achieves up to 16.8%/15.1% ADE/FDE reduction over spectrum-domain baselines while maintaining a compact model size and low inference latency, making it suitable for real-time scenarios.
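
The split into low- and high-frequency components and their bilinear fusion can be illustrated with a small standard-library sketch. The cutoff rule, the function names, and the weight tensor passed to `bilinear_fuse` are hypothetical; S^3-Net's actual fusion module is not specified here.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split_bands(x, cutoff):
    """Split a 1D trajectory coordinate into low- and high-frequency parts.
    Bins with min(k, n - k) <= cutoff count as low frequency."""
    n = len(x)
    spectrum = dft(x)
    low = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    high = [c - l for c, l in zip(spectrum, low)]
    return idft(low), idft(high)


def bilinear_fuse(a, b, weights):
    """Bilinear form: out[m] = sum_ij weights[m][i][j] * a[i] * b[j]."""
    return [sum(w[i][j] * a[i] * b[j]
                for i in range(len(a)) for j in range(len(b)))
            for w in weights]
```

The two bands always sum back to the original signal, and with `cutoff=0` the low band reduces to the mean position, that is, the global trend. Unlike processing each band independently, the bilinear form multiplies every low-frequency feature with every high-frequency feature, which makes the cross-band interaction explicit.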

Abstract:
Although generative models have shown substantial impact across multiple domains, their potential for scene synthesis remains underexplored in robotics. This gap is more evident in drone simulators, where simulation environments still rely heavily on manual efforts, which are time-consuming to create and difficult to scale. In this work, we introduce AeroScene, a hierarchical diffusion model for progressive 3D scene synthesis. Our approach leverages hierarchy-aware tokenization and multi-branch feature extraction to reason across both global layouts and local details, ensuring physical plausibility and semantic consistency. This makes AeroScene particularly suited for generating realistic scenes for aerial robotics tasks such as navigation, landing, and perching. We demonstrate its effectiveness through extensive experiments on our newly collected dataset and a public benchmark, showing that AeroScene significantly outperforms prior methods. Furthermore, we use AeroScene to generate a large-scale dataset of over 1,000 physics-ready, high-fidelity 3D scenes that can be directly integrated into NVIDIA Isaac Sim. Finally, we illustrate the utility of these generated environments on downstream drone navigation tasks.

Abstract:
Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high-quality demonstrations are easy to obtain. While IL has been widely applied to single-robot settings, relatively few studies have addressed the extension of these methods to multi-agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round-Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi-robot systems through sequential, single-agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi-agent behavior to the entire system, without requiring demonstrations in the joint multi-agent action space. We show that R2BC methods match, and in some cases surpass, the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi-agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.

Abstract:
This paper studies runtime monitoring for persistent surveillance by autonomous robots when the autonomy stack is a black box. The environment is partitioned into finitely many parts, each carrying an uncertainty state that decreases when observed and increases otherwise. We model the closed loop as a state-dependent hybrid system with linear parameter varying dynamics and design a monitor based on an invariant computed offline. As this invariant is typically hard to obtain for large to-be-surveyed spaces, we propose a compositional monitor obtained by decentralized computation of low-dimensional invariant sets for each uncertainty region, and checking their conjunction online. Under common independence assumptions, the compositional monitor is sound and complete with respect to the full-system invariant. The approach is applied in a case study with a real robot persistently monitoring a labyrinth, emphasizing its applicability in practice.
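
The compositional idea, checking one low-dimensional condition per region and taking their conjunction online, can be sketched in a few lines. This toy version assumes constant growth and decay rates and replaces each offline-computed invariant set with a simple interval bound; all names and numbers are hypothetical.

```python
def step_uncertainty(u, observed, growth, decay):
    """Uncertainty decreases (floored at 0) when the region is observed, else grows."""
    return max(0.0, u - decay) if observed else u + growth


def violating_regions(uncertainties, bounds):
    """Indices of regions whose uncertainty left its interval [0, bound].
    The interval stands in for a per-region invariant set computed offline."""
    return [i for i, (u, b) in enumerate(zip(uncertainties, bounds))
            if not 0.0 <= u <= b]


def monitor_ok(uncertainties, bounds):
    """Compositional monitor: the conjunction of all per-region checks."""
    return not violating_regions(uncertainties, bounds)
```

Because each check only touches one region's state, the online cost grows linearly with the number of regions, and a failed check directly names the region that the black-box autonomy stack has neglected.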

Abstract:
Understanding open-world semantics is critical for robotic planning and control, particularly in unstructured outdoor environments. Existing vision-language mapping approaches typically rely on object-centric segmentation priors, which often fail outdoors due to semantic ambiguities and indistinct class boundaries. We propose OTAS - an Open-vocabulary Token Alignment method for outdoor Segmentation. OTAS addresses the limitations of open-vocabulary segmentation models by extracting semantic structure directly from the output tokens of pre-trained vision models. By clustering semantically similar structures across single and multiple views and grounding them in language, OTAS reconstructs a geometrically consistent feature field that supports open-vocabulary segmentation queries. Our method operates in a zero-shot manner, without scene-specific fine-tuning, and achieves real-time performance of up to ~17 fps. On the Off-Road Freespace Detection dataset, OTAS yields a modest IoU improvement over fine-tuned and open-vocabulary 2D segmentation baselines. In 3D segmentation on TartanAir, it achieves up to a 151% relative IoU improvement compared to existing open-vocabulary mapping methods. Real-world reconstructions further demonstrate OTAS' applicability to robotic deployment. Code and a ROS 2 node are available at https://otas-segmentation.github.io/.

Abstract:
3D Gaussian Splatting (3DGS) has recently demonstrated impressive capabilities in real-time novel view synthesis. However, the performance of 3DGS tends to degrade significantly when the quality of the initial point cloud is poor. Although subsequent research has successfully addressed the initialization issue by using suboptimal point clouds to train the 3D Gaussian model, certain challenges still remain in practical applications. Specifically, there is no effective pruning strategy to thoroughly eliminate suboptimal points (defined as erroneous points in this paper). The excessive accumulation of these erroneous points leads to overfitting in specific viewpoints, thereby affecting the visual appearance and geometric accuracy in novel view synthesis. To address these challenges, we propose a novel 3DGS optimization method named MVC-GS, which introduces two key contributions. First, based on multi-view geometric constraints, we use image rendering errors as a guiding criterion for optimization. By performing point calibration in the target region, we effectively mitigate the impact of erroneous Gaussian points. Second, we introduce a multi-view Gaussian attribute optimization method that further enhances the precision of the 3D Gaussian attribute representation while avoiding overfitting to the training views. We conducted comprehensive visualization analysis across multiple scenes in various datasets. Extensive experiments on public datasets show that the proposed method achieves state-of-the-art performance across diverse scenes.

Abstract:
In this paper, we address the point cloud registration problem, where well-known methods like ICP fail under uncertainty arising from sensor noise, pose-estimation errors, and partial overlap due to occlusion. We develop a novel approach, Gaussian Process Concept Attribution (GP-CA), which not only quantifies registration uncertainty but also explains it by attributing uncertainty to well-known sources of errors in registration problems. Our approach leverages active learning to discover new uncertainty sources in the wild by querying informative instances. We validate GP-CA on three publicly available datasets and in our real-world robot experiment. Extensive ablations substantiate our design choices. Our approach outperforms other state-of-the-art methods in runtime, sample efficiency with active learning, and accuracy. Our real-world experiment clearly demonstrates its applicability. Our video also demonstrates that GP-CA enables effective failure-recovery behaviors, yielding more robust robotic perception.

Abstract:
Large Language Model (LLM)-based robotic assembly assistance has gained significant research attention. It requires the injection of domain-specific knowledge to guide the assembly process through natural language interaction with humans. Despite some progress, existing methods represent knowledge in the form of natural language text. Due to the long context and redundant content, they struggle to meet the robots' requirements for real-time and precise reasoning. To bridge this gap, we present a novel graph-based LLM, denoted as AssemMate, which consists of two stages: graph-based question answering and vision-enhanced grasp execution. The first stage enables natural language question answering on a knowledge graph, supporting human-robot interaction and assembly task planning for specific products. The second stage then takes the plan generated in the first stage as its target, senses stacked scenes, and executes grasping to assist with assembly. Specifically, a self-supervised Graph Convolutional Network (GCN) encodes knowledge graph entities and relations into a latent space and aligns them with the LLM's representation, enabling the LLM to understand graph information. In addition, a vision-enhanced strategy is employed to address stacked scenes in grasping. Through training and evaluation, AssemMate outperforms existing methods, achieving 6.4% higher accuracy, 3 times faster inference, and 28 times shorter context length, while demonstrating strong generalization ability on random graphs. Our approach further demonstrates superiority through robotic grasping experiments in both simulated and real-world settings. More details can be found on the project page: https://github.com/cristina304/AssemMate.git

Abstract:
End-to-end (E2E) driving has become a cornerstone of both industry deployment and academic research, offering a single learnable pipeline that maps multi-sensor inputs to actions while avoiding hand-engineered modules. However, the reliability of such pipelines strongly depends on how well they handle uncertainty: sensors are noisy, semantics can be ambiguous, and interaction with other road users is inherently stochastic. Uncertainty also appears in multiple forms: classification vs. localization, and, crucially, in both static map elements and dynamic agents. Existing E2E approaches model only static-map uncertainty, leaving planning vulnerable to overconfident and unreliable inputs. We present UniUncer, the first lightweight, unified uncertainty framework that jointly estimates and uses uncertainty for both static and dynamic scene elements inside an E2E planner. Concretely: (1) we convert deterministic heads to probabilistic Laplace regressors that output per-vertex location and scale for vectorized static and dynamic entities; (2) we introduce an uncertainty-fusion module that encodes these parameters and injects them into object/map queries to form uncertainty-aware queries; and (3) we design an uncertainty-aware gate that adaptively modulates reliance on historical inputs (ego status or temporal perception queries) based on current uncertainty levels. The design adds minimal overhead and drops throughput by only ~0.5 FPS while remaining plug-and-play for common E2E backbones. On nuScenes (open-loop), UniUncer reduces average L2 trajectory error by 7%. On NavsimV2 (pseudo closed-loop), it improves overall EPDMS by 10.8% and achieves notable Stage-2 gains in challenging, interaction-heavy scenes. Ablations confirm that dynamic-agent uncertainty and the uncertainty-aware gate are both necessary.
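
Step (1) above, a Laplace regressor that outputs a location and a scale per vertex, is typically trained with the Laplace negative log-likelihood. The sketch below shows that standard loss for a single polyline; the function names and the per-coordinate parameterization are illustrative assumptions rather than UniUncer's actual implementation.

```python
import math


def laplace_nll(target, loc, scale):
    """Negative log-likelihood of a Laplace distribution:
    log(2 * scale) + |target - loc| / scale."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    return math.log(2.0 * scale) + abs(target - loc) / scale


def polyline_nll(targets, locs, scales):
    """Mean NLL over all (x, y) coordinates of a vectorized entity.
    Each argument is a list of (x, y) pairs, one per vertex."""
    total, count = 0.0, 0
    for t, m, s in zip(targets, locs, scales):
        for axis in range(2):
            total += laplace_nll(t[axis], m[axis], s[axis])
            count += 1
    return total / count
```

The loss rewards honest uncertainty: for a prediction that is off by 1.0, a scale of 1.0 gives a lower loss than an overconfident scale of 0.1. For an exact prediction, a small scale is cheaper.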

Abstract:
Robot manipulation, especially bimanual manipulation, often requires setting up multiple cameras on multiple robot manipulators. Before robot manipulators can generate motion or even build representations of their environments, the cameras rigidly mounted to the robot need to be calibrated. Camera calibration is a cumbersome process involving collecting a set of images, with each capturing a pre-determined marker. In this work, we introduce the Bi-Manual Joint Calibration and Representation Framework (Bi-JCR). Bi-JCR enables multiple robot manipulators, each with cameras mounted, to circumvent taking images of calibration markers. By leveraging 3D foundation models for dense, marker-free multi-view correspondence, Bi-JCR jointly estimates: (i) the extrinsic transformation from each camera to its end-effector, (ii) the inter-arm relative poses between manipulators, and (iii) a unified, scale-consistent 3D representation of the shared workspace, all from the same captured RGB image sets. The representation, jointly constructed from images captured by cameras on both manipulators, lives in a common coordinate frame and supports collision checking and semantic segmentation to facilitate downstream bimanual coordination tasks. We empirically evaluate the robustness of Bi-JCR on a variety of tabletop environments, and demonstrate its applicability on a variety of downstream tasks.

Abstract:
Magnetic actuation enables contactless control of medical microrobots and instruments and offers the potential for improved safety and effectiveness in robot-assisted minimally invasive surgery. While much research is being conducted on the development of surgical devices, there is a lack of external actuation systems that provide the necessary magnetic field shaping capability for in vivo control. Existing magnetic actuation systems often face trade-offs between field shaping capability and workspace size. In this work, we introduce HybNetic, a mobile hybrid magnetic actuation system that combines a single electromagnet with four independently rotatable permanent magnets mounted on a robotic arm. The C-shaped configuration of HybNetic has an opening of 520 mm, allowing positioning around the human torso. The mobility of the employed robotic arm extends the effective workspace to the length of a human body. We describe the design and field modeling and characterize the magnetic performance by comparing analytical model predictions and finite element simulations with experimental validations. Finally, we demonstrate the versatility of HybNetic by levitating a magnetic sphere and navigating a magnetic guidewire through a dimensionally accurate phantom of the abdominal aorta. The demonstrations highlight the potential of HybNetic as a magnetic actuation system with a workspace that is suitable for in vivo manipulation of macro- and micro-scale magnetic devices.

Abstract:
Dexterous manipulation has advanced rapidly, with policies now capable of performing complex, contact-rich tasks in simulation. However, transferring these policies from simulation to the real world remains a significant challenge. A key obstacle is the mismatch in low-level controller dynamics, where the same trajectories can produce vastly different contact forces and behaviors when control parameters change. Existing solutions often rely on manual tuning or controller randomization, which can be labor-intensive, task-specific, and introduce substantial training difficulty. In this work, we propose DexCtrl, a novel framework that jointly learns actions and controller parameters by leveraging the historical information of both trajectory and controller. This adaptive controller adjustment mechanism enables the policy to automatically tune control parameters during execution, thereby mitigating the severe sim-to-real gap without extensive manual tuning or excessive randomization. Moreover, by explicitly providing controller parameters as part of the observation, our approach facilitates better reasoning over force interactions and improves robustness in real-world scenarios. Experimental results demonstrate that our method achieves improved transfer performance across a variety of dexterous tasks involving variable force conditions.

Abstract:
We introduce UMI-on-Air, a framework for embodiment-aware deployment of embodiment-agnostic manipulation policies. Our approach leverages diverse, unconstrained human demonstrations collected with a handheld gripper (UMI) to train generalizable visuomotor policies. A central challenge in transferring these policies to constrained robotic embodiments, such as aerial manipulators, is the mismatch in control and robot dynamics, which often leads to out-of-distribution behaviors and poor execution. To address this, we propose Embodiment-Aware Diffusion Policy (EADP), which couples a high-level UMI policy with a low-level embodiment-specific controller at inference time. By integrating gradient feedback from the controller's tracking cost into the diffusion sampling process, our method steers trajectory generation towards dynamically feasible modes tailored to the deployment embodiment. This enables plug-and-play, embodiment-aware trajectory adaptation at test time. We validate our approach on long-horizon and high-precision aerial manipulation tasks, showing improved success rates, efficiency, and robustness under disturbances compared to unguided diffusion baselines. Finally, we demonstrate deployment in previously unseen environments, using UMI demonstrations collected in the wild, highlighting a practical pathway for scaling generalizable manipulation skills across diverse, and even highly constrained, embodiments. All code, data, checkpoints, and result videos can be found at umi-on-air.github.io.
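
The guidance mechanism, feeding the gradient of a tracking cost back into each sampling step, can be illustrated with a toy 1D example. Everything here is an assumption made for illustration: the cost penalizes waypoint spacing above a `max_step` limit instead of using a real controller, and `denoise` is a placeholder for one reverse-diffusion step of a trained policy.

```python
def tracking_cost(path, max_step):
    """Toy feasibility cost: squared excess of each waypoint-to-waypoint step."""
    cost = 0.0
    for a, b in zip(path, path[1:]):
        excess = abs(b - a) - max_step
        if excess > 0.0:
            cost += excess * excess
    return cost


def tracking_cost_grad(path, max_step):
    """Analytic gradient of tracking_cost with respect to each waypoint."""
    grad = [0.0] * len(path)
    for i in range(len(path) - 1):
        d = path[i + 1] - path[i]
        excess = abs(d) - max_step
        if excess > 0.0:
            sign = 1.0 if d > 0.0 else -1.0
            grad[i + 1] += 2.0 * excess * sign
            grad[i] -= 2.0 * excess * sign
    return grad


def guided_step(path, denoise, max_step, guidance_scale):
    """One sampling step: denoise, then move against the tracking-cost gradient."""
    proposal = denoise(path)
    grad = tracking_cost_grad(proposal, max_step)
    return [x - guidance_scale * g for x, g in zip(proposal, grad)]
```

With an identity `denoise`, one guided step on `[0.0, 0.1, 2.0, 2.1]` with `max_step=0.5` and `guidance_scale=0.1` pulls the two middle waypoints toward each other and lowers the cost. The learned policy itself is untouched, which is what makes this style of adaptation plug-and-play at test time.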

Abstract:
Autonomous Vehicle (AV) perception systems have advanced rapidly in recent years, providing vehicles with the ability to accurately interpret their environment. However, perception systems remain susceptible to errors caused by overly confident predictions in the case of rare events or out-of-sample data. This study equips an autonomous vehicle with the ability to know when it is uncertain, using an uncertainty-aware image classifier as part of the AV software stack. Specifically, the study exploits the ability of Random-Set Neural Networks (RS-NNs) to explicitly quantify prediction uncertainty. Unlike traditional CNNs or Bayesian methods, RS-NNs predict belief functions over sets of classes, allowing the system to identify and signal uncertainty clearly in novel or ambiguous scenarios. The system is tested in a real-world autonomous racing vehicle software stack, with the RS-NN classifying the layout of the road ahead and providing the associated uncertainty of the prediction. Performance of the RS-NN under a range of road conditions is compared against traditional CNNs and Bayesian neural networks, with the RS-NN achieving significantly higher accuracy and superior uncertainty calibration. This integration of RS-NNs into a Robot Operating System (ROS)-based vehicle control pipeline demonstrates that predictive uncertainty can dynamically modulate vehicle speed, maintaining high-speed performance under confident predictions while proactively improving safety through speed reductions in uncertain scenarios. These results demonstrate the potential of uncertainty-aware neural networks - in particular RS-NNs - as a practical solution for safer and more robust autonomous driving.

Abstract:
Since vision-based manipulation policies are typically trained from data gathered from a single viewpoint, their performance drops when the view changes during deployment. Naively aggregating demonstrations from numerous random views is not only costly but also known to destabilize learning, as excessive visual diversity acts as noise. We present Vantage, a viewpoint selection framework to fine-tune any pre-trained policy on a small, strategically chosen set of camera poses to induce viewpoint-agnostic behavior. Instead of relying on costly brute-force search over viewpoints, Vantage formulates camera placement as an information gain optimization problem in a continuous space. This approach balances exploration of novel poses with exploitation of promising ones, while also providing theoretical guarantees about convergence and robustness. Across manipulation tasks and policy families, Vantage consistently improves success under viewpoint shifts compared to fixed, grid, or random data selection strategies with only a handful of fine-tuning steps. Experiments conducted on simulated and real-world setups show that Vantage increases the task success rate by �?5% for diffusion policies, and yields robust gains in dynamic-camera settings.

Abstract:
Diffusion models are increasingly used in robotics to represent multi-modal distributions over system states and behaviors, but precise control of generated outcomes without degrading physical realism remains challenging. This paper introduces a controllable diffusion framework that (i) replaces the standard unimodal Gaussian prior with an explicit multi-modal prior, and (ii) enforces modal coupling between prior components and principal data modes through novel forward and reverse diffusion processes. Sampling is initialized directly from a selected prior mode aligned with task constraints, avoiding train-test mismatch and manifold drift commonly induced by post-hoc guidance. Empirical evaluations on motion prediction (Waymo Dataset) and multi-task control (Maze2D) show consistent improvements over guidance-based baselines in fidelity, diversity, and controllability. These results indicate that multi-modal priors with strong modal coupling provide a scalable basis for controllable motion generation in robotics.

Abstract:
Planning long-horizon manipulation motions using a set of predefined skills is a central challenge in robotics; solving it efficiently could enable general-purpose robots to tackle novel tasks by flexibly composing generic skills. Solutions to this problem lie in an infinitely vast space of parameterized skill sequences, a space where common incremental methods struggle to find sequences that have non-obvious intermediate steps. Some approaches reason over lower-dimensional, symbolic spaces, which are more tractable to explore but may be brittle and are laborious to construct. In this work, we introduce Mosaic, a skill-centric, multi-directional planning approach that targets these challenges by reasoning about which skills to employ and where they are most likely to succeed, by utilizing physics simulation to estimate skill execution outcomes. Specifically, Mosaic employs two complementary skill families: Generators, which identify "islands of competence" where skills are demonstrably effective, and Connectors, which link these skill-trajectories by solving boundary value problems. By focusing planning efforts on regions of high competence, Mosaic efficiently discovers physically-grounded solutions. We demonstrate its efficacy on complex long-horizon problems in both simulation and the real world, using a diverse set of skills including generative diffusion models, motion planning algorithms, and manipulation-specific models. Visit skill-mosaic.github.io for demonstrations.

Abstract:
We study decentralized cooperative transport using teams of N quadruped robots, each with an arm, that must pinch, lift, and move ungraspable objects through physical contact alone. Unlike prior work that relies on rigid mechanical coupling between robots and objects, we address the more challenging setting where mechanically independent robots must coordinate through contact forces alone without any communication or centralized control. To this end, we employ a hierarchical policy architecture that separates base locomotion from arm control, and propose a constellation reward formulation that unifies position and orientation tracking to enforce rigid contact behavior. The key insight is encouraging robots to behave as if rigidly connected to the object through careful reward design and training curriculum rather than explicit mechanical constraints. Our approach enables coordination through shared policy parameters and implicit synchronization cues, scaling to arbitrary team sizes without retraining. We show extensive simulation experiments to validate the approach and demonstrate robust transport across 2-10 robots on diverse object geometries and masses, along with sim2real transfer results on lightweight objects.

Abstract:
Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved sample efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images.

Abstract:
A central challenge for multi-robot systems is fusing independently gathered perception data into a unified representation. Despite progress in Collaborative SLAM (C-SLAM), benchmarking remains hindered by the scarcity of dedicated multi-robot datasets. Many evaluations instead partition single-robot trajectories, a practice that may only partially reflect true multi-robot operations and, more critically, lacks standardization, leading to inconsistent results across studies. While several multi-robot datasets have recently been introduced, they mostly contain short trajectories with limited inter-robot overlap and sparse intra-robot loop closures. To overcome these limitations, we introduce CU-Multi, a dataset collected over multiple days at two large outdoor sites on the University of Colorado Boulder campus. CU-Multi comprises four synchronized runs with aligned start times and controlled trajectory overlap, replicating the distinct perspectives of a robot team. It includes RGB-D sensing, RTK GPS, semantic LiDAR, and refined ground-truth odometry. By combining overlap variation with dense semantic annotations, CU-Multi provides a strong foundation for reproducible evaluation in multi-robot collaborative perception tasks.

Abstract:
Mobile robots deployed for persistent operations in partially known environments need to be able to recover and adapt against unforeseen changes in dynamics, e.g., due to failures, or external disturbances. This paper presents a novel hierarchical framework capable of zero-shot adaptation to environmental and dynamic changes. At the high level, an abstract planner generates a collision-free global path, adapting to degraded mobility by inflating a dynamic safety buffer around obstacles to ensure the route remains navigable. At the low level, a concrete planner employs a conditional Denoising Diffusion Probabilistic Model (DDPM) to refine the abstract path into a smooth, executable trajectory. The key to our approach is conditioning the diffusion model's generation process on the robot's online-estimated dynamic limits. Our framework's effectiveness and robustness are validated in both complex simulations and real-world hardware experiments, demonstrating its ability to ensure mission success under unstructured and unexpected fault situations.
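
The high-level idea of inflating a safety buffer around obstacles as mobility degrades can be sketched on a small occupancy grid. This is a minimal illustration under stated assumptions: the grid representation, the helper names `buffer_cells` and `inflate_obstacles`, and the rule that the buffer grows inversely with a mobility ratio are all hypothetical and are not taken from the paper.

```python
import math


def buffer_cells(base_cells, mobility):
    """Hypothetical rule: the buffer grows as mobility (1.0 = healthy) degrades."""
    mobility = min(max(mobility, 0.1), 1.0)  # clamp to avoid division by zero
    return math.ceil(base_cells / mobility)


def inflate_obstacles(grid, radius):
    """Return a copy of a 0/1 occupancy grid with each obstacle grown by
    `radius` cells in every direction (square, i.e. Chebyshev, neighbourhood)."""
    rows, cols = len(grid), len(grid[0])
    inflated = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1:
                continue
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        inflated[rr][cc] = 1
    return inflated


if __name__ == "__main__":
    grid = [[0] * 5 for _ in range(5)]
    grid[2][2] = 1  # single obstacle in the centre
    for mobility in (1.0, 0.5):
        radius = buffer_cells(1, mobility)
        occupied = sum(map(sum, inflate_obstacles(grid, radius)))
        print(mobility, radius, occupied)
```

A global planner would then search the inflated grid, so a robot with degraded mobility is routed further from obstacles.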

Abstract:
Finding high-quality solutions quickly is an important objective in motion planning. This is especially true for high-degree-of-freedom robots. Satisficing planners have traditionally found feasible solutions quickly but provide no guarantees on their optimality, while almost-surely asymptotically optimal (a.s.a.o.) planners have probabilistic guarantees on their convergence towards an optimal solution but are more computationally expensive. This paper uses the AO-x meta-algorithm to extend the satisficing RRT-Connect planner to optimal planning. The resulting Asymptotically Optimal RRT-Connect (AORRTC) finds initial solutions in similar times as RRT-Connect and uses any additional planning time to converge towards the optimal solution in an anytime manner. It is proven to be probabilistically complete and a.s.a.o. AORRTC was tested with the Panda (7 DoF) and Fetch (8 DoF) robotic arms on the MotionBenchMaker dataset. These experiments show that AORRTC finds initial solutions as fast as RRT-Connect and faster than the tested state-of-the-art a.s.a.o. algorithms while converging to better solutions faster. AORRTC finds solutions to difficult high-DoF planning problems in milliseconds where the other a.s.a.o. planners could not consistently find solutions in seconds. This performance was demonstrated both with and without single instruction/multiple data (SIMD) acceleration.
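
The general pattern of turning a satisficing planner into an anytime one can be sketched as a loop that repeatedly asks the planner for any solution cheaper than the best found so far. The sketch below is a toy under stated assumptions: `toy_feasible_planner` is a hypothetical stand-in that only samples costs, not RRT-Connect, and the loop `ao_x` shows only the tightening cost bound. It does not reproduce AORRTC or its guarantees.

```python
import random


def toy_feasible_planner(cost_bound, rng, tries=50):
    """Hypothetical stand-in for a satisficing planner.

    Returns the cost of some solution strictly below cost_bound, or None if
    none is found within `tries` samples. The toy problem's optimum is 1.0.
    """
    for _ in range(tries):
        cost = rng.uniform(1.0, 10.0)
        if cost < cost_bound:
            return cost
    return None


def ao_x(planner, iterations, rng):
    """Repeatedly call a feasible planner with a shrinking cost bound."""
    best = float("inf")
    history = []
    for _ in range(iterations):
        cost = planner(best, rng)
        if cost is not None:
            best = cost          # tighten the bound to the new best cost
            history.append(best)
    return best, history


if __name__ == "__main__":
    best, history = ao_x(toy_feasible_planner, 20, random.Random(0))
    print(best, len(history))
```

The first call runs with an infinite bound, so it behaves like the plain satisficing planner; later calls use the remaining time to improve the solution.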

Abstract:
Autonomous Mobile Robots (AMRs) are revolutionizing industries by enhancing flexibility and efficiency, particularly in dynamic environments such as automotive manufacturing. These environments pose challenges due to their constantly changing layouts, unpredictable obstacles, and varying conditions, which impact the performance of localization systems. This paper presents a novel real-time localization scoring architecture to address these challenges by quantifying the confidence in a robot's positioning system. The proposed Localization Score improves map reconciliation, manages sensor interference, adapts navigation strategies, and enhances traffic coordination. Extensive experimental studies, including real-world deployment in an operational automotive production factory, demonstrate the robustness, accuracy, and adaptability of the developed Localization Score algorithm. The results showcase its potential to significantly enhance the operational efficiency and reliability of AMRs in industrial settings.

Abstract:
Robotic dressing assistance has the potential to improve the quality of life for individuals with limited mobility. Existing solutions predominantly rely on rigid robotic manipulators, which have challenges in handling deformable garments and ensuring safe physical interaction with the human body. Prior robotic dressing methods require excessive operation times, complex control strategies, and constrained user postures, limiting their practicality and adaptability. This paper proposes a novel soft robotic dressing system, the Self-Wearing Adaptive Garment (SWAG), which uses an unfurling and growth mechanism to facilitate autonomous dressing. Unlike traditional approaches, the SWAG conforms to the human body through an unfurling-based deployment method, eliminating skin-garment friction and enabling a safer and more efficient dressing process. We present the working principles of the SWAG, introduce its design and fabrication, and demonstrate its performance in dressing assistance. The proposed system demonstrates effective garment application across various garment configurations, presenting a promising alternative to conventional robotic dressing assistance.

Abstract:
Bimanual teleoperation tasks are highly demanding for human operators, requiring the simultaneous control of two robotic arms while managing complex coordination and cognitive load. Current approaches to this challenge often rely on rigid control schemes or task-specific automation that do not adapt well to dynamic environments or varied operator needs. This paper presents a novel large language model (LLM)-aided bimanual teleoperation assistant (BTLA) that helps operators control dual-arm robots through an intuitive voice command interface and variable autonomy. The BTLA system enables a hybrid control paradigm by combining natural language interaction for an assistive robot arm with direct teleoperation of the dominant robotic arm. Our system implements six core manipulation skills with varying autonomy, ranging from direct mirroring to autonomous object manipulation. The BTLA leverages the LLM to interpret natural language commands and select an appropriate assistance mode based on task requirements and operator preferences. Experimental validation on bimanual object manipulation tasks demonstrates that the BTLA system yields a 240.8% increase in success rate over solo teleoperation and a 69.9% increase over dyadic teleoperation, while significantly reducing operator mental workload. In addition, we validate our approach on a physical dual-arm UR3e robot system, achieving a 90% success rate on challenging soft-bottle handling and box-transportation tasks.

Abstract:
This paper proposes a control strategy for flexible link manipulators, based on VB-PSMC, that preserves high tracking accuracy in free motion while ensuring smooth and safe recovery in scenarios involving physical interaction or large positional errors. The scheme is extended to compensate for the manipulator's flexural dynamics, resulting in a nested control scheme where damping of the induced oscillations is achieved by a model-free proportional strain feedback while gravity-induced deflections are counteracted by a feed-forward term based on a quasi-static Euler-Bernoulli beam model. A convergence study on the modified sliding manifold and a stability analysis of the closed-loop system are provided. The performance of the controller was evaluated experimentally and compared against other control strategies such as PSMC and torque-limited PD control. The results demonstrate the controller's accurate end-effector tracking in free motion, while achieving compliant behavior during contact by efficiently handling the link's inherent flexibility, leading to up to a 32% reduction in interaction force. In addition, studying the FL-VB-PSMC response after releasing contact demonstrated overdamped and vibration-free recovery even for large position errors.

Abstract:
LiDAR-based global localization provides accurate robot pose estimates against a prior map. Existing deep-learning methods, however, demand heavy computation and long training or inference times and degrade sharply when faced with domain shifts. This letter presents LighterBEV, a lightweight, fast, and generalizable localization method. An Informative Compression Module achieves a fourfold reduction in local-feature dimensionality while improving accuracy. We further integrate online learning to enable rapid post-deployment adaptation, mitigating degradation under distribution shift. Extensive experiments on four large-scale datasets show that LighterBEV achieves state-of-the-art performance with limited training data, maintains high accuracy under domain shift, and runs in real time on resource-constrained hardware, supporting both inference and online updates. To our knowledge, LighterBEV is the first LiDAR global localization approach to incorporate online learning for automatic adaptation to new environments, thereby narrowing the domain gap. Code will be released at: https://github.com/npu-iusl-lab/LighterBEV.

Abstract:
3D mapping is vital for a broad range of applications that rely on a consistent and accurate representation of the environment. Change is an ever-persistent force in our world, and as a scene evolves its 3D map becomes outdated. Thus, a mapping framework that can adapt and refine 3D maps as the scene changes is necessary. In this paper, we propose a lifelong mapping framework where map maintenance is based on two objectives: preservation of static structures and refinement of the 3D map. To preserve only the static structures, we classify each object's state and remove the dynamic objects and the quasi-static objects, i.e., objects which temporarily appear static. For classifying the state of objects, we propose a discrete probabilistic solution utilizing a factor graph. Using this classification, we generate static maps from multiple sessions which are used for map refinement. The refinement is based on change detection and map update, leveraging semantic and geometric information. For the evaluation, we collect a multi-campus lifelong dataset as an extension of the MCD datasets from KTH and NTU campuses. The proposed approach is capable of accurately detecting quasi-static objects even in highly dynamic environments. Our system demonstrates state-of-the-art performance in large-scale environments. Furthermore, our approach can handle both SLAM-generated and survey-grade maps.

Abstract:
Trajectory optimization for robotic systems remains a challenging problem. This is especially true for robotic systems featuring nonlinear dynamics and many degrees of freedom. Data-based or model-free diffusion has recently been popularized in the fields of artificial intelligence and trajectory optimization. Model-Based Diffusion provides a data-free method of trajectory optimization, trained at runtime on a system dynamics model, suitable for high-dimensional models. This paper examines how importance sampling can enhance the performance of Model-Based Diffusion for trajectory optimization. We quantify the benefits of importance sampling across three long horizon planning tasks. These results show as much as a 13x improvement in sample efficiency depending on environment and optimization parameters.

Abstract:
Manipulating clusters of deformable objects presents a substantial challenge with widespread applicability, but requires contact-rich whole-arm interactions. A potential solution must address the limited capacity for realistic model synthesis, high uncertainty in perception, and the lack of efficient spatial abstractions, among others. We propose a novel framework for learning model-free policies integrating two modalities: 3D point clouds and proprioceptive touch indicators, emphasising manipulation with full body contact awareness, going beyond traditional end-effector modes. Our reinforcement learning framework leverages a distributional state representation, aided by kernel mean embeddings, to achieve improved training efficiency and real-time inference. Furthermore, we propose a novel context-agnostic occlusion heuristic to clear deformables from a target region for exposure tasks. We deploy the framework in a power line clearance scenario and observe that the agent generates creative strategies leveraging multiple arm links for de-occlusion. Finally, we perform zero-shot sim-to-real policy transfer, allowing the arm to clear real branches with unknown occlusion patterns, unseen topology, and uncertain dynamics.

Abstract:
Robotic navigation in complex environments remains a critical research challenge. Traditional navigation focuses on optimal trajectory generation within free space, struggling in environments lacking viable paths to the goal, such as disaster zones or cluttered warehouses. To address this gap, we propose an adaptive interactive navigation approach that proactively interacts with environments to create feasible paths to reach unavailable goals. Specifically, we present a primitive tree for task planning with large language models (LLMs), facilitating effective reasoning to determine interaction objects and sequences. For subtask execution, we adopt reinforcement learning to pre-train a skill library containing versatile locomotion and interaction behaviors. Furthermore, we introduce an adaptive replanning method featuring two LLM-based modules: an advisor serving as a flexible replanning trigger and an arborist for autonomous plan adjustment. Integrated with the tree structure, the replanning mechanism allows for rapid plan modification in unknown environments. Comprehensive simulations and experiments have demonstrated our method's effectiveness and adaptivity in diverse scenarios.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive progress in high-fidelity scene reconstruction within visual SLAM. However, existing approaches often suffer from scene inconsistency, leading to visual artifacts, and the explicit maintenance of millions of Gaussians imposes significant storage overhead. To address these limitations, we present a unified Neural Gaussian SLAM with feature splatting, which represents the spatial scene as a coherent feature space while encoding view direction, distance, and position into neural Gaussians. Arbitrary image modalities, including color, depth, normals, semantics, and even language, can be decoded from this feature space. Extensive evaluations on several challenging datasets show that our method achieves state-of-the-art performance in rendering quality, reconstruction accuracy, and pose estimation.

Abstract:
Service robots require instruction-following capabilities to perform various tasks regardless of environmental changes. A task planner must accurately infer user intent even when human instructions are ambiguous. To this end, we propose TIGER, a task planning framework that generates reliable action sequences by deriving immutable subgoals from instructions. TIGER employs an Immutable Subgoal Planner (ISP) to decompose instructions into environment-independent subgoals and a Target Grounder (TG) to ground abstract keywords to real-world objects via visual perception and reasoning. A task-representative one-shot strategy improves subgoal generation using only seven annotated examples. TIGER outperformed LLM-Planner in the ALFRED benchmark, increasing success rates from 15.09% to 35.06% on the seen set and from 19.73% to 42.57% on the unseen set. Its scalability was also verified in real-world experiments with a UR5e robot.

Abstract:
Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using that interface, we assemble a large training corpus: an embodiment-matched synthetic corpus (100K simulated trajectories, 6.5M frames) and a real-world dataset of 10K teleoperated episodes (2.92M frames). To mitigate noisy teleoperation demonstrations, we propose a data-quality-aware training recipe: an offline discriminator provides clip-level weights for diffusion-transformer policy training, down-weighting low-quality demonstrations. Empirically, Dexora outperforms competitive VLA baselines on both basic and dexterous benchmarks (e.g., average dexterous success 66.7% vs. 51.7%), attains 90% success on basic tasks, and shows robust out-of-distribution and cross-embodiment generalization. Ablations confirm the importance of real data and the discriminator for dexterity.

Abstract:
Spherical robots rolling on flat ground often exhibit a wobbling motion that, at higher speeds, can escalate into end-over-end flipping. This paper proposes a fundamental dynamic cause of this instability: a relaxation effect analogous to the Intermediate Axis Theorem. Rotating bodies with oblate inertial profiles under dissipative loads tend to reorient toward spinning about their major moment of inertia, leading to the observed wobbling in spherical robots. While relaxation dynamics are well-studied in satellites and asteroids, this effect has not been previously applied to rolling systems. We extend these methods to constrained spherical robots, derive the governing dynamics, and conduct experiments with an empty shell on a slope and a reduced pendulum on flat ground and in water to aid in the discussion. Results suggest that translational rolling constraints act as a pseudo-dissipative load to drive the relaxation effect. This work bridges the fields of satellite dynamics theory and ground robotics, providing new insights into the stability of high-speed rolling robots to influence future hardware and control design choices.

Abstract:
Robust Simultaneous Localization and Mapping (SLAM) is a crucial enabler for autonomous navigation in natural, semi-structured environments such as parks and gardens. However, these environments present unique challenges for SLAM due to frequent seasonal changes, varying light conditions, and dense vegetation. These factors often degrade the performance of visual SLAM algorithms originally developed for structured urban environments. To address this gap, we present ROVER, a comprehensive benchmark dataset tailored for evaluating visual SLAM algorithms under diverse environmental conditions and spatial configurations. We captured the dataset with a robotic platform equipped with monocular, stereo, and RGBD cameras, as well as inertial sensors. It covers 39 recordings across five outdoor locations, collected through all seasons and various lighting scenarios, i.e., day, dusk, and night with and without external lighting. With this novel dataset, we evaluate several traditional and deep learning-based SLAM methods and study their performance in diverse challenging conditions. The results demonstrate that while stereo-inertial and RGBD configurations generally perform better under favorable lighting and moderate vegetation, most SLAM systems perform poorly in low-light and high-vegetation scenarios, particularly during summer and autumn. Our analysis highlights the need for improved adaptability in visual SLAM algorithms for outdoor applications, as current systems struggle with dynamic environmental factors affecting scale, feature extraction, and trajectory consistency. This dataset provides a solid foundation for advancing visual SLAM research in real-world, semi-structured environments, fostering the development of more resilient SLAM systems for long-term outdoor localization and mapping. The dataset and the code of the benchmark are available under https://iis-esslingen.github.io/rover.

Abstract:
In real-world scenarios, multi-view cameras are typically employed for fine-grained manipulation tasks. Existing approaches (e.g., ACT) tend to treat multi-view features equally and directly concatenate them for policy learning. However, this introduces redundant visual information and incurs higher computational costs, leading to ineffective manipulation. Fine-grained manipulation tasks typically consist of multiple stages, where the best view may vary across different phases. This paper proposes a plug-and-play Best-Feature-Aware (BFA) fusion strategy for multi-view manipulation tasks, which is adaptable to various policies. Building upon the visual backbone of the policy network, we design a lightweight subnetwork to effectively predict the importance score of each view. Based on the predicted importance scores, the reweighted multi-view features are subsequently fused and fed into the end-to-end policy network for seamless integration. Notably, our method demonstrates outstanding performance in fine-grained manipulations. The experimental results show that our approach outperforms multiple baselines by 22-46% in success rate on different tasks. Our work provides new insights and inspiration for tackling key challenges in fine-grained manipulations.
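
The reweight-then-fuse step can be illustrated with a small sketch. Here each view contributes a feature vector, per-view importance scores are normalised with a softmax, and the fused feature is the weighted sum. The softmax normalisation, the weighted-sum fusion, and the name `fuse_views` are assumptions chosen for illustration; the abstract does not state how BFA normalises or combines the reweighted features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(features, scores):
    """Fuse per-view feature vectors using importance scores.

    features: list of equal-length feature vectors, one per camera view.
    scores:   list of raw importance scores, one per view (assumed to come
              from a small scoring subnetwork, which is not modelled here).
    Returns (fused_vector, weights).
    """
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


if __name__ == "__main__":
    views = [[1.0, 0.0], [0.0, 1.0]]
    print(fuse_views(views, [0.0, 0.0]))  # equal scores: plain average
    print(fuse_views(views, [5.0, 0.0]))  # first view dominates
```

Compared with concatenation, a weighted sum keeps the fused feature the same size regardless of the number of views.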

Abstract:
Existing aquatic robotic vehicles tend to be large, heavy, and difficult to deploy. This often renders them unsuitable for monitoring delicate aquatic habitats and hard-to-access areas. We present a comprehensive framework for the design and development of sailing micro aerial vehicles (SailMAVs), whose combination of flight and sailing capabilities is highly valuable for sensing missions in aquatic environments. This concept allows for quick hand-launch deployment from land, access to remote areas, rapid multipoint sampling at six locations, and easy movement between separate water bodies. Our framework places particular emphasis on the complex aero-hydrodynamic design, ensuring dual use of subsystems in both locomotion modes, which in turn maximizes performance and reduces redundant payloads. The small scale of the robots considered represents a particular challenge, in terms of both practical design aspects and the underlying physics. In addition to the hardware design, control laws are derived to allow for automated long-duration mission execution. To illustrate the proposed framework, a robotic prototype is presented, analyzed, and tested as an example. The developed design and control laws are validated in autonomous outdoor sailing missions, demonstrating the effectiveness of the framework. The prototype is further employed in remote sensing missions, demonstrating the use of SailMAVs for passive acoustic monitoring (PAM) of aquatic environments. The data obtained demo

Abstract:
Leveraging pretrained Vision-Language Models (VLMs) to map language instructions and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA's attention to task-relevant factors by prepending the question "In which direction is the [object] relative to the robot?" to the language instruction and aligning the model's output answer (right/left/up/down/front/back/grasped) and predicted actions with the ground truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of InSpire. Code, pretrained models and demos are publicly available at: https://koorye.github.io/Inspire.
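
The prompt construction described above can be sketched as plain string handling. The question template and the answer vocabulary come from the abstract; the function name `build_inspire_prompt`, the single-space separator, and the example instruction are hypothetical and do not reflect the released code.

```python
# Answer vocabulary listed in the abstract.
DIRECTION_ANSWERS = ("right", "left", "up", "down", "front", "back", "grasped")


def build_inspire_prompt(instruction, target_object):
    """Prepend the spatial-reasoning question to a language instruction.

    The separator (a single space) is an assumption for illustration.
    """
    question = f"In which direction is the {target_object} relative to the robot?"
    return f"{question} {instruction}"


if __name__ == "__main__":
    print(build_inspire_prompt("Pick up the red block.", "red block"))
```

During training, the model's answer to the question would be supervised against one of `DIRECTION_ANSWERS` alongside the predicted actions.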

Abstract:
Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models are made publicly available.

Abstract:
Learning from real-world robot demonstrations holds promise for interacting with complex real-world environments. However, the complexity and variability of interaction dynamics often cause purely positional controllers to struggle with contacts or varying payloads. To address this, we propose a Heterogeneous Meta-Control (HMC) framework for Loco-Manipulation that adaptively stitches multiple control modalities: position, impedance, and hybrid force-position. We first introduce an interface, HMC-Controller, for blending actions from different control profiles continuously in the torque space. HMC-Controller facilitates both teleoperation and policy deployment. Then, to learn a robust force-aware policy, we propose HMC-Policy to unify different controllers into a heterogeneous architecture. We adopt a mixture-of-experts style routing to learn from large-scale position-only data and fine-grained force-aware demonstrations. Experiments on a real humanoid robot show over 50% relative improvement vs. baselines on challenging tasks such as compliant table wiping and drawer opening, demonstrating the efficacy of HMC.
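
One simple way to picture blending control profiles in torque space is a convex combination of the torque commands that each controller proposes. The sketch below shows only that idea; the weighting scheme, the function name `blend_torques`, and the example values are assumptions for illustration and do not describe the actual HMC-Controller.

```python
import numpy as np


def blend_torques(torques, weights):
    """Blend per-controller joint torques with normalized non-negative weights.

    torques: (K, n_joints) array, one row per control profile
             (e.g. position, impedance, hybrid force-position).
    weights: (K,) non-negative blending weights; they are normalized to sum to 1.
    """
    torques = np.asarray(torques, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if torques.shape[0] != weights.shape[0]:
        raise ValueError("need one weight per control profile")
    if np.any(weights < 0) or weights.sum() <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()
    return weights @ torques  # weighted sum over control profiles


if __name__ == "__main__":
    tau = blend_torques([[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]], [1.0, 1.0, 2.0])
    print(tau)
```

Because the weights vary continuously, the blended command changes smoothly as one profile is faded out and another is faded in.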

Abstract:
Event-based cameras are bio-inspired sensors with pixels that independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle visual tasks in challenging scenarios. However, due to the sparse information content in individual events, directly processing the raw event data to solve vision tasks is highly inefficient, which severely limits the applicability of state-of-the-art methods in real-time tasks, such as motion segmentation, a fundamental task for dynamic scene understanding. Incorporating normal flow as an intermediate representation to compress motion information from event clusters within a localized region provides a more effective solution. In this work, we propose a normal flow-based motion segmentation framework for event-based vision. Leveraging the dense normal flow directly learned from event neighborhoods as input, we formulate the motion segmentation task as an energy minimization problem solved via graph cuts, and optimize it iteratively with normal flow clustering and motion model fitting. By using a normal flow-based motion model initialization and fitting method, the proposed system is able to efficiently estimate the motion models of independently moving objects with only a limited number of candidate models, which significantly reduces the computational complexity and ensures real-time performance, achieving nearly an 800x speedup in comparison to the open-source state-of-the-art method. Extensive evaluations on multiple public datasets fully demonstrate the accuracy and efficiency of our framework. Our code will be open-sourced to facilitate further research in this field.

Abstract:
Neuromorphic hardware and spiking neural networks (SNNs) offer a bio-inspired path to low-latency, energy-efficient computation by emulating the brain's asynchronous, spike-based processing. This is particularly attractive for resource-constrained robots that are tightly limited in size, weight, and power. We propose a neuromorphic approach to real-time optical flow estimation tailored to the SynSense Speck system-on-chip, which integrates a Dynamic Vision Sensor (DVS) with a neuromorphic processor. Our inference architecture combines spiking and artificial neural layers in a hybrid SNN-ANN framework, enabling the use of Speck to perform regression for closed-loop drone control, an application not previously demonstrated on this chip. Despite its compact form factor, the system produces dense flow in real time and achieves stable indoor hover and forward flight using flow-based control. The hybrid pipeline runs ~2x faster than an ANN-only baseline at identical power, highlighting the promise of neuromorphic sensing and processing for ultra-efficient autonomous flight in real-world scenarios. Code and data are available at: https://mavlab.tudelft.nl/speck-optical-flow

Abstract:
Robust and accurate perception of dynamic objects and map elements is crucial for autonomous vehicles performing safe navigation in complex traffic scenarios. While vision-only methods have become the de facto standard due to their technical advances, they can benefit from effective and cost-efficient fusion with radar measurements. In this work, we advance fusion methods by repurposing Gaussian Splatting as an efficient universal view transformer that bridges the view disparity gap, mapping both image pixels and radar points into a common Bird's-Eye View (BEV) representation. Our main contribution is GaussianCaR, an end-to-end network for BEV segmentation that, unlike prior BEV fusion methods, leverages Gaussian Splatting to map raw sensor information into latent features for efficient camera-radar fusion. Our architecture combines multi-scale fusion with a transformer decoder to efficiently extract BEV features. Experimental results demonstrate that our approach achieves performance on par with, or even surpassing, the state of the art on BEV segmentation tasks (57.3%, 82.9%, and 50.1% IoU for vehicles, roads, and lane dividers) on the nuScenes dataset, while maintaining a 3.2x faster inference runtime. Code and project page are available online.

Abstract:
Recent advances in robotic manipulation have highlighted the effectiveness of learning from demonstration. However, while end-to-end policies excel in expressivity and flexibility, they struggle both in generalizing to novel object geometries and in attaining a high degree of precision. An alternative, object-centric approach frames the task as predicting the placement pose of the target object, providing a modular decomposition of the problem. Building on this goal-prediction paradigm, we propose a hierarchical, disentangled point diffusion framework that achieves state-of-the-art performance in placement precision, multi-modal coverage, and generalization to variations in object geometries and scene configurations. Specifically, we model global scene-level placements through a novel feed-forward Dense Gaussian Mixture Model (GMM) that yields a spatially dense prior over global placements; we then model the local object-level configuration through a novel disentangled point cloud diffusion module that separately diffuses the object geometry and the placement frame, enabling precise local geometric reasoning. Interestingly, we demonstrate that our point cloud diffusion achieves substantially higher accuracy compared to prior approaches based on SE(3) diffusion, even in the context of rigid object placement. We validate our approach across a suite of challenging tasks in simulation and in the real world on high-precision industrial insertion tasks. Furthermore, we present results on a cloth-hanging task in simulation, indicating that our method can further relax assumptions on object rigidity. Visualizations and supplementary materials can be found on our project website: https://3dgp-icra2026.github.io/.

Abstract:
For Deep Reinforcement Learning (DRL) models to deliver actual utility, they must function within production environments, which often lack the extensive computational resources of training environments. Requiring dedicated GPU resources at deployment is not economically feasible and can be especially prohibitive in low-cost robotic contexts. Neural network quantization serves as a viable solution to these constraints. This technique aims to lessen computational and memory requirements while maintaining performance. By reducing the precision of the DRL network weights and the network input (sensory observations), the deployment size can be compacted to fit within MCU-class devices, while ensuring that inference operates at adequate frequencies. This paper investigates the impact of quantization on DRL policies and presents a quantization-friendly network architecture for the Soft Actor-Critic (SAC) and TD3 algorithms. We propose a streamlined actor network optimized for inference-only deployments and quantization, and integrate a GRU-based encoder into the DRL framework using a custom, quantization-compatible implementation. The changes enable both to be quantized to integer precision. We then deploy the quantized policies on a microcontroller-scale device (ESP32-S3) to control a low-cost quadrupedal robot using only proprioception and on-board inference.
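
To make the idea of reducing weight precision concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, a common textbook scheme. It is an assumption that this resembles the scheme used in the paper; the abstract only states that the networks are quantized to integer precision. The function names are hypothetical.

```python
import numpy as np


def quantize_int8(w):
    """Symmetric per-tensor quantization of float weights to int8.

    Returns the int8 tensor and the float scale such that w ~= q * scale.
    """
    w = np.asarray(w, dtype=np.float32)
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate float weights."""
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    weights = np.array([-1.0, 0.0, 0.3, 1.0], dtype=np.float32)
    q, scale = quantize_int8(weights)
    print(q, scale, np.abs(weights - dequantize(q, scale)).max())
```

Each weight then occupies one byte instead of four, and the rounding error per weight is bounded by about half the scale.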

Abstract:
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches primarily rely on VLMs trained on 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline called VGGT to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient utilization of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module called action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also leads to superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.

Abstract:
Multispectral sensors, which measure multiple wavelength bands beyond the standard red, green, and blue channels, capture richer information than conventional RGB cameras. Such enriched data is especially valuable in visual servoing, where robot control critically depends on image content. However, leveraging multiple spectral bands (typically around a dozen) directly within real-time visual servoing constitutes a significant challenge. The only prior work tackled this problem using a Pixel Selection strategy based on image gradients. This paper introduces a learning-based framework to enhance Multi-Spectral Visual Servoing (MSVS) by fusing data from multispectral cameras into a single, robust representation for control. An autoencoder is employed to compress multispectral inputs into a noise-attenuated 2D image, which is then used within a standard rule-based Direct Visual Servoing (DVS) scheme. Comparison experiments both with simulated data and with a real robot in complex and unstructured environments show that the proposed learning-based fusion maintains stable convergence and improves positioning accuracy under noisy conditions while preserving computational efficiency.

Abstract:
Estimating human gaze targets from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues (including eyes, head poses, gestures, and contextual features) demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at: https://huggingface.co/zdai257/GazeMoE.

Abstract:
Learning from human demonstrations has enabled robots to acquire a wide range of manipulation skills, but learned policies typically execute far slower than ordinary humans. This speed gap is mainly due to the lack of an interface for collecting demonstration data at high speed, and the difficulty in training policies that can robustly execute high-speed motions. In this paper, we present ALOHA Lightning, a system for learning fast and precise robotic manipulation. Our system uses kinesthetic teaching to intuitively collect near-human-speed demonstrations on a backdrivable bimanual platform, yielding natural and fast trajectories. We also present a learning pipeline that enables smooth high-speed execution through test-time action smoothing and aligns the visual data distribution between data collection and deployment with masking. Given 50 demonstrations for each task, ALOHA Lightning autonomously completes tasks such as folding shorts, battery insertion, and bussing tables with success rates above 80% at or close to human speed.

Abstract:
3D meshes are a fundamental representation widely used in computer science and engineering. In robotics, they are particularly valuable because they capture objects in a form that aligns directly with how robots interact with the physical world, enabling core capabilities such as predicting stable grasps, detecting collisions, and simulating dynamics. Although automatic 3D mesh generation methods have shown promising progress in recent years, potentially offering a path toward real-time robot perception, two critical challenges remain. First, generating high-fidelity meshes is prohibitively slow for real-time use, often requiring tens of seconds per object. Second, mesh generation by itself is insufficient. In robotics, a mesh must be contextually grounded, i.e., correctly segmented from the scene and registered with the proper scale and pose. Additionally, unless these contextual grounding steps remain efficient, they simply introduce new bottlenecks. In this work, we introduce an end-to-end system that addresses these challenges, producing a high-quality, contextually grounded 3D mesh from a single RGB-D image in under one second. Our contribution is a system level design that integrates open-vocabulary object segmentation, accelerated diffusion-based mesh generation, and robust point cloud registration, each optimized for both speed and accuracy. We demonstrate its effectiveness in a real-world manipulation task, showing that it enables meshes to be used as a practical, on-demand representation for robotics perception and planning. Open-source code and videos are located at the paper website: https://apollo-lab-yale.github.io/26-ICRA-subsecond-mesh-gen-website/

Abstract:
Accurate physics simulation is essential for robotic learning and control, yet analytical simulators often fail to capture complex contact dynamics, while learning-based simulators typically require large amounts of costly real-world data. To bridge this gap, we propose a few-shot real-to-sim approach that combines the physical consistency of analytical formulations with the representational capacity of graph neural network (GNN)-based models. Using only a small amount of real-world data, our method calibrates analytical simulators to generate large-scale synthetic datasets that capture diverse contact interactions. On this foundation, we introduce a mesh-based GNN that implicitly models rigid-body forward dynamics and derive surrogate gradients for collision detection, achieving full differentiability. Experimental results demonstrate that our approach enables learning-based simulators to outperform differentiable baselines in replicating real-world trajectories. In addition, the differentiable design supports gradient-based optimization, which we validate through simulation-based policy learning in multi-object interaction scenarios. Extensive experiments show that our framework not only improves simulation fidelity with minimal supervision but also increases the efficiency of policy learning. Taken together, these findings suggest that differentiable simulation with few-shot real-world grounding provides a powerful direction for advancing future robotic manipulation and control.

Abstract:
LiDAR-event camera integration has shown considerable promise and is gaining traction across various perception applications. Event cameras offer high temporal resolution and wide dynamic range but suffer from noise sensitivity and lack depth information. LiDAR complements these capabilities by providing absolute scale and robustness, yet accurate calibration between the two sensors remains a significant challenge. This paper presents a targetless calibration framework for LiDAR-event camera systems that removes dependence on dedicated calibration targets and strong initial assumptions. The method estimates the event camera angular velocity by analyzing per-pixel timestamp and spatial changes, enabling precise detection of natural edges. Calibration proceeds in two stages: (i) motion-based initialization, where Canonical Correlation Analysis (CCA) on rotational estimates from the event camera and LiDAR jointly recovers the temporal offset and rotation; (ii) nonlinear refinement of the extrinsics via cross-modal alignment of natural edge features. Experiments on physical platforms and public datasets demonstrate robust performance and high calibration accuracy across diverse scenarios. This work provides a solid foundation for further development and application of LiDAR-event camera fusion.

Abstract:
Collision detection is a core component of robotics applications such as simulation, control, and planning. Traditional algorithms like GJK+EPA compute witness points (the closest or deepest-penetration pairs between two objects) but are inherently non-differentiable, preventing gradient flow and limiting gradient-based optimization in contact-rich tasks such as grasping and manipulation. Recent work introduced efficient first-order randomized smoothing to make witness points differentiable; however, their direction-based formulation is restricted to convex objects and lacks robustness for complex geometries. In this work, we propose a robust and efficient differentiable collision detection framework that supports both convex and concave objects across diverse scales and configurations. Our method introduces distance-based first-order randomized smoothing, adaptive sampling, and equivalent gradient transport for robust and informative gradient computation. Experiments on complex meshes from DexGraspNet and Objaverse show significant improvements over existing baselines. Finally, we demonstrate a direct application of our method for dexterous grasp synthesis to refine the grasp quality. The code is available at https://github.com/JYChen18/DiffCollision.

Abstract:
Existing datasets for training generalist manipulation policies often lack diversity in object variety and initial states, limiting the range of physically grounded interactions present in them. Consequently, these policies struggle with unseen object shapes, sizes, or unfamiliar object poses. Manually collecting real-world trajectories with diverse physical interactions is tedious, time-consuming, and expensive, underscoring the need to generate these autonomously. Simulators offer a scalable pathway to autonomously generate trajectories by enabling extensive variation not only in tasks (e.g., objects, object properties, and initial conditions), but also in the robot behaviors required to solve these tasks. We develop a data generation pipeline that autonomously produces physically grounded trajectories in simulation using video diffusion models. Our approach first simulates random initial conditions across various tasks using a diverse asset library. A video diffusion model generates videos of a robot performing these tasks in physically diverse scenarios, which are then fed to a learned goal-conditioned planner to extract actions that closely follow the generated videos. Unlike prior trajectory generation methods, our pipeline generalizes to new objects across multiple tasks without relying on human demonstrations. Using our approach, we generate a simulation dataset PHYSVIVID containing 5k+ demonstrations involving 400+ objects. We demonstrate the effectiveness of PHYSVIVID by fine-tuning robot policies on it, and demonstrating generalization of policies to unseen objects with varying shapes, textures, and sizes, as well as to unseen object categories.

Abstract:
Urban micromobility applications, such as delivery robots, demand reliable navigation across large-scale urban environments while following long-horizon route instructions. This task is particularly challenging due to the dynamic and unstructured nature of real-world city areas, yet most existing navigation methods remain tailored to short-scale and controllable scenarios. Effective urban micromobility requires two complementary levels of navigation skills: low-level capabilities, such as point-goal reaching and obstacle avoidance, and high-level capabilities, such as route-visual alignment. To this end, we propose UrbanVLA, a route-conditioned Vision-Language-Action (VLA) framework designed for scalable urban navigation. Our method explicitly aligns noisy route waypoints with visual observations during execution, and subsequently plans trajectories to drive the robot. To enable UrbanVLA to master both levels of navigation, we employ a two-stage training pipeline. The process begins with Supervised Fine-Tuning (SFT) using simulated environments and trajectories parsed from web videos. This is followed by Reinforcement Fine-Tuning (RFT) on a mixture of simulation and real-world data, which enhances the model's safety and adaptability in real-world settings. Experiments demonstrate that UrbanVLA surpasses strong baselines by more than 55% in the SocialNav task in MetaUrban. Furthermore, UrbanVLA achieves reliable real-world navigation, showcasing both scalability to large-scale urban environments and robustness against real-world uncertainties.

Abstract:
Recently, Vision-Language-Action (VLA) models have advanced robot imitation learning, but high data collection costs and limited demonstrations hinder generalization, and current imitation learning methods struggle in out-of-distribution scenarios, especially for long-horizon tasks. A key challenge is how to mitigate compounding errors in imitation learning, which lead to cascading failures over extended trajectories. To address these challenges, we propose the Diffusion Trajectory-guided Policy (DTP) framework, which generates 2D trajectories through a diffusion model to guide policy learning for long-horizon tasks. By leveraging task-relevant trajectories, DTP provides trajectory-level guidance to reduce error accumulation. Our two-stage approach first trains a generative vision-language model to create diffusion-based trajectories, then refines the imitation policy using them. Experiments on the CALVIN benchmark show that DTP outperforms state-of-the-art baselines by 25% in success rate, starting from scratch without external pretraining. Moreover, DTP significantly improves real-world robot performance.

Abstract:
Initializing the state of a sensorized platform can be challenging, as a limited set of measurements often provides weakly informative constraints that are, in addition, highly non-linear. This may lead to poor initial estimates that may converge to local minima during subsequent non-linear optimization. We propose an adaptive GNSS-inertial initialization strategy that delays the incorporation of global GNSS constraints until they become sufficiently informative. In the initial stage, our method leverages inter-epoch baseline vector residuals between consecutive GNSS fixes to mitigate inertial drift. To determine when to activate global constraints, we introduce a general criterion based on the evolution of the Hessian matrix's singular values, effectively quantifying system observability. Experiments on EuRoC, GVINS and MARS-LVIG datasets show that our approach consistently outperforms the naive strategy of fusing all measurements from the outset, yielding more accurate and robust initializations.
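
The idea of gating constraints on the singular values of the Hessian can be sketched as follows. The specific rule used here (activate once the smallest singular value of the Gauss-Newton Hessian exceeds a threshold) is an assumption chosen for illustration; the abstract does not state the exact criterion, and the names `accumulate_hessian`, `constraints_informative`, and `min_singular_value` are hypothetical.

```python
import numpy as np


def accumulate_hessian(jacobians):
    """Gauss-Newton Hessian approximation H = sum_i J_i^T J_i."""
    jacobians = [np.atleast_2d(np.asarray(J, dtype=float)) for J in jacobians]
    n = jacobians[0].shape[1]
    H = np.zeros((n, n))
    for J in jacobians:
        H += J.T @ J
    return H


def constraints_informative(hessian, min_singular_value=1e-3):
    """Return True once every state direction is sufficiently constrained.

    The threshold value is a hypothetical tuning parameter.
    """
    s = np.linalg.svd(np.asarray(hessian, dtype=float), compute_uv=False)
    return bool(s.min() >= min_singular_value)


if __name__ == "__main__":
    # Two measurements that only constrain the first state dimension.
    early = accumulate_hessian([[[1.0, 0.0]], [[1.0, 0.0]]])
    # A later measurement that also constrains the second dimension.
    later = accumulate_hessian([[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]]])
    print(constraints_informative(early), constraints_informative(later))
```

In this toy case the Hessian is rank-deficient until a measurement excites the second state dimension, which is the kind of transition such a criterion is meant to detect.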

Abstract:
This paper presents a real-time control framework for nonlinear pure-feedback systems with unknown dynamics to satisfy reach-avoid-stay tasks within a prescribed time in dynamic environments. To achieve this, we introduce a real-time spatiotemporal tube (STT) framework. An STT is defined as a time-varying ball in the state space whose center and radius adapt online using only real-time sensory input. A closed-form, approximation-free control law is then derived to constrain the system output within the STT, ensuring safety and task satisfaction. We provide formal guarantees for obstacle avoidance and on-time task completion. The effectiveness and scalability of the framework are demonstrated through simulations and hardware experiments on a mobile robot and an aerial vehicle, navigating in cluttered dynamic environments.

Abstract:
Traversing terrains with sparse footholds like legged animals presents a promising yet challenging task for quadruped robots, as it requires precise environmental perception and agile control to secure safe foot placement while maintaining dynamic stability. Model-based hierarchical controllers excel in laboratory settings, but suffer from limited generalization and overly conservative behaviors. End-to-end learning-based approaches unlock greater flexibility and adaptability, but existing state-of-the-art methods either rely on heightmaps that introduce noise and complex, costly pipelines, or implicitly infer terrain features from egocentric depth images, often missing critical geometric cues and leading to inefficient learning and rigid gaits. To overcome these limitations, we propose START, a single-stage learning framework that enables agile, stable locomotion on highly sparse and randomized footholds. START leverages only low-cost onboard vision and proprioception to accurately reconstruct a local terrain heightmap, providing an explicit intermediate representation to convey essential features relevant to sparse foothold regions. This supports comprehensive environmental understanding and precise terrain assessment, reducing exploration cost and accelerating skill acquisition. Experimental results demonstrate that START achieves zero-shot transfer across diverse real-world scenarios, showcasing superior adaptability, precise foothold placement, and robust locomotion.

Abstract:
Visual Place Recognition (VPR) is a fundamental task in robotics and computer vision, enabling systems to identify locations seen in the past using visual information. The previous state-of-the-art approach, SegVLAD, focuses on encoding and retrieving semantically meaningful supersegment representations of images to significantly enhance recognition recall rates. However, we find that it struggles to cope with significant variations in viewpoint and scale, as well as scenes with sparse or limited information. Furthermore, these semantic-driven supersegment representations often exclude semantically meaningless yet valuable pixel information. In this work, we present Sel-V and MuSSel-V, two efficient variants within the segment-level VPR paradigm that replace heavy and fragmented supersegments with lightweight, visually compact and complete dilated superpixels for local feature aggregation. The use of superpixels preserves pixel-level details while reducing computational overhead. A multi-scale extension further enhances robustness to viewpoint and scale changes. Comprehensive experiments on twelve public benchmarks show that our approach achieves a better trade-off between accuracy and efficiency than existing segment-based methods. These results demonstrate that lightweight, non-semantic segmentation can serve as an effective alternative for high-performance, resource-efficient visual place recognition in robotics.

Abstract:
Soft robotic manipulators are generally slow despite their great adaptability, resilience, and compliance. This limitation also extends to current soft robotic micromanipulators. Here, we introduce FilMBot, a 3-DOF film-based, electromagnetically actuated, soft kinematic robotic micromanipulator achieving speeds up to 2117°/s and 2456°/s in α and β angular motions, with corresponding linear velocities of 1.61 m/s and 1.92 m/s using a 4-cm needle end-effector, 0.54 m/s along the Z-axis, and 1.57 m/s during Z-axis morph switching. The robot can reach �?.50 m/s in path-following tasks, with an operational bandwidth below �?0 Hz, and remains responsive at 50 Hz. It demonstrates high precision (�?.3 μm, or �?.05% of its workspace) in path-following tasks, with precision remaining largely stable across frequencies. The novel combination of the low-stiffness soft kinematic film structure and strong electromagnetic actuation in FilMBot opens new avenues for soft robotics. Furthermore, its simple construction and inexpensive, readily accessible components could broaden the application of micromanipulators beyond current academic and professional users.

Abstract:
Loco-manipulation demands coordinated whole-body motion to manipulate objects effectively while maintaining locomotion stability, presenting significant challenges for both planning and control. In this work, we propose a whole-body model predictive control (MPC) framework that directly optimizes joint torques through full-order inverse dynamics, enabling unified motion and force planning and execution within a single predictive layer. This approach allows emergent, physically consistent whole-body behaviors that account for the system's dynamics and physical constraints. We implement our MPC formulation using open software frameworks (Pinocchio and CasADi), along with the state-of-the-art interior-point solver Fatrop. In real-world experiments on a Unitree B2 quadruped equipped with a Unitree Z1 manipulator arm, our MPC formulation achieves real-time performance at 80 Hz. We demonstrate loco-manipulation tasks that demand fine control over the end effector's position and force to perform real-world interactions like pulling heavy loads, pushing boxes, and wiping whiteboards.

Abstract:
Structured, model-level information on the world's robot systems remains scarce: existing reports often provide aggregated market statistics, while industry directories typically stop at company-level information. In this work, we present an LLM-assisted, web-grounded analysis pipeline for studying the global robotics landscape at the robot-model level. The method combines company discovery, iterative verification, and model-level extraction of robot type, target industries, release year, and task descriptions from open-web evidence. Applying this pipeline, we study 8,229 robot models associated with 1,062 companies across 50 countries and 6 continents. Our findings reveal strong geographic concentration in the United States, China, and Japan, rapid growth after 2017, and substantial diffusion of robotics beyond manufacturing into logistics, healthcare, education, and household settings. Our work illustrates both the promise and certain limitations of LLM-assisted web analysis for large-scale robotics landscape mapping.

Abstract:
Space robots operate in extreme environments where hardware degradation can critically compromise traditional control strategies. While continual reinforcement learning offers a promising mechanism for online adaptation, it inherently requires access to a reward signal during deployment. However, precise reward computation in space is often infeasible due to the lack of external tracking systems and the overall complexity of the environment. To address the challenge of unobservable rewards, we introduce a reward-free continual learning framework that leverages latent-state world models. By pre-training a model-based agent across diverse simulations, the world model learns a robust predictor of the reward structure within its latent space. Upon deployment to an environment with severe hardware degradation, we freeze the observation encoder and reward predictor to update only the transition dynamics of the world model through unsupervised rollouts. By training the policy entirely on imagined trajectories generated by this updated world model, the agent adapts to altered dynamics without receiving new rewards. We demonstrate our approach across simulated planetary traversal, orbital navigation, and precision assembly tasks subjected to severe morphological failures.

Abstract:
This work proposes a framework that generates and optimally selects task-specific assembly configurations for a large group of homogeneous modular aerial systems, explicitly enforcing bounds on inter-module downwash. Prior work largely focuses on planar layouts and often ignores aerodynamic interference. In contrast, we first enumerate non-isomorphic connection topologies at scale; we then solve a nonlinear program to check feasibility and select the configuration that minimizes control input subject to actuation limits and downwash constraints. We evaluate the framework in physics-based simulation and demonstrate it in real-world experiments.

Abstract:
Deployment of robots in dynamic environments requires reactive trajectory generation. While optimization-based methods, such as Model Predictive Control, focus on constraint verification, Geometric Fabrics offer a computationally efficient way to generate trajectories that include all avoidance behaviors if the environment can be represented as a set of object primitives. Obtaining such a representation from sensor data is challenging, especially in dynamic environments. In this paper, we integrate implicit environment representations, such as Signed Distance Fields and Free Space Decomposition, into the framework of Geometric Fabrics. In the process, we derive how numerical gradients can be integrated into the push and pull operations in Geometric Fabrics. Our experiments reveal that both ground robots and robotic manipulators can be controlled using these implicit representations. Moreover, we show that, unlike the explicit representation, implicit representations can be used in the presence of dynamic obstacles without further considerations. Finally, we demonstrate our methods in the real world, showing the applicability of our approach in practice.
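
As an illustration of the numerical gradients mentioned above, the following minimal sketch computes a central-difference gradient of a signed distance field. It does not reproduce the paper's integration into the push and pull operations; the circular obstacle, the function names, and the step size `eps` are assumptions chosen for the example.

```python
import math

def sdf_circle(x, y, cx=0.0, cy=0.0, radius=1.0):
    """Signed distance to a hypothetical circular obstacle (negative inside)."""
    return math.hypot(x - cx, y - cy) - radius

def numerical_gradient(sdf, x, y, eps=1e-5):
    """Central-difference gradient of a 2D signed distance function."""
    dx = (sdf(x + eps, y) - sdf(x - eps, y)) / (2.0 * eps)
    dy = (sdf(x, y + eps) - sdf(x, y - eps)) / (2.0 * eps)
    return dx, dy

# Outside the obstacle, the gradient points away from it with unit norm.
gx, gy = numerical_gradient(sdf_circle, 3.0, 4.0)
print(gx, gy)
```

For a true signed distance field the gradient has unit norm almost everywhere, which makes it usable as an avoidance direction even when no explicit obstacle primitive is available.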

Abstract:
This paper presents a novel approach to range-based distributed cooperative localization (DCL) for robot swarms in GPS-denied environments, relying solely on inter-robot range measurements, specifically addressing the limitations of current methods in noisy and sparse settings where the geometric non-rigidity of the sensing graph creates flipping (suboptimal) effects in the localization outcomes. We propose a robust multilayered localization framework (DCL-Sparse) that utilizes distributed 1-hop shadow edges (S1-Edge) to address the non-rigidity problem and improve localization convergence in sparse and noisy sensing graphs. Our approach leverages the advantages of distributed localization methods, enhancing scalability and adaptability in large robot networks. We establish theoretical conditions for the new S1-Edge that ensure solutions exist even in the presence of noise, thereby validating the effectiveness of the new shadow edge localization. Extensive simulation and real-world experiments confirm the superior performance of our method compared to state-of-the-art techniques, resulting in a reduction of up to 93% in the localization error in DCL. These experiments demonstrate substantial improvements in localization accuracy and robustness to sparse graphs. DCL-Sparse increases the localizability of large multi-robot and sensor networks, offering a powerful tool for high-performance and reliable operations in challenging large-scale environments.

Abstract:
Despite the recent advancements of vision-language-action (VLA) models on a variety of robotics tasks, they suffer from critical issues such as poor generalizability to unseen tasks, due to their reliance on behavior cloning exclusively from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, thus introducing distribution bias and limiting their adaptability to diverse manipulation objectives, such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs on a trajectory level and implicitly models reward from both successful and failed trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks into independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 58.20%, respectively. Additionally, GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 37.44% and rollout step-length by 11.15%, respectively.

Abstract:
Given a demonstration, a robot should be able to generalize a skill to any object it encounters, but existing approaches to skill transfer often fail to adapt to objects with unfamiliar shapes. Motivated by examples of improved transfer from compositional modeling, we propose a method for improving transfer by decomposing objects into their constituent semantic parts. We leverage data-efficient generative shape models to accurately transfer interaction points from the parts of a demonstration object to a novel object. We autonomously construct an objective to optimize the alignment of those points on skill-relevant object parts. Our method generalizes to a wider range of object geometries than existing work, and achieves successful one-shot transfer for a range of skills and objects from a single demonstration, in both simulated and real environments.

Abstract:
A key requirement for generalist robots is compositional generalization: the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine focused scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.

Abstract:
Accurate torque estimation in robotic actuators with harmonic drives is challenging due to nonlinear hysteresis and efficiency losses, often necessitating external torque sensors. This paper presents a learning-based torque estimation method that leverages encoder-derived features and mechanical compliance to enhance estimation accuracy without additional sensors. An actuator design incorporating a compliant helical tube provides deformation features that are effectively modeled using a Long Short-Term Memory (LSTM) network. Unlike conventional calibration or parametric approaches, the proposed framework captures nonlinear, history-dependent behaviors across varying operating conditions. Experimental evaluations demonstrate that compliant tubes significantly improve estimation accuracy compared with designs using stiffer or even rigid tubes, enabling more robust generalization under different torques, impedance modes, and stiffness levels. These results highlight the importance of co-designing actuator compliance and deep learning models to achieve reliable and compact torque estimation for harmonic drive actuators.

Abstract:
This paper presents a novel control strategy for multi-agent shepherding of non-cohesive targets in obstacle-rich environments. Unlike previous approaches that assume cohesive flocking behavior, our method handles targets that interact only with nearby herders through repulsive forces and exhibit no inter-target coordination. Each herder employs a hybrid control policy that combines direct goal-oriented steering with obstacle-tangent maneuvering, enabling targets to circumnavigate obstacles while being guided toward a goal region. The herder dynamics integrate three key behaviors: return-to-goal motion when idle, target steering with adaptive directional control, and obstacle avoidance using both normal and tangential force components. Numerical simulations demonstrate superior performance compared to existing shepherding methods, achieving higher target confinement rates in cluttered environments. Experimental validation using TurtleBot4 herders and Osoyoo target robots in an indoor arena confirms the practical effectiveness of the proposed approach.

Abstract:
The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
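
The abstract does not state which DPO variant is used. As a sketch only, the snippet below evaluates the standard Direct Preference Optimization loss for a single preference pair, assuming scalar log-probabilities of the preferred and rejected trajectories under the finetuned policy and the frozen pretrained (reference) model. The function name, argument names, and `beta` value are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of trajectories.

    loss = -log sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)])
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the policy already prefers the chosen rollout.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss equals log 2 when the policy and the reference model agree, and it shrinks as the policy raises the likelihood of the preferred trajectory relative to the reference.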

Abstract:
Soft continuum robots are gaining attention for their potential to enable inherently safe and adaptive human-robot collaboration, especially in dynamic industrial environments. However, the development of these robots varies drastically and no standardization exists. This is particularly problematic for soft continuum robots because of the variety of actuation methods and control strategies. This paper addresses the challenge of engineering soft continuum robots by introducing a generalized framework that enables hardware abstraction and controller reuse. The approach combines a modular robot design with an extension of the Unified Robot Description Format (URDF) information model to support soft continuum robotics, enabling the decoupling of hardware and software development. For the concept validation, a modular tendon-driven continuum robot was developed and integrated into the framework. The extended Unified (Continuum) Robot Description Format (U(C)RDF) enables visualization and controller parameterization through standardized interfaces, allowing for reusable software components across different actuation principles. This approach achieves a flexible and scalable engineering process for soft continuum robots, bridging the gap between research prototypes and industrial deployment. It lays the foundation for future developments in model-based design, automated control, and interoperability of soft continuum robotic systems.

Abstract:
Metric Simultaneous Localization and Mapping (SLAM) prioritizes geometric accuracy of estimated robot poses and maps. However, in many real-world robot applications, such as inspection robots operating inside pipelines or other confined network environments, metric accuracy is less critical than correctly capturing the underlying topological connectivity. In this paper, we investigate back-end optimization for topological mapping/SLAM, and propose a probabilistic topological map inference algorithm. Given noisy front-end measurements, our approach explicitly models the topological map inference problem within a factor graph framework. It performs inference using belief propagation, which yields a posterior distribution over multiple plausible topological maps rather than a single estimate. We evaluate our method on topologies derived from an open-source pipeline network dataset, spanning various topology sizes and degrees of perceptual aliasing. Extensive experiments demonstrate that our algorithm infers high-quality topological maps across varying conditions.

Abstract:
Multi-Robot Task Assignment (MRTA) studies the problem of allocating spatially distributed tasks to a fleet of cooperative robots as well as determining the optimal task sequence for each robot. Common objectives include minimizing the task waiting times, minimizing the robot tour lengths, and maximizing the number of serviced tasks within given time windows. However, these objectives do not consider an equitable distribution of the workload among the fleet. Uneven workloads are often undesirable since they can yield solutions where a few robots service most tasks while parts of the fleet remain underused. On the other hand, under fully balanced workloads, robots may insufficiently consider the total operation cost and thus can be deployed in a redundant manner. In this paper, we study MRTA from the viewpoint of multi-objective optimization (MOO), formulating the problem of simultaneously minimizing the costs of individual robot tours. We explore how this treatment allows for attaining more balanced solutions than common formulations using the sum or maximum of tour costs. We present a generalist formulation using a scalar objective and establish theoretical guarantees on the attainable multi-objective trade-offs. Further, we derive an effective heuristic based on a p-norm of tour lengths that is able to find balanced workloads among robots. Our approach is agnostic to the specific choice of MRTA solver and we provide insights into how it can be incorporated into two state-of-the-art algorithms. We demonstrate our approach in experiments for offline and online MRTA setups, including servicing tasks as well as pickup and delivery, and highlight its advantages with respect to balanced workloads compared to state-of-the-art formulations.
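
To illustrate why a p-norm of tour lengths sits between the sum and the maximum, the sketch below scores two hypothetical assignments with the same total length. The function name and the example tour lengths are assumptions; the abstract does not specify the heuristic's exact form or the value of p.

```python
import math

def p_norm_cost(tour_lengths, p):
    """p-norm of non-negative per-robot tour lengths (p >= 1, or math.inf)."""
    if p < 1:
        raise ValueError("p must be >= 1")
    if math.isinf(p):
        return max(tour_lengths)  # limit p -> infinity: the longest tour
    return sum(length ** p for length in tour_lengths) ** (1.0 / p)

# Two hypothetical assignments with identical total tour length.
unbalanced = [10.0, 1.0, 1.0]
balanced = [4.0, 4.0, 4.0]

for p in (1, 2, math.inf):
    print(p, p_norm_cost(unbalanced, p), p_norm_cost(balanced, p))
```

With p = 1 the two assignments are indistinguishable (sum of tours), whereas any p > 1 penalizes the unbalanced assignment more strongly, approaching the min-max objective as p grows.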

Abstract:
Operating Unmanned Aerial Vehicles (UAVs) remains challenging for non-experts because single-modality interfaces distort intent: gesture-only systems depend on discrete vocabularies and mode switches that break continuity and raise cognitive load, while gaze-only control offers limited dimensionality and is vulnerable to Midas-touch and saccadic jitter. We present IntuFly, an intuition-driven hand-gaze framework in which hands draw the path to give continuous 3D translation and eyes set heading and lock targets, preserving intent continuity and reducing effort. To overcome cross-stream asynchrony and noise, our deployment-oriented fusion layer performs timestamp-consistent late fusion with stale-frame dropping and lightweight stabilization, yielding stable closed-loop operation at more than 25 Hz on commodity hardware. In simulation racing, novices fly faster on shorter paths than with a remote controller (RC) baseline, and intermediates select shorter, smoother, yet more conservative lines; subjective scales indicate lower workload and higher usability. In mobile target tracking, adding gaze produces faster responses with near-complete line-of-sight (LOS) coverage under identical limits. The same perception-control stack runs stably on an indoor DJI Tello platform with behavior consistent with simulation, demonstrating sim-to-real feasibility. These results show that IntuFly lowers the learning barrier for non-expert users while preserving fine control and stability, offering a deployable path toward intuitive, continuous human-UAV cooperative flight. Our code is publicly available at https://github.com/Crotonbee/IntuFly.
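
The following is a generic sketch of timestamp-consistent late fusion with stale-frame dropping, not the IntuFly implementation. The `Sample` type, the `fuse` function, and both thresholds (`max_age`, `max_skew`) are hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Sample:
    stamp: float   # acquisition time in seconds
    value: tuple   # e.g. a 3D translation command or a heading command

def fuse(hand: Optional[Sample], gaze: Optional[Sample], now: float,
         max_age: float = 0.1, max_skew: float = 0.05) -> Optional[Tuple[tuple, tuple]]:
    """Return (hand command, gaze command), or None if the pair is unusable."""
    if hand is None or gaze is None:
        return None
    # Drop stale frames: either stream is older than max_age.
    if now - hand.stamp > max_age or now - gaze.stamp > max_age:
        return None
    # Enforce timestamp consistency between the two streams.
    if abs(hand.stamp - gaze.stamp) > max_skew:
        return None
    return hand.value, gaze.value

# Fresh, nearly simultaneous samples are fused; a stale gaze sample is rejected.
print(fuse(Sample(0.98, (0.1, 0.0, 0.2)), Sample(0.97, (15.0,)), now=1.0))
print(fuse(Sample(0.98, (0.1, 0.0, 0.2)), Sample(0.80, (15.0,)), now=1.0))
```

Returning `None` leaves the decision of how to react (hold the last command, hover, and so on) to the caller, which keeps the fusion step independent of the control policy.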

Abstract:
Safety-critical scenarios are essential for the development of autonomous vehicles (AVs) but are rare in real-world driving data. While simulation offers a way to generate such scenarios, manually designed test cases lack scalability, and adversarial optimization often produces unrealistic behaviors. In this work, we introduce a conditional latent flow matching approach for scalable and realistic safety-critical scenario generation. Our method uses distribution matching to transform nominal scenes into safety-critical rollouts. Furthermore, we demonstrate that incorporating both simulation and real-world data enables our framework to efficiently generate diverse, data-driven scenarios. Experimental results highlight that our approach is able to more consistently and realistically generate novel safety-critical scenarios, making it a valuable tool for training and benchmarking AV systems.

Abstract:
Solving complex real-world control tasks often takes multiple tries: if we fail at first, we reflect on what went wrong, and change our strategy accordingly to avoid making the same mistake. In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks, but lack the ability to contextually and dynamically readjust behavior when they fail to accomplish a task. In this work, we introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences by including them in-context, allowing it to learn the affordances and capabilities of the low-level VLA. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution and draws useful conclusions to be included in future reasoning contexts. Unlike similar approaches to self-refinement in non-robotics domains, LITEN must reflect on unstructured real-world robot trajectories (e.g., raw videos), which requires careful treatment during assessment. Our experimental results demonstrate LITEN is able to effectively learn from past experience to generate plans that use high-affordance instructions to accomplish long-horizon tasks.

Abstract:
Robots learn reward functions from user demonstrations, but these rewards often fail to generalize to new environments. This failure occurs because learned rewards latch onto spurious correlations in training data rather than the underlying human intent that demonstrations represent. Existing methods leverage visual or semantic similarity to improve robustness, yet these surface-level cues often diverge from what humans actually care about. We present Generalizing Intent for Flexible Test-Time rewards (GIFT), a framework that grounds reward generalization in human intent rather than surface cues. GIFT leverages language models to infer high-level intent from user demonstrations by contrasting preferred with non-preferred behaviors. At deployment, GIFT maps novel test states to behaviorally equivalent training states via intent-conditioned similarity, enabling learned rewards to generalize across distribution shifts without retraining. We evaluate GIFT on tabletop manipulation tasks with new objects and layouts. Across four simulated tasks with over 50 unseen objects, GIFT consistently outperforms visual and semantic similarity baselines in test-time pairwise win rate and state-alignment F1 score. Real-world experiments on a 7-DoF Franka Panda robot demonstrate that GIFT reliably transfers to physical settings. Further discussion can be found at https://mit-clear-lab.github.io/GIFT/

Abstract:
Manipulators are essential for advancing orchard robotics tasks such as pruning and harvesting, which require precise, dexterous motion in cluttered and unstructured environments. Off-the-shelf industrial arms, while readily available, often lack the reach and dexterity required for these settings. In this paper we present a simulation-driven, multi-objective optimization framework for task-specific manipulator kinematics, leveraging the NSGA-II evolutionary algorithm and physics-based evaluation. Candidate designs are encoded with high-level parameters -- joint type, axis orientation, link length, and joint count -- then automatically generated as URDF models and evaluated in simulation for reachability, manipulability, torque demand, and motion planning cost. Trade-offs are revealed on a Pareto front, enabling exploration across diverse designs. The framework is demonstrated on a real-world tree pruning task, using collected 3D scans of expert-pruned trees and an automated prune point identification pipeline to generate target points to guide the optimization. Results show that the proposed approach produces task-specific manipulator designs with improved workspaces and reduced operational constraints compared to a commercial industrial arm, offering a viable pathway toward deployable agricultural manipulation systems.
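
The Pareto front mentioned above can be illustrated with a minimal non-dominated filter for minimization objectives. This is only the dominance test that underlies NSGA-II, which additionally ranks all fronts and uses crowding distance; the objective pairs below are made-up values, not results from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Return the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (torque demand, motion planning cost) pairs for five designs.
designs = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.5, 3.5)]
print(pareto_front(designs))
```

Objectives that should be maximized, such as reachability or manipulability, can be negated before filtering so that the same minimization test applies.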

Abstract:
Several approaches have been proposed to improve the sample efficiency of online reinforcement learning (RL) by leveraging demonstrations collected offline. The offline data can be used directly as transitions to optimize RL objectives, or offline policy and value functions can first be learned from the data and then used for online finetuning or to provide reference actions. While each of these strategies has shown compelling results, it is unclear which method has the most impact on sample efficiency, whether these approaches can be combined, and if there are cumulative benefits. We classify existing demonstration-augmented RL approaches into three categories and perform an extensive empirical study of their strengths, weaknesses, and combinations to isolate the contribution of each strategy and determine effective hybrid combinations for sample-efficient online RL. Our analysis reveals that directly reusing offline data and initializing with behavior cloning consistently outperform more complex offline RL pretraining methods for improving online sample efficiency.

Abstract:
A deep understanding of kinematic structures is essential for robot motion and interaction with the environment. Such understanding is captured through articulated objects, which are essential for physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or strong assumptions from hand-curated datasets. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static 3D geometry. To achieve this, we combine Monte Carlo Tree Search (MCTS) for structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid models. We evaluate Kinematify on diverse inputs from both synthetic environments and the real world, demonstrating improvements in registration and kinematic topology accuracy over prior work.

Abstract:
We present Graphite, a GPU-accelerated nonlinear least squares graph optimization framework. It provides a CUDA C++ interface to enable the sharing of code between a real-time application, such as a SLAM system, and its optimization tasks. The framework supports techniques to reduce memory usage, including in-place optimization, support for multiple floating point types and mixed-precision modes, and dynamically computed Jacobians. We evaluate Graphite on well-known bundle adjustment problems and find that it achieves similar performance to MegBA, a solver specialized for bundle adjustment, while maintaining generality and using less memory. We also apply Graphite to global visual-inertial bundle adjustment on maps generated from stereo-inertial SLAM datasets, and observe speed-ups of up to 59× compared to a CPU baseline. Our results indicate that our framework enables faster large-scale optimization on both desktop and resource-constrained devices.

Abstract:
Vision algorithms can be executed directly on the image sensor when implemented on the next-generation sensors known as focal-plane sensor-processor arrays (FPSPs), where every pixel has a processor. FPSPs greatly improve latency, reducing the problems associated with the bottleneck of data transfer from a vision sensor to a processor. FPSPs accelerate vision-based algorithms such as visual-inertial odometry (VIO). However, VIO frameworks suffer from spatial drift due to the vision-based pose estimation, whilst temporal drift arises from the inertial measurements. FPSPs circumvent the spatial drift by operating at a high frame rate to match the high-frequency output of the inertial measurements. In this paper, we present TCB-VIO, a tightly-coupled 6 degrees-of-freedom VIO based on a Multi-State Constraint Kalman Filter (MSCKF), operating at a high frame rate of 250 FPS with IMU measurements obtained at 400 Hz. TCB-VIO outperforms state-of-the-art methods: ROVIO, VINS-Mono, and ORB-SLAM3.

Abstract:
Gaussian Splatting SLAM (GS-SLAM) offers a notable improvement over traditional SLAM methods, enabling photorealistic 3D reconstruction that conventional approaches often struggle to achieve. However, existing GS-SLAM systems perform poorly under persistent and severe motion blur commonly encountered in real-world scenarios, leading to significantly degraded tracking accuracy and compromised 3D reconstruction quality. To address this limitation, we propose EGS-SLAM, a novel GS-SLAM framework that fuses event data with RGB-D inputs to simultaneously reduce motion blur in images and compensate for the sparse and discrete nature of event streams, enabling robust tracking and high-fidelity 3D Gaussian Splatting reconstruction. Specifically, our system explicitly models the camera's continuous trajectory during exposure, supporting event- and blur-aware tracking and mapping on a unified 3D Gaussian Splatting scene. Furthermore, we introduce a learnable camera response function to align the dynamic ranges of events and images, along with a no-event loss to suppress ringing artifacts during reconstruction. We validate our approach on a new dataset comprising synthetic and real-world sequences with significant motion blur. Extensive experimental results demonstrate that EGS-SLAM consistently outperforms existing GS-SLAM systems in both trajectory accuracy and photorealistic 3D Gaussian Splatting reconstruction. The source code will be available at https://github.com/Chensiyu00/EGS-SLAM.

Abstract:
Residual Reinforcement Learning (RL) is a popular approach for adapting pretrained policies by learning a lightweight residual policy that provides corrective actions. While Residual RL is more sample-efficient than finetuning the entire base policy, existing methods struggle with sparse rewards and are designed for deterministic base policies. We propose two improvements to Residual RL that further enhance its sample efficiency and make it suitable for stochastic base policies. First, we leverage uncertainty estimates of the base policy to focus exploration on regions in which the base policy is not confident. Second, we propose a simple modification to off-policy residual learning that allows it to observe base actions and better handle stochastic base policies. We evaluate our method with both Gaussian-based and Diffusion-based stochastic base policies on tasks from Robosuite and D4RL, and compare against state-of-the-art finetuning methods, demo-augmented RL methods, and other Residual RL methods. Our algorithm significantly outperforms existing baselines in a variety of simulation benchmark environments. We also deploy our learned policies in the real world to demonstrate their robustness with zero-shot sim-to-real transfer.

Abstract:
Autonomous vehicles employ semantic segmentation as a foundational component for perception and scene understanding, upon which driving decisions can be informed. Despite their performance, these deep learning models remain susceptible to subtle input perturbations that can cause severe deviation in model output. To enhance algorithmic robustness by examining such vulnerabilities, researchers have investigated adversarial examples, which are visually imperceptible yet can severely degrade model performance. However, traditional attacks produce arbitrary misclassifications that ignore semantic relationships, making the attack less effective. This paper introduces a semantic hierarchy-guided adversarial attack (SHAA), a white-box adversarial attack against semantic segmentation for autonomous driving. By combining semantic hierarchy and adaptive momentum-based updates across the image, SHAA produces semantically nontrivial yet highly effective perturbations. The SHAA method exposes deeper vulnerabilities with a higher attack success rate in semantic segmentation than existing methods, aiding the design of a more resilient perception system for autonomous vehicles.

Abstract:
Expressive motion planning for Aerial Manipulators (AMs) is essential for tackling complex manipulation tasks, yet achieving coupled trajectory planning adaptive to various tasks remains challenging, especially for those requiring aggressive maneuvers. In this work, we propose a novel whole-body integrated motion planning framework for quadrotor-based AMs that leverages flexible waypoint constraints to achieve versatile manipulation capabilities. These waypoint constraints enable the specification of individual position requirements for either the quadrotor or end-effector, while also accommodating higher-order velocity and orientation constraints for complex manipulation tasks. To implement our framework, we exploit spatio-temporal trajectory characteristics and formulate an optimization problem to generate feasible trajectories for both the quadrotor and manipulator while ensuring collision avoidance considering varying robot configurations, dynamic feasibility, and kinematic feasibility. Furthermore, to enhance the maneuverability for specific tasks, we employ Imitation Learning (IL) to facilitate the optimization process to avoid poor local optima. The effectiveness of our framework is validated through comprehensive simulations and real-world experiments, where we successfully demonstrate nine fundamental manipulation skills across various environments.

Abstract:
This paper introduces a novel approach for controlling aerial robots during physical interaction by integrating Admittance Control with Nonlinear Model Predictive Control (NMPC). Unlike existing methods, our technique incorporates the desired impedance dynamics directly into the NMPC prediction model, alongside the robot's dynamics. This allows for the explicit prediction of how the robot's impedance will respond to interaction forces within the prediction horizon. Consequently, our controller effectively tracks the desired impedance behavior during physical interaction while seamlessly transitioning to trajectory tracking in free motion, all while consistently respecting actuator constraints. The efficacy of this method is validated through real-time simulations and experiments involving physical interaction tasks with an aerial robot. Our findings demonstrate that, across most scenarios, our method significantly outperforms the state-of-the-art (which does not predict future impedance state), achieving a reduction in tracking error of up to 90%. Furthermore, the results indicate that our approach enables smoother and safer physical interaction, characterized by reduced oscillations and the absence of the unstable behavior observed with the state-of-the-art method in certain situations.
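
As a rough illustration of what predicting impedance dynamics over a horizon can look like, the sketch below rolls out a one-dimensional mass-spring-damper model, M ë + D ė + K e = f_ext, where e is the deviation from the reference trajectory. The 1-DoF simplification, the parameter values, the semi-implicit Euler discretization, and the function names are all assumptions; the paper's actual prediction model is not given in the abstract.

```python
def admittance_step(e, e_dot, f_ext, dt, mass=1.0, damping=4.0, stiffness=4.0):
    """One semi-implicit Euler step of M*e_ddot + D*e_dot + K*e = f_ext."""
    e_ddot = (f_ext - damping * e_dot - stiffness * e) / mass
    e_dot_next = e_dot + dt * e_ddot
    e_next = e + dt * e_dot_next
    return e_next, e_dot_next

def simulate(force, steps, dt=0.01):
    """Roll the impedance state forward under a constant external force."""
    e, e_dot = 0.0, 0.0
    for _ in range(steps):
        e, e_dot = admittance_step(e, e_dot, force, dt)
    return e

# Under a constant force the deviation settles at force / stiffness.
print(simulate(2.0, 2000))
```

With zero external force the deviation stays at zero, which corresponds to plain trajectory tracking in free motion; a nonzero force yields a compliant offset governed by the chosen stiffness.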

Abstract:
This paper presents a novel approach to forest habitat monitoring using robotics and advanced data analysis techniques. We introduce a quadrupedal robot with LiDAR and onboard cameras to collect detailed data about forest structure and composition. The data is then processed using a combination of data analysis techniques and machine learning algorithms to perform a comprehensive dendrometric and floristic survey. Our approach provides an efficient and accurate method for assessing the ecological health of forest ecosystems. This work contributes to the ongoing efforts in habitat conservation and offers a promising tool for future environmental monitoring tasks.

Abstract:
Existing recursive rigid body dynamics algorithms with low computational complexity are mostly restricted to kinematic trees with external contact constraints or are sensitive to singular cases (e.g., linearly dependent constraints and kinematic singularities), severely impacting their practical usage in existing simulators. This article introduces two original low-complexity recursive algorithms, loop-constrained articulated body algorithm (LCABA) and proxBBO, based on proximal dynamics formulation for forward simulation of mechanisms with loops. These algorithms are derived from first principles using non-serial dynamic programming, exhibit linear complexity in practical scenarios, and are numerically robust to singular cases. They extend the existing constrained articulated body algorithm (constrainedABA) to handle internal loops and the pioneering BBO algorithm from the 1980s to singular cases. Both algorithms have been implemented by leveraging the open-source Pinocchio library, benchmarked in detail, and exhibit state-of-the-art performance for various robot topologies, including over 6x speed-ups compared to existing non-recursive algorithms for high-degree-of-freedom systems with internal loops, such as recent humanoid robots.

Abstract:
Room-level understanding is essential for mobile robots operating in unseen indoor environments. Existing room segmentation methods predominantly assume an offline setting, typically requiring a complete scene reconstruction before producing the final result, which limits their applicability to real-time robotic navigation. In this work, we introduce the novel problem of online 3D room segmentation, where a robot must continuously segment rooms and detect room transitions from streaming sensory observations during exploration. By framing 3D room segmentation in an online setting, we aim to encourage further research in practical, real-time semantic mapping for autonomous agents operating in unknown environments. To properly assess this novel online setting, we also introduce instantaneous evaluation metrics tailored to online room segmentation and transition detection. We also propose ROOM-3D, a real-time unsupervised framework for the problem of online 3D room segmentation. ROOM-3D combines Gaussian-based SLAM with open-vocabulary semantic reasoning to incrementally generate semantically structured 3D room segmentations, as well as transition estimates, without access to future observations or global post-processing. Experiments on the HM3D-Semantics dataset demonstrate that ROOM-3D achieves temporally consistent and accurate segmentation under strict online constraints, while offering state-of-the-art results for the offline experimental evaluation.

Abstract:
This study informs the design of future multi-agent pathfinding (MAPF) and multi-robot motion planning (MRMP) algorithms by guiding choices based on constraint classification for constraint-based search algorithms. We categorize constraints as conservative or aggressive and provide insights into their search behavior, focusing specifically on vanilla Conflict-Based Search (CBS) and Conflict-Based Search with Priorities (CBSw/P). Under a hybrid grid-roadmap representation with varying resolution, we observe that aggressive (priority constraint) formulations tend to solve more instances as agent count or resolution increases, whereas conservative (motion constraint) formulations yield stronger solution quality when both succeed. Findings are synthesized in a decision flowchart, aiding users in selecting suitable constraints. Recommendations extend to Multi-Robot Motion Planning (MRMP), emphasizing the importance of considering topological features alongside problem, solution, and representation features. A comprehensive exploration of the study, including raw data and map performance, is available in our public GitHub Repository: https://GitHub.com/hannahjmlee/constraint-mapf-analysis

Abstract:
Robots navigating indoor environments often have access to architectural plans, which can serve as prior knowledge to enhance their localization and mapping capabilities. While some SLAM algorithms leverage these plans for global localization in real-world environments, they typically overlook a critical challenge: the as-planned architectural designs frequently deviate from the as-built real-world environments. To address this gap, we present a novel algorithm that tightly couples LIDAR-based simultaneous localization and mapping with architectural plans under the presence of deviations. Our method utilizes a multi-layered semantic representation to not only localize the robot, but also to estimate global alignment and structural deviations between as-planned and as-built environments in real time. To validate our approach, we performed experiments in simulated and real datasets demonstrating robustness to structural deviations up to 35 cm and 15°. On average, our method achieves 43% less localization error than baselines in simulated environments, while in real environments, the as-built 3D maps show 7% lower average alignment error.

Abstract:
Tactile texture is vital for robotic manipulation but challenging for camera vision-based observation. To address this, we propose TactileAloha, an integrated tactile-vision robotic system built upon Aloha, with a tactile sensor mounted on the gripper to capture fine-grained texture information and support real-time visualization during teleoperation, facilitating efficient data collection and manipulation. Using data collected from our integrated system, we encode tactile signals with a pre-trained ResNet and fuse them with visual and proprioceptive features. The combined observations are processed by a transformer-based policy with action chunking to predict future actions. We use a weighted loss function during training to emphasize near-future actions, and employ an improved temporal aggregation scheme at deployment to enhance action precision. Experimentally, we introduce two bimanual tasks: zip tie insertion and Velcro fastening, both requiring tactile sensing to perceive the object texture and to align the orientations of two objects using two hands. Our proposed method systematically adapts the generated manipulation sequence based on tactile sensing. Results show that our system, leveraging tactile information, can handle texture-related tasks that camera vision-based methods fail to address. Moreover, our method achieves an average relative improvement of approximately 11.0% compared to the state-of-the-art method with tactile input, demonstrating its performance.
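
A weighted loss that emphasizes near-future actions within a predicted action chunk can be sketched as follows. This is only an illustration: the exponential weighting, the `decay` parameter, the L1 error, and the function names are assumptions, since the abstract does not state the exact weighting scheme.

```python
import numpy as np


def near_future_weights(chunk_len: int, decay: float = 0.1) -> np.ndarray:
    """Exponentially decaying per-step weights, normalized to sum to 1.

    Step 0 (the nearest future action) receives the largest weight.
    The exponential form is an assumption made for illustration.
    """
    steps = np.arange(chunk_len)
    weights = np.exp(-decay * steps)
    return weights / weights.sum()


def weighted_chunk_loss(pred: np.ndarray, target: np.ndarray, decay: float = 0.1) -> float:
    """Weighted L1 loss over an action chunk.

    pred, target: arrays of shape (chunk_len, action_dim).
    """
    weights = near_future_weights(pred.shape[0], decay)
    per_step_error = np.abs(pred - target).mean(axis=1)  # mean L1 error per time step
    return float((weights * per_step_error).sum())
```

With this choice, the same prediction error costs more when it occurs at the first step of the chunk than at the last, which is the qualitative behavior the abstract describes.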

Abstract:
Loco-manipulation, physical interaction with various objects that is concurrently coordinated with locomotion, remains a major challenge for legged robots due to the need for both precise end-effector control and robustness to unmodeled dynamics. While model-based controllers provide precise planning via online optimization, they are limited by model inaccuracies. In contrast, learning-based methods offer robustness, but they struggle with precise modulation of interaction forces. We introduce RAMBO, a hybrid framework that integrates model-based whole-body control within a feedback policy trained with reinforcement learning. The model-based module generates feedforward torques by solving a quadratic program, while the policy provides feedback corrective terms to enhance robustness. We validate our framework on a quadruped robot across a diverse set of real-world loco-manipulation tasks, such as pushing a shopping cart, balancing a plate, and holding soft objects, in both quadrupedal and bipedal walking. Our experiments demonstrate that RAMBO enables precise manipulation capabilities while achieving robust and dynamic locomotion.

Abstract:
Event-based cameras are bio-inspired sensors with pixels that independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle state estimation tasks involving motion blur and high dynamic range (HDR) illumination conditions. However, event-based visual odometry (VO) relying on handcrafted data association (either direct or indirect methods) is still unreliable, especially in field robot applications under low-light HDR conditions, where the dynamic range can be enormous and the signal-to-noise ratio varies spatially and temporally. Leveraging deep neural networks offers new possibilities for overcoming these challenges. In this paper, we propose a learning-based stereo event visual odometry. Building upon Deep Event Visual Odometry (DEVO), our system (called Stereo-DEVO) introduces a novel and efficient static-stereo association strategy for sparse depth estimation with almost no additional computational burden. By integrating it into a tightly coupled bundle adjustment (BA) optimization scheme, and benefiting from the recurrent network's ability to perform accurate optical flow estimation through voxel-based event representations to establish reliable patch associations, our system achieves high-precision pose estimation in metric scale. In contrast to the offline performance of DEVO, our system can process event data of Video Graphics Array (VGA) resolution in real time. Extensive evaluations on multiple public real-world datasets and self-collected data justify our system's versatility, demonstrating superior performance compared to state-of-the-art event-based VO methods. More importantly, our system achieves stable pose estimation even in large-scale nighttime HDR scenarios.

Abstract:
How do humans move? Advances in reinforcement learning (RL) have produced impressive results in capturing human motion using physics-based humanoid control. However, torque-controlled humanoids fail to model key aspects of human motor control such as biomechanical joint constraints and non-linear, overactuated musculotendon control. We present KINESIS, a model-free motion imitation framework that tackles these challenges. KINESIS is trained on 1.8 hours of locomotion data and achieves strong motion imitation performance on unseen trajectories. Through a negative mining approach, KINESIS learns robust locomotion priors that we leverage to deploy the policy on several downstream tasks such as text-to-control, target point reaching, and football penalty kicks. Importantly, KINESIS learns to generate muscle activity patterns that correlate well with human EMG activity. We show that these results scale seamlessly across biomechanical model complexity, demonstrating control of up to 290 muscles. Overall, the physiological plausibility makes KINESIS a promising model for tackling challenging problems in human motor control. Code, videos and benchmarks are available at https://github.com/amathislab/Kinesis.

Abstract:
Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to state-of-the-art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.

Abstract:
Force sensing is a crucial modality for Vision-Language-Action (VLA) frameworks, as it enables fine-grained perception and dexterous manipulation in contact-rich tasks. We present Force-Distilled VLA (FD-VLA), a novel framework that integrates force awareness into contact-rich manipulation without relying on physical force sensors. The core of our approach is a Force Distillation Module (FDM), which distills force by mapping a learnable query token, conditioned on visual observations and robot states, into a predicted force token aligned with the latent representation of actual force signals. During inference, this distilled force token is injected into the pretrained VLM, enabling force-aware reasoning while preserving the integrity of its vision-language semantics. This design provides two key benefits: first, it allows practical deployment across a wide range of robots that lack expensive or fragile force-torque sensors, thereby reducing hardware cost and complexity; second, the FDM introduces an additional force-vision-state fusion prior to the VLM, which improves cross-modal alignment and enhances perception-action robustness in contact-rich scenarios. Surprisingly, our physical experiments show that the distilled force token outperforms direct sensor force measurements as well as other baselines, which highlights the effectiveness of this force-distilled VLA approach.

Abstract:
This work introduces a novel compliant model for running gaits. The model consists of a linear leg stiffness paired with a nonlinear energy regulation term. This new model, termed the quartic model, is shown to reproduce the external dynamics of a running gait. The characteristics of the gait are imposed through parametric conditions which are derived through linearization of the model. The nonlinear nature of the model ensures convergence towards a limit cycle, which makes the model a useful template for the control of legged systems.

Abstract:
Autonomous vehicles equipped with robust onboard perception, localization, and planning still face limitations in occlusion and non-line-of-sight (NLOS) scenarios, where delayed reactions can increase collision risk. We propose CooperDrive, a cooperative perception framework that augments situational awareness and enables earlier, safer driving decisions. CooperDrive offers two key advantages: (i) each vehicle retains its native perception, localization, and planning stack, and (ii) a lightweight object-level sharing and fusion strategy bridges perception and planning. Specifically, CooperDrive reuses detector Bird's-Eye View (BEV) features to estimate accurate vehicle poses without additional heavy encoders, thereby reconstructing BEV representations and feeding the planner with low latency. On the planning side, CooperDrive leverages the expanded object set to anticipate potential conflicts earlier and adjust speed and trajectory proactively, thereby transforming reactive behaviors into predictive and safer driving decisions. Real-world closed-loop tests at occlusion-heavy NLOS intersections demonstrate that CooperDrive increases reaction lead time, minimum time-to-collision (TTC), and stopping margin, while requiring only ~90 kbps bandwidth and maintaining an average end-to-end latency of 89 ms.

Abstract:
Multi-modal large language models (LLMs) are expected to significantly enhance the intelligence of home service robots. However, reliance on cloud processing of raw visual data poses critical privacy risks. To address this problem, we propose a novel two-stage cloud-edge hybrid architecture for robots in domestic environments. This architecture employs a lightweight local LLM to perform sensitive content screening and semantic abstraction before transmitting the data to a more powerful cloud-based LLM for high-level planning and reasoning. Experiments with our end-to-end system demonstrate that it effectively protects a wide range of private data with minimal impact on task success rates. Without modifying cloud models, our approach offers a deployable performance-privacy trade-off for home robots, advancing safe and socially acceptable autonomy.

Abstract:
We present a unifying theoretical result that connects two foundational principles in robotics: the Signorini law for point contacts, which underpins many simulation methods for preventing object interpenetration, and the center of pressure (also known as the zero-moment point), a key concept in optimization-based locomotion control. Our contribution is the planar Signorini condition, a conic complementarity formulation that models general planar contacts between rigid bodies. We prove that this formulation is equivalent to enforcing the punctual Signorini law across an entire contact surface, thereby bridging the gap between discrete and continuous contact models. A geometric interpretation reveals that the framework naturally captures three physical regimes (sticking, separating, and tilting) within a unified complementarity structure. This leads to a principled extension of the classical center of pressure, which we refer to as the extended center of pressure. By establishing this connection, our work provides a mathematically consistent and computationally tractable foundation for handling planar contacts, with implications for both the accurate simulation of contact dynamics and the design of next-generation control and optimization algorithms in locomotion and manipulation.
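
As background, the two classical ingredients named above can be stated in their usual textbook form. The notation is assumed and not taken from the paper: $g_N$ is the normal gap at a contact point, $\lambda_N$ the normal contact force, and $\lambda_N(x)$ a normal pressure distribution over the contact surface $S$. The planar Signorini condition and the extended center of pressure introduced in the paper are not reproduced here.

```latex
% Point-contact (punctual) Signorini law: non-penetration, non-adhesion, complementarity.
\begin{equation}
  0 \le g_N \;\perp\; \lambda_N \ge 0
  \quad\Longleftrightarrow\quad
  g_N \ge 0, \;\; \lambda_N \ge 0, \;\; g_N \, \lambda_N = 0
\end{equation}

% Classical center of pressure for a nonzero total normal force over a planar surface S.
\begin{equation}
  p_{\mathrm{CoP}} = \frac{\int_S x \, \lambda_N(x) \, \mathrm{d}A}{\int_S \lambda_N(x) \, \mathrm{d}A}
\end{equation}
```

The complementarity condition says that a contact force can act only when the gap is closed, and the center of pressure is the pressure-weighted centroid of the contact surface.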

Abstract:
Safe navigation in uncertain environments requires planning methods that integrate risk aversion with active perception. In this work, we present a unified framework that refines a coarse reference path by constructing tail-sensitive risk maps from Average Value-at-Risk statistics on an online-updated 3D Gaussian-splat Radiance Field. These maps enable the generation of locally safe and feasible trajectories. In parallel, we formulate Next-Best-View (NBV) selection as an optimization problem on the SE(3) pose manifold, where Riemannian gradient descent maximizes an expected information gain objective to reduce uncertainty most critical for imminent motion. Our approach advances the state-of-the-art by coupling risk-averse path refinement with NBV planning, while introducing scalable gradient decompositions that support efficient online updates in complex environments. We demonstrate the effectiveness of the proposed framework through extensive computational studies.
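
To make the tail-sensitive statistic concrete, the sketch below estimates the Average Value-at-Risk (also called Conditional Value-at-Risk) of a cost from samples: it averages the costs at or above the empirical quantile at level `alpha`. The function name, the sample-based estimator, and the default level are assumptions for illustration; the paper computes its risk maps on a Gaussian-splat radiance field, which is not modeled here.

```python
import numpy as np


def average_value_at_risk(costs, alpha: float = 0.9) -> float:
    """Empirical AVaR: mean of the samples at or above the alpha-quantile.

    costs: 1-D collection of sampled costs (larger means worse).
    alpha: confidence level in [0, 1); alpha = 0 reduces to the plain mean.
    """
    costs = np.asarray(costs, dtype=float)
    value_at_risk = np.quantile(costs, alpha)  # empirical Value-at-Risk
    tail = costs[costs >= value_at_risk]       # worst-case tail of the samples
    return float(tail.mean())
```

Because only the upper tail is averaged, this statistic is never smaller than the mean cost, so a map built from it penalizes rare but severe outcomes more heavily than an expected-cost map would.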

Abstract:
Construction sites frequently require removing large rocks before excavation or grading can proceed. Human operators typically extract these boulders using only standard digging buckets, avoiding time-consuming tool changes to specialized grippers. This task demands manipulating irregular objects with unknown geometries in harsh outdoor environments where dust, variable lighting, and occlusions hinder perception. The excavator must adapt to varying soil resistance (dragging along hard-packed surfaces or penetrating soft ground) while coordinating multiple hydraulic joints to secure rocks using a shovel. Current autonomous excavation focuses on continuous media (soil, gravel) or uses specialized grippers with detailed geometric planning for discrete objects. These approaches either cannot handle large irregular rocks or require impractical tool changes that interrupt workflow. We train a reinforcement learning policy in simulation using rigid-body dynamics and analytical soil models. The policy processes sparse LiDAR points (just 20 per rock) from vision-based segmentation and proprioceptive feedback to control standard excavator buckets. The learned agent discovers different strategies based on soil resistance: dragging along the surface in hard soil and penetrating directly in soft conditions. Field tests on a 12-ton excavator achieved 70% success across varied rocks (0.4-0.7 m) and soil types, compared to 83% for human operators. This demonstrates that standard construction equipment can learn complex manipulation despite sparse perception and challenging outdoor conditions.

Abstract:
For many complex tasks, multi-finger robot hands are poised to revolutionize how we interact with the world, but reliably grasping objects remains a significant challenge. We focus on the problem of synthesizing grasps for multi-finger robot hands: given a target object's geometry and pose, compute a hand configuration. Existing approaches often struggle to produce reliable grasps that sufficiently constrain object motion, leading to instability under disturbances and failed grasps. A key reason is that during grasp generation, they typically focus on resisting a single wrench, while ignoring the object's potential for adversarial movements, such as escaping. We propose a new grasp-synthesis approach that explicitly captures and leverages the adversarial object motion in grasp generation by formulating the problem as a two-player game. One player controls the robot to generate feasible grasp configurations, while the other adversarially controls the object to seek motions that attempt to escape from the grasp. Simulation experiments on various robot platforms and target objects show that our approach achieves a success rate of 75.78%, up to 19.61% higher than the state-of-the-art baseline. The two-player game mechanism improves the grasping success rate by 27.40% over the method without the game formulation. Our approach requires only 0.28-1.04 seconds on average to generate a grasp configuration, depending on the robot platform, making it suitable for real-world deployment. In real-world experiments, our approach achieves an average success rate of 85.0% on ShadowHand and 87.5% on LeapHand, which confirms its feasibility and effectiveness in real robot setups. Code is publicly available at https://github.com/Neuling-jpg/Game4Grasp.

Abstract:
When exploring complex unknown environments, unmanned aerial vehicles (UAVs) often experience reduced efficiency and robustness due to unevenly distributed occlusions. This paper proposes an efficient hybrid autonomous exploration algorithm that adapts to environmental complexity, enabling effective frontier detection and viewpoint sampling to minimize overall exploration time. We introduce a frontier detection method based on a limited field of view (FOV), along with a unique ID-based frontier management mechanism, which ensures detection completeness while significantly reducing computational and memory overhead. Furthermore, an adaptive sampling strategy incorporating environmental complexity is introduced. By adaptively switching sampling modes and relaxing obstacle-free sphere generation constraints, the method improves both sampling efficiency and visibility evaluation performance. For path planning, a hierarchical planner based on a topological graph is constructed. It jointly optimizes global coverage paths and local frontier information to generate smooth and time-optimal trajectories. Both simulation and real-world experiments validate the advantages of the proposed approach in terms of exploration efficiency, computational overhead, and coverage rate.

Abstract:
Reinforcement learning and sim-to-real transfer have made significant progress in dexterous manipulation. However, progress remains limited by the difficulty of simulating complex contact dynamics and multisensory signals, especially tactile feedback. In this work, we propose DexScrew, a sim-to-real framework that addresses these limitations and demonstrates its effectiveness on nut-bolt fastening and screwdriving with multi-fingered hands. The framework has three stages. First, we train reinforcement learning policies in simulation using simplified object models that lead to the emergence of correct finger gaits. We then use the learned policy as a skill primitive within a teleoperation system to collect real-world demonstrations that contain tactile and proprioceptive information. Finally, we train a behavior cloning policy that incorporates tactile sensing and show that it generalizes to nuts and screwdrivers with diverse geometries. Experiments across both tasks show high task progress ratios compared to direct sim-to-real transfer and robust performance even on unseen object shapes and under external perturbations. Videos and code are available at dexscrew.github.io.

Abstract:
Visual navigation ability is strongly tied to its underlying representation of the world. Unlike classical 3D maps that require globally-consistent geometry, image- or object-relative topological graphs almost entirely do away with geometric understanding. However, this comes at the cost of navigation capability, often limiting it to merely teach-and-repeat. In this work, we propose a novel map representation in the form of pixel-relative connectivity, which is geometrically accurate but does not require global geometric consistency. Inspired by recent progress in 3D grounded image matching, we construct a map from an image sequence through inter-image connectivity based on pixel correspondences in the relative 3D coordinate systems of individual image pairs. We then use this pixel-level graph to perform global path planning by approximating and sparsifying intra-image pixel connectivity. Through this, we derive a "WayPixel Costmap" representation and train a controller conditioned on it to predict a trajectory rollout. We show that this dense pixel-level costmap based on relative geometry is a more accurate conditioning variable for control prediction than its image- and object-level counterparts. This enables a highly capable navigation system, as validated on four types of navigation tasks in the simulator and through real-world demonstrations.

Abstract:
This paper considers the problem of coordinating a group of mobile robots to estimate, in a distributed manner, the parameters of a diffusion model that generates a time-varying spatial field. We assume that each robot can measure the local concentration of a substance continuously released in the environment and base the proposed distributed estimation strategy on an Extended Information Consensus Filter (E-ICF) with a forgetting factor. We then develop a decentralized online motion strategy aimed at minimizing a Gramian-based information metric that improves the E-ICF convergence. Additional constraints, including collision avoidance, are integrated as Control Barrier Functions (CBFs) in a Quadratic Program (QP). Finally, we present statistical comparisons against three baselines which show the improved performance of the proposed method in a range of simulated scenarios, and we also report the results of experiments carried out with quadcopters to demonstrate the practical implementability of the approach and its effectiveness in generating online, collision-free, and informative motions.
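
The way a CBF enters a QP can be illustrated with a deliberately small case. It assumes a single-integrator robot (state derivative equal to the control input), one barrier function h with h >= 0 meaning safe, and a linear class-K term `gamma * h`; all of these, along with the function name, are assumptions for illustration and not the paper's formulation. With a single affine constraint the QP has a closed-form solution, so no solver is needed.

```python
import numpy as np


def cbf_qp_filter(u_nom, grad_h, h: float, gamma: float = 1.0) -> np.ndarray:
    """Solve  min ||u - u_nom||^2  s.t.  grad_h . u + gamma * h >= 0  in closed form.

    u_nom:  nominal (e.g. information-seeking) velocity command.
    grad_h: gradient of the barrier function at the current state (must be nonzero).
    h:      current barrier value; h >= 0 means the state is safe.
    """
    u_nom = np.asarray(u_nom, dtype=float)
    grad_h = np.asarray(grad_h, dtype=float)
    margin = grad_h @ u_nom + gamma * h
    if margin >= 0.0:
        return u_nom  # nominal command already satisfies the CBF constraint
    # Otherwise project u_nom onto the constraint boundary along grad_h.
    return u_nom - (margin / (grad_h @ grad_h)) * grad_h


# Example: robot at p = (1, 0), circular obstacle of radius 0.5 at the origin,
# h(p) = ||p||^2 - r^2 = 0.75 and grad_h = 2p = (2, 0); the nominal command heads at the obstacle.
safe_u = cbf_qp_filter(u_nom=[-2.0, 0.0], grad_h=[2.0, 0.0], h=0.75)
```

In the example the filtered command still moves toward the obstacle but more slowly, landing exactly on the constraint boundary. With several robots and several constraints the same problem would normally be passed to a QP solver rather than solved in closed form.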

Abstract:
Motion planning for autonomous vision-based car racing is a challenging task in robotics. Classical racing systems divide the task into numerous submodules, undermining computational efficiency and leading to error propagation. Previous studies have demonstrated impressive reinforcement learning (RL) results for end-to-end autonomous driving. However, RL exhibits poor scalability on high-dimensional data, such as images, and it is challenging to learn optimal racing behaviors due to a lack of global information about the environments. To address these issues, a two-phase learning paradigm is proposed in this work to train a vision-based racing policy. First, RL trains a teacher policy that integrates progress maximization with collision avoidance in the reward function and utilizes privileged information about the racetrack to achieve high-performance racing. Then, a student policy, relying only on an ego-centric depth camera for perception, is trained by distilling racing knowledge from the teacher policy. The student policy achieves high-speed driving, a high success rate, and smooth control in vision-based racing games. The proposed approach is validated in simulation and on a real-world 1/10-scale race car, showing that the approach outperforms previous model-based and learning-based baselines.

Abstract:
The increasing use of drones in human-centric applications highlights the need for designs that can survive collisions and recover rapidly, minimizing risks to both humans and the environment. We present HoLoArm, a quadrotor with compliant arms inspired by the nodus structure of dragonfly wings. This design provides natural flexibility and resilience while preserving flight stability, which is further reinforced by the integration of a Reinforcement Learning (RL) control policy that enhances both recovery and hovering performance. Experimental results demonstrate that HoLoArm can passively deform in any direction, including the axial one, and recover within 0.3-0.6 s depending on the direction and level of the impact. The drone can survive collisions at speeds up to 7.6 m/s and carry a 540 g payload while maintaining stable flight. This work contributes to the morphological design of soft aerial robots with high agility and reliable safety, enabling operation in cluttered and human-shared environments, and lays the groundwork for future fully soft drones that integrate compliant structures with intelligent control.

Abstract:
Diffusion models have been extensively leveraged for learning robot skills from demonstrations. These policies are conditioned on several observational modalities such as proprioception, vision, and tactile sensing. However, observational modalities have varying levels of influence for different tasks that diffusion policies fail to capture. In this work, we propose 'Factorized Diffusion Policies', abbreviated as FDP, a novel policy formulation that enables observational modalities to have differing influence on the action diffusion process by design. This results in learning policies where certain observation modalities can be prioritized over the others such as vision>tactile or proprioception>vision. FDP achieves modality prioritization by factorizing the observational conditioning for the diffusion process, resulting in more performant and robust policies. Our factored approach shows strong performance improvements in low-data regimes with 15% absolute improvement in success rate on several simulated benchmarks when compared to a standard diffusion policy that jointly conditions on all input modalities. Moreover, our benchmark and real-world experiments show that factored policies are naturally more robust with 40% higher absolute success rate across several visuomotor tasks under distribution shifts such as visual distractors or camera occlusions, where existing diffusion policies fail catastrophically. FDP thus offers a safer and more robust alternative to standard diffusion policies for real-world deployment. Videos are available at https://fdp-policy.github.io/fdp-policy/.

Abstract:
Every robot built to date was predesigned by an external process, prior to deployment. Here we show a robot that actively participates in its own design during its lifetime. Starting from a randomly assembled body, and using only proprioceptive feedback, the robot dynamically "sculpts" itself into a new design through kinematic self-destruction: identifying redundant links within its body that inhibit its locomotion, and then thrashing those links against the surface until they break at the joint and fall off the body. It does so using a single autoregressive sequence model, a universal controller that learns in simulation when and how to simplify a robot's body through self-destruction and then adaptively controls the reduced morphology. The optimized policy successfully transfers to reality and generalizes to previously unseen kinematic trees, generating forward locomotion that is more effective than otherwise equivalent policies that randomly remove links or cannot remove any. This suggests that self-designing robots may be more successful than predesigned robots in some cases, and that kinematic self-destruction, though reductive and irreversible, could provide a general adaptive strategy for a wide range of robots.

Abstract:
We address the challenge of reliable and efficient interaction in autonomous multi-agent systems, where agents must balance long-term strategic objectives with short-term dynamic adaptation. We propose context-triggered contingency games, a novel integration of strategic games derived from temporal logic specifications with dynamic contingency games solved in real time. Our two-layered architecture leverages strategy templates to guarantee satisfaction of high-level objectives, while a new factor-graph-based solver enables scalable, real-time model predictive control of dynamic interactions. The resulting framework ensures both safety and progress in uncertain, interactive environments. We validate our approach through simulations and hardware experiments in autonomous driving and robotic navigation, demonstrating efficient, reliable, and adaptive multi-agent interaction.

Abstract:
Generating collision-free robot motion is crucial for safe applications in real-world settings. This requires an accurate model of all obstacle shapes within the constrained robot cell, which is particularly challenging and time-consuming. The difficulty is heightened in flexible production lines, where the environment model must be updated each time the robot cell is modified. Furthermore, sensor-based methods often necessitate costly hardware and calibration procedures, and can be influenced by environmental factors (e.g., light conditions or reflections). To address these challenges, we present a novel data-driven approach to modeling a cluttered workspace, leveraging solely the robot's internal joint encoders to capture exploratory motions. By computing the corresponding swept volume, we generate a (conservative) mesh of the environment that is subsequently used for collision checking within established path planning and control methods. Our method significantly reduces the complexity and cost of classical environment modeling by removing the need for CAD files and external sensors. We validate the approach with the KUKA LBR iisy collaborative robot in a pick-and-place scenario. In less than three minutes of exploratory robot motions and less than four additional minutes of computation time, we obtain an accurate model that enables collision-free motions. Our approach is intuitive and easy to use, making it accessible to users without specialized technical knowledge.

Abstract:
Model Predictive Control has emerged as a popular tool for robots to generate complex motions. However, the real-time requirement has limited the use of hard constraints and large preview horizons, which are necessary to ensure safety and stability. In practice, practitioners have to carefully design cost functions that can imitate an infinite horizon formulation, which is tedious and often results in local minima. In this work, we study how to approximate the infinite horizon value function of constrained optimal control problems with neural networks using value iteration and trajectory optimization. Furthermore, we experimentally demonstrate how using this value function approximation as a terminal cost provides global stability to the model predictive controller. The approach is validated on two toy problems and a real-world scenario with online obstacle avoidance on an industrial manipulator, where the value function is conditioned on the goal and the obstacle.

Abstract:
3D Gaussian Splatting (3DGS) has recently emerged as a powerful technique for representing 3D scenes. Its high-fidelity rendering quality and speed have driven its rapid adoption in many applications. Among them, Visual Simultaneous Localization and Mapping (VSLAM) is the most prominent application, as it requires real-time simultaneous mapping and position tracking of navigating objects. However, from our comprehensive study, we observed a fundamental hurdle in directly applying the current 3DGS technique to VSLAM, which we define as the scale adaptation problem. The scale adaptation problem refers to the inability of existing 3DGS-based SLAM methods to address varying scales, specifically the extent of camera pose difference from the perspective of tracking, and environmental size in terms of mapping and the addition of new 3D Gaussians. To overcome this limitation, we propose SAGA-SLAM, the first scale-adaptive RGB-D Dense SLAM framework based on 3DGS. We optimize the tracking and mapping stages robustly over various scales by utilizing the Polyak step size and momentum. Additionally, we present a Gaussian fission method to address the scale problem during the addition of 3D Gaussians. Experiments show that our method achieves state-of-the-art results robustly on both large and small scales, such as KITTI, Replica, and TUM-RGBD. By adapting without the need for hyperparameter tuning, our method demonstrates both superior performance and practical applicability.
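
To illustrate why a Polyak step size can remove scale-dependent tuning, the following is a minimal sketch of the classic Polyak rule on a toy quadratic, not the SAGA-SLAM optimizer itself. The function names, the toy objective, and the assumption that the optimal value `f_star` is known are all illustrative; the momentum term mentioned in the abstract is omitted.

```python
import numpy as np

def polyak_gradient_descent(f, grad, x0, f_star=0.0, iters=20):
    """Gradient descent with the classic Polyak step (f(x) - f_star) / ||g||^2.

    Assumes the optimal value f_star is known, which holds for the toy
    objective below but is an assumption in general.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        gap = f(x) - f_star
        g2 = float(g @ g)
        if gap <= 0.0 or g2 == 0.0:
            break  # already optimal or zero gradient
        x = x - (gap / g2) * g
    return x

def make_quadratic(scale):
    """Toy objective 0.5 * scale * ||x||^2; 'scale' mimics a change of scene scale."""
    f = lambda x: 0.5 * scale * float(x @ x)
    grad = lambda x: scale * x
    return f, grad

# The same code, with no learning-rate tuning, is run on two very different scales.
small = polyak_gradient_descent(*make_quadratic(1e-3), x0=[1.0, -2.0])
large = polyak_gradient_descent(*make_quadratic(1e3), x0=[1.0, -2.0])
```

On this toy problem the step size rescales itself with the objective, so both runs follow essentially the same trajectory; a fixed learning rate would need to differ by six orders of magnitude between the two cases.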

Abstract:
Ensuring safe and efficient operation of collaborative robots in human environments is challenging, especially in dynamic settings where both obstacle motion and tasks change over time. Current robot controllers typically assume full visibility and fixed tools, which can lead to collisions or overly conservative behavior. In our work, we introduce a tool-aware collision avoidance system that adjusts in real time to different tool sizes and modes of tool-environment interaction. Using a learned perception model, our system filters out robot and tool components from the point cloud, reasons about occluded areas, and predicts collisions under partial observability. We then use a control policy trained via constrained reinforcement learning to produce smooth avoidance maneuvers in under 10 milliseconds. In simulated and real-world tests, our approach outperforms traditional approaches (APF, MPPI) in dynamic environments, while maintaining sub-millimeter accuracy. Moreover, our system operates with approximately 60% lower computational overhead compared to a state-of-the-art GPU-based planner. Our approach provides modular, efficient, and effective collision avoidance for robots operating in dynamic environments. We integrate our method into a collaborative robot application and demonstrate its practical use for safe and responsive operation.

Abstract:
4D radar super-resolution, which aims to reconstruct sparse and noisy point clouds into dense and geometrically consistent representations, is a foundational problem in autonomous perception. However, existing methods often suffer from high training cost or rely on complex diffusion-based sampling, resulting in high inference latency and poor generalization, making it difficult to balance accuracy and efficiency. To address these limitations, we propose MSDNet, a multi-stage distillation framework that efficiently transfers dense LiDAR priors to 4D radar features to achieve both high reconstruction quality and computational efficiency. The first stage performs reconstruction-guided feature distillation (RGFD), aligning and densifying the student's features through feature reconstruction. In the second stage, we propose diffusion-guided feature distillation (DGFD), which treats the stage-one distilled features as a noisy version of the teacher's representations and refines them via a lightweight diffusion network. Furthermore, we introduce a noise adapter that adaptively aligns the noise level of the feature with a predefined diffusion timestep, enabling more precise denoising. Extensive experiments on the VoD and in-house datasets demonstrate that MSDNet achieves both high-fidelity reconstruction and low-latency inference in the task of 4D radar point cloud super-resolution, and consistently improves performance on downstream tasks.

Abstract:
Soft robotic hands promise to provide compliant and safe interaction with objects and environments. However, designing soft hands to be both compliant and functional across diverse use cases remains challenging. Although co-design of hardware and control better couples morphology to behavior, the resulting search space is high-dimensional, and even simulation-based evaluation is computationally expensive. In this paper, we propose a Cross-Entropy Method with Reward Model (CEM-RM) framework that efficiently optimizes tendon-driven soft robotic hands based on a teleoperation control policy, reducing design evaluations by more than half compared to pure optimization while learning a distribution of optimized hand designs from pre-collected teleoperation data. We derive a design space for a soft robotic hand composed of flexural soft fingers and implement parallelized training in simulation. The optimized hands are then 3D-printed and deployed in the real world using both teleoperation data and real-time teleoperation. Experiments in both simulation and hardware demonstrate that our optimized design significantly outperforms baseline hands in grasping success rates across a diverse set of challenging objects.
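
As a rough illustration of the optimization loop that CEM-RM builds on, here is a minimal sketch of the plain Cross-Entropy Method over a continuous design vector. It is not the paper's implementation: the reward model that would replace part of the expensive evaluations is omitted, and the toy score function, the `target` design, and all parameter values are hypothetical.

```python
import numpy as np

def cross_entropy_method(score, dim, iters=30, pop=128, elite_frac=0.125, seed=0):
    """Plain CEM: sample designs, keep the best-scoring elites, refit a Gaussian."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))
        scores = np.array([score(s) for s in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]  # highest scores
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # small floor keeps sampling well-defined
    return mean, std

# Hypothetical stand-in for a simulated grasp-success evaluation of a hand design.
target = np.array([0.3, -0.7, 1.2])
toy_score = lambda design: -float(np.sum((design - target) ** 2))

best_mean, best_std = cross_entropy_method(toy_score, dim=3)
```

The returned mean and standard deviation describe a distribution over designs rather than a single design, which matches the abstract's framing of learning a distribution of optimized hands. A learned reward model would plausibly be used to score some samples cheaply instead of calling the simulator for every one.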

Abstract:
In this paper, we tackle real-time formation trajectory planning for collaborative object transportation in complex environments using a team of nonholonomic robots and a human. The object is transported in a deformable sheet, and robots should follow the human's lead while autonomously avoiding obstacles. By including a human in the formation, we leverage their adaptability and decision-making to improve transportation. However, it can be difficult for a human to predict how autonomous robots will behave in complex situations, such as when the formation must cross an obstacle, i.e., where the object is transported above it. This could lead to human decisions that compromise safety. To overcome these challenges, we introduce a multi-modal formation planning framework. By default, the human leads the formation, and the robots plan to remain in the same homotopy class as the human to avoid collisions. If obstacle crossing is necessary, the robots take the lead of the formation, and human motion is constrained to a feasible region projected visually in front of them. We demonstrate the efficacy of our framework in simulation and on hardware.

Abstract:
The main challenge in lifelong imitation learning lies in balancing the mitigation of catastrophic forgetting of previous skills with maintaining sufficient capacity for acquiring new ones. However, current approaches typically address these aspects in isolation, overlooking their internal correlation in lifelong skill acquisition. We address this limitation with a unified framework named Tokenized Skill Scaling (T2S). Specifically, by tokenizing the model parameters, the linear parameter mapping of the traditional transformer is transformed into cross-attention between input and learnable tokens, thereby enhancing model scalability through the easy extension of new tokens. Additionally, we introduce language-guided skill scaling to transfer knowledge across tasks efficiently and avoid linearly growing parameters. Extensive experiments across diverse tasks demonstrate that T2S: 1) effectively prevents catastrophic forgetting (achieving an average NBT of 1.0% across the three LIBERO task suites), 2) excels in new skill scaling with minimal increases in trainable parameters (needing only 8.0% trainable tokens on average across lifelong tasks), and 3) enables efficient knowledge transfer between tasks (achieving an average FWT of 77.7% across the three LIBERO task suites), offering a promising solution for lifelong imitation learning.
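
The idea of replacing a linear parameter mapping with cross-attention over learnable tokens can be sketched as follows. This is an illustrative sketch under assumptions, not the T2S architecture: the softmax normalization, the token shapes, and the function name `token_layer` are hypothetical, and the language-guided selection of tokens is omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_layer(x, key_tokens, value_tokens):
    """Stand-in for a linear layer: inputs attend over learnable parameter tokens.

    x: (n, d_in), key_tokens: (m, d_in), value_tokens: (m, d_out) -> (n, d_out)
    """
    scores = x @ key_tokens.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ value_tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
keys, values = rng.normal(size=(16, 8)), rng.normal(size=(16, 5))
y_old = token_layer(x, keys, values)

# Scaling to a new skill: append tokens; the old tokens could be kept frozen.
new_keys, new_values = rng.normal(size=(4, 8)), rng.normal(size=(4, 5))
y_new = token_layer(x, np.vstack([keys, new_keys]), np.vstack([values, new_values]))
```

Because the number of tokens `m` does not appear in the input or output shapes, capacity can be extended by appending rows without changing the interface of the layer, which is the scalability property the abstract refers to.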

Abstract:
While Vision-Language-Action (VLA) models map visual inputs and language instructions directly to robot actions, they often rely on costly hardware and struggle in novel or cluttered scenes. We introduce EverydayVLA, a 6-DOF manipulator that can be assembled for 300, capable of modest payloads and workspaces. A single unified model jointly outputs discrete and continuous actions, and our adaptive-horizon ensembler monitors motion uncertainty to trigger on-the-fly replanning for safe, reliable operation. On LIBERO, EverydayVLA matches state-of-the-art success rates, and in real-world tests it outperforms prior methods by 49% in-distribution and 34.9% out-of-distribution. By combining a state-of-the-art VLA with cost-effective hardware, EverydayVLA democratizes access to a robotic foundation model, and paves the way for economical use in homes and research labs alike.

Abstract:
This study addresses a flexible holding tool for robotic disassembly. We propose a shell-type soft jig that securely and universally holds objects, mitigating the risk of component damage and adapting to diverse shapes while enabling soft fixation that is robust to recognition, planning, and control errors. The balloon-based holding mechanism ensures proper alignment and stable holding performance, thereby reducing the need for dedicated jig design, highly accurate perception, precise grasping, and finely tuned trajectory planning that are typically required with conventional fixtures. Our experimental results demonstrate the practical feasibility of the proposed jig through performance comparisons with a vise and a jamming-gripper-inspired soft jig. Tests on ten different objects further showed representative successes and failures, clarifying the jig's limitations and outlook.

Abstract:
We present an action sequence transfer system that adaptively transfers user action sequences across different target spaces. Given an input action sequence from a source space and scene graph representations of both the source and target environments, our system predicts a corresponding action sequence in the target space by adapting to the spatial and object constraints of the new environment. To achieve this, we leverage multi-level representations of user activity to generalize actions at varying levels of abstraction. To demonstrate our system, we collect a new scene graph-based dataset derived from the Ego4D GoalStep dataset for evaluation. Results indicate that our system can generate valid action sequences even between spaces with drastically different object configurations.

Abstract:
In recent years, parametric representations of point clouds have been widely applied in tasks such as memory-efficient mapping and multi-robot collaboration. Highly adaptive models, like spline surfaces or quadrics, are computationally expensive in detection or fitting. In contrast, real-time methods, such as Gaussian mixture models or planes, have low degrees of freedom, making high accuracy with few primitives difficult. To tackle this problem, a multi-model parametric representation with real-time surface detection and fitting is proposed. Specifically, the Gaussian mixture model is first employed to segment the point cloud into multiple clusters. Then, flat clusters are selected and merged into planes or curved surfaces. Planes can be easily fitted and delimited by a 2D voxel-based boundary description method. Surfaces with curvature are fitted by B-spline surfaces and the same boundary description method is employed. Through evaluations on multiple public datasets, the proposed surface detection exhibits greater robustness than the state-of-the-art approach, with 3.78 times improvement in efficiency. Meanwhile, this representation achieves a 2-fold gain in accuracy over Gaussian mixture models, operating at 36.4 fps on a low-power onboard computer.
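
The step of selecting flat clusters and fitting planes can be illustrated with a least-squares plane fit via singular value decomposition. This is a generic sketch rather than the paper's detection pipeline: the flatness measure (ratio of smallest to largest singular value), the synthetic cluster, and the function name are assumptions, and the Gaussian mixture segmentation and boundary description are not shown.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through an (n, 3) cluster.

    Returns the centroid, the unit normal (direction of least variance),
    and a flatness score: smallest / largest singular value (0 = perfectly flat).
    """
    centroid = points.mean(axis=0)
    _, s, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1]
    flatness = s[-1] / (s[0] + 1e-12)
    return centroid, normal, flatness

# Synthetic cluster sampled from z = 0.5x - 0.2y + 1 with 1 mm noise.
rng = np.random.default_rng(0)
xy = rng.uniform(-1.0, 1.0, size=(200, 2))
z = 0.5 * xy[:, 0] - 0.2 * xy[:, 1] + 1.0 + rng.normal(scale=1e-3, size=200)
cluster = np.column_stack([xy, z])

centroid, normal, flatness = fit_plane(cluster)
```

A cluster whose flatness score falls below a chosen threshold would be treated as planar; clusters above the threshold would be candidates for a curved-surface fit such as the B-spline surfaces mentioned in the abstract.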

Abstract:
Accurate estimation of wheel-terrain interaction parameters is important for efficient navigation of Unmanned Ground Vehicles in unstructured outdoor environments. In this paper, we propose a hybrid data-driven and model-based method to estimate a priori motion resistance, a terrain-specific parameter representing the force opposing wheel motion, which is largely influenced by terrain class and geometry. The proposed method relies on learning motion resistance from proprioceptive feedback collected on reference terrains. This learned model is then transferred to new environments, where motion resistance is inferred from exteroceptive observation, including LiDAR and cameras, leveraging terrain geometry and class information. To capture uncertainty from terrain roughness and sensor noise, we evaluate two probabilistic models predicting motion resistance distributions: a Gaussian-MLP and a Gaussian Process Regressor. Their robustness to domain shifts is assessed by measuring performance degradation as the target diverges from the source domain. Extensive off-road field experiments validate the method's effectiveness, demonstrating accurate prediction of motion resistance and its potential for deployment.
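
A model that predicts a distribution rather than a point estimate is commonly trained with a Gaussian negative log-likelihood. The sketch below shows that loss in isolation, assuming the network outputs a mean and a log-variance per sample; whether the paper's Gaussian-MLP uses exactly this parameterization is an assumption, and the numbers are made up for illustration.

```python
import numpy as np

def gaussian_nll(y, mean, log_var):
    """Mean negative log-likelihood of y under N(mean, exp(log_var))."""
    var = np.exp(log_var)
    return 0.5 * float(np.mean(np.log(2.0 * np.pi) + log_var + (y - mean) ** 2 / var))

# Hypothetical motion-resistance target and a prediction that is off by 1.0.
y = np.array([1.0])
pred_mean = np.array([0.0])

calibrated = gaussian_nll(y, pred_mean, log_var=np.array([0.0]))      # admits uncertainty
overconfident = gaussian_nll(y, pred_mean, log_var=np.array([-4.0]))  # claims tiny variance
```

The loss penalizes an overconfident wrong prediction much more than one that reports a larger variance, which is what lets the predicted variance absorb effects such as terrain roughness and sensor noise.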

Abstract:
Offline reinforcement learning (RL) holds significant potential for crowd robot navigation in human-robot coexistence applications. However, the inherent complexity of pedestrian motion renders the design of effective reward functions for promoting socially compliant robot behaviors a persistent challenge. This paper proposes a Social Preference Learning for Crowd Robot Navigation (SPLC) algorithm to eliminate the need for detailed reward design. Its core innovation lies in the introduction of a social preference feedback mechanism to automatically generate preference data through principled preference evaluation criteria. By explicitly accounting for the intricacies of pedestrian dynamics, the pipeline mitigates the reward bias and facilitates the systematic quantification of broad social norms, thereby fostering socially compliant behaviors. Extensive experiments integrating SPLC with offline RL methods demonstrate consistent improvements over state-of-the-art baselines across standard performance metrics. Furthermore, real-world experiments on the TurtleBot4 validate the effectiveness of SPLC in practical human-robot coexistence settings. Our code and video demos are available at https://github.com/sklus949/SPLC.

Abstract:
Image-to-point-cloud registration (I2P) is a fundamental task in robotic applications such as manipulation, grasping and localization. Existing deep learning-based I2P methods seek to align image and point cloud features in a learned representation space to establish correspondences, and have achieved promising results. However, due to issues such as scale ambiguity, many geometrically inconsistent outlier correspondences persist in the feature space. To address this limitation, we propose Angle-I2P, an outlier rejection network that leverages angle-consistent geometric constraints and hierarchical attention. First, we design a scale-invariant, cross-modality geometric constraint based on angular consistency. This explicit geometric constraint guides the model in distinguishing inliers from outliers. Furthermore, we propose a global-to-local hierarchical attention mechanism that effectively filters out geometrically inconsistent matches under rigid transformation, thereby improving the Inlier Ratio (IR) and Registration Recall (RR). Experimental results demonstrate that our method achieves state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset, with consistent improvements across all benchmarks.

Abstract:
Guiding Vector Fields (GVFs) are a powerful tool for robotic path following. However, classical methods assume smooth, ordered curves and fail when paths are unordered, multi-branch, or generated by probabilistic models. We propose a unified framework, termed the Score-Induced Guiding Vector Field (SGVF), which leverages score-based generative modeling to construct vector fields directly from data distributions. SGVF learns tangent fields from point clouds with unit-norm, orthogonality, and directional-consistency losses, ensuring geometric fidelity and control feasibility. This approach removes the reliance on ad-hoc path segmentation and enables guidance along complex topologies such as branching and pseudo-manifolds. The study establishes a correspondence between score vanishing in diffusion models and GVF singularities and highlights representational capacity near sharp path curvatures. Experiments on robotic navigation in planar environments demonstrate that SGVF achieves reliable path following in scenarios where classical GVFs fail, underscoring its potential as a bridge between generative modeling and geometric control. Code and experiment video are available at https://github.com/czr-gif/Guiding-Vector-Field-Generation-via-Score-based-Diffusion-Model.

Abstract:
Tactile sensors have long been valued for their perceptual capabilities, offering rich insights into the otherwise hidden interface between the robot and grasped objects. Yet their inherent compliance, a key driver of force-rich interactions, remains underexplored. The central challenge is to capture the complex, nonlinear dynamics introduced by these passive compliant elements. Here, we present a computationally efficient non-holonomic hydroelastic model that accurately models path-dependent contact force distributions and dynamic surface area variations. Our insight is to extend the object's state space, explicitly incorporating the distributed forces generated by the compliant sensor. Our differentiable formulation not only accounts for path-dependent behavior but also enables gradient-based trajectory optimization, seamlessly integrating with high-resolution tactile feedback. We demonstrate the effectiveness of our approach across a range of simulated and real-world experiments and demonstrate the importance of modeling the path dependence of sensor dynamics.

Abstract:
Flow-matching-based policies have recently emerged as a promising approach for learning-based robot manipulation, offering significant acceleration in action sampling compared to diffusion-based policies. However, conventional flow-matching methods struggle with multi-modality, often collapsing to averaged or ambiguous behaviors in complex manipulation tasks. To address this, we propose the Variational Flow-Matching Policy (VFP), which introduces a variational latent prior for mode-aware action generation and effectively captures both task-level and trajectory-level multi-modality. VFP further incorporates Kantorovich Optimal Transport (K-OT) for distribution-level alignment and utilizes a Mixture-of-Experts (MoE) decoder for mode specialization and efficient inference. We comprehensively evaluate VFP on 41 simulated tasks and 3 real-robot tasks, demonstrating its effectiveness and sampling efficiency in both simulated and real-world settings. Results show that VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines in simulation, and further outperforms them on real-robot tasks, while still maintaining fast inference and a compact model size. More details are available on our project page: https://sites.google.com/view/varfp/

Abstract:
Representing and understanding 3D environments in a structured manner is crucial for autonomous agents to navigate and reason about their surroundings. While traditional Simultaneous Localization and Mapping (SLAM) methods generate metric reconstructions and can be extended to metric-semantic mapping, they lack a higher level of abstraction and relational reasoning. To address this gap, 3D scene graphs have emerged as a powerful representation for capturing hierarchical structures and object relationships. In this work, we propose an enhanced hierarchical 3D scene graph that integrates open-vocabulary features across multiple abstraction levels and supports object-relational reasoning. Our approach leverages a Vision Language Model (VLM) to infer semantic relationships. Notably, we introduce a task reasoning module that combines Large Language Models (LLMs) and a VLM to interpret the scene graph's semantic and relational information, enabling agents to reason about tasks and interact with their environment more intelligently. We validate our method by deploying it on a quadruped robot in multiple environments and tasks, highlighting its ability to reason about them.

Abstract:
This paper introduces SeaViper, a soft extendable aquatic vibrating intelligent piezoelectric robot that extends previously developed land-based systems into the aquatic domain. The aquatic domain introduces new fundamental mechanisms of motion as well as new robot-platform requirements. To study these, we present the mechanical and electrical design of SeaViper and investigate the drive-frequency response of three prototype configurations, with energy efficiency as a key design consideration. The prototypes achieve a peak velocity of up to 33.2 cm/s (1.38 body lengths per second) with an estimated power of 2 W and a minimum cost of transport (CoT) of 3.9, significantly improving upon the performance of the prior land prototype. Measured thrust data combined with current-sense analysis enable estimation of useful mechanical output and end-to-end electromechanical efficiency. Velocity and CoT are benchmarked against both other robotic swimmers and aquatic animals, highlighting the general gap to biological performance. To further advance the sheet-like, untethered design, the aquatic prototype integrates a microcontroller, wireless communication, sensing, and on-board battery charging circuitry, paving the way for future bio-inspired morphologies at the air-water interface with advanced driving patterns.
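
For readers unfamiliar with the metric, the dimensionless cost of transport is usually defined as power divided by weight times speed. The sketch below applies that standard definition and recovers the body length implied by the two reported speed figures. The robot mass is not given in the abstract, so `hypothetical_mass_kg` is a made-up value, and the abstract does not state that the minimum CoT occurs at peak velocity.

```python
G = 9.81  # gravitational acceleration in m/s^2

def cost_of_transport(power_w, mass_kg, speed_m_s):
    """Dimensionless cost of transport: P / (m * g * v)."""
    return power_w / (mass_kg * G * speed_m_s)

def body_length_cm(speed_cm_s, speed_bl_s):
    """Body length implied by one speed reported in cm/s and in body lengths/s."""
    return speed_cm_s / speed_bl_s

# Implied body length from the reported 33.2 cm/s and 1.38 body lengths per second.
implied_length_cm = body_length_cm(33.2, 1.38)

# Illustrative only: the mass below is an assumption, not a value from the paper.
hypothetical_mass_kg = 0.15
example_cot = cost_of_transport(power_w=2.0, mass_kg=hypothetical_mass_kg, speed_m_s=0.332)
```

Lower values indicate more efficient locomotion, which is why the abstract compares CoT against both robotic swimmers and aquatic animals.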

Abstract:
Despite recent advances in robust locomotion, bipedal robots operating in the real world remain at risk of falling. While most research focuses on preventing such events, we instead concentrate on the phenomenon of falling itself. Specifically, we aim to reduce physical damage to the robot while providing users with control over the robot's end pose. To this end, we propose a robot-agnostic reward function that balances the achievement of a desired end pose with impact minimization and the protection of critical robot parts during reinforcement learning. To make the policy robust to a broad range of initial falling conditions, and to enable the specification of an arbitrary and unseen end pose at inference time, we introduce a simulation-based sampling strategy of initial and end poses. Through simulated and real-world experiments, our work demonstrates that even bipedal robots can perform controlled, soft falls.

Abstract:
Surface manipulation tasks require robots to generate trajectories that comprehensively cover complex 3D surfaces while maintaining precise end-effector poses. Existing ergodic trajectory optimization (TO) methods demonstrate success in coverage tasks, while struggling with point-cloud targets due to the nonconvex optimization landscapes and the inadequate handling of SE(3) constraints in sampling-as-optimization (SAO) techniques. In this work, we introduce a preconditioned SE(3) Stein Variational Gradient Descent (SVGD) approach for SAO ergodic trajectory generation. Our proposed approach comprises multiple innovations. First, we reformulate point-cloud ergodic coverage as a manifold-aware sampling problem. Second, we derive SE(3)-specific SVGD particle updates, and, third, we develop a preconditioner to accelerate TO convergence. Our sampling-based framework consistently identifies superior local optima compared to strong optimization-based and SAO baselines while preserving the SE(3) geometric structure. Experiments on a 3D point-cloud surface coverage benchmark and robotic surface drawing tasks demonstrate that our method achieves superior coverage quality with tractable computation in our setting relative to existing TO and SAO approaches, and is validated in real-world robot experiments.

Abstract:
Design for Robotic Assembly (DfRA) remains largely dependent on manual planning and heuristic simulation, limiting scalability and robustness in complex industrial settings. Although large language models (LLMs) show promise for semantic reasoning and task planning, most approaches remain tightly coupled to pre-built simulators that assume an accurate world model. We introduce Iterative Design for Robotic Assembly (IDfRA), a closed-loop framework that combines an LLM for plan generation with a vision-language model (VLM) for execution assessment. Given a target structure and a partial environmental signature, the LLM proposes an assembly plan, the robot executes it once at test time, and the VLM evaluates the resulting state to provide feedback for replanning. Through this iterative planning-execution-verification loop, the system progressively improves semantic fidelity and physical feasibility. Crucially, IDfRA does not require an accurate a priori world model before deployment. Instead, physical constraints are discovered online through interaction, enabling adaptation to under-specified environments. Empirical evaluation demonstrates that IDfRA attains 73.3% top-1 accuracy in semantic recognisability, surpassing the baseline on this metric. Moreover, the resulting assembly plans exhibit robust physical feasibility, achieving an overall 86.9% construction success rate, with design quality improving across iterations, albeit not always monotonically. Pairwise human evaluation further corroborates the advantages of IDfRA relative to alternative approaches. By integrating self-verification with context-aware adaptation, the framework evidences strong potential for deployment in unstructured manufacturing scenarios.

Abstract:
Towards human-robot coexistence, socially aware navigation is significant for mobile robots. Yet existing studies in this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions to task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, which surpasses the most competitive baseline by 63%, with most of the improvements observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/

Abstract:
Generative navigation policies have made rapid progress in improving end-to-end learned navigation. Despite their promising results, this paradigm has two structural problems. First, the sampled trajectories exist in an abstract, unscaled space without metric grounding. Second, the control strategy discards the full path, instead moving directly towards a single waypoint. This leads to short-sighted and unsafe actions, moving the robot towards obstacles that a complete and correctly scaled path would circumvent. To address these issues, we propose MetricNet, an effective add-on for generative navigation that predicts the metric distance between waypoints, grounding policy outputs in metric coordinates. We evaluate our method in simulation with a new benchmarking framework and show that executing MetricNet-scaled waypoints significantly improves both navigation and exploration performance. Beyond simulation, we further validate our approach in real-world experiments. Finally, we propose MetricNav, which integrates MetricNet into a navigation policy to guide the robot away from obstacles while still moving towards the goal.

Abstract:
Obstructed terrains, such as boulders, crevices, and rubble, limit the locomotion of wheeled mobile robots. Transformable wheel-to-leg designs enable better traversal; small wheels provide driving efficiency and deployable legs enable stepping over obstructions. We propose an approach to such leg deployment utilizing a novel tape spring truss structure. It achieves large shape changes -- demonstrating transformation ratios from 3.77 to 5.38 between wheel radius and leg length -- in a light-weight (8 g) and compact way. Prior tape spring mechanisms have not yet used a string-tensioned truss formation. By tensioning the tape spring via a string mechanism, the wheel's emerging deployable legs are strong enough to traverse obstacles greater than one wheel diameter. Yet, the legs can also be stowed using just the weight of the rover to coil the tape spring. Adjusting the string pretension allows for optimization of the legs' transverse buckling load, resulting in a strong truss despite low mass and stowed volume. We validate the truss's capability by incorporating it into a two-wheeled mobile rover platform, demonstrating utility in mobility across obstructed terrain.

Abstract:
Robust feature encoding constitutes the foundation of UAV tracking by enabling the nuanced perception of target appearance and motion, thereby playing a pivotal role in ensuring reliable tracking. However, existing feature encoding methods often overlook critical illumination and viewpoint cues, which are essential for robust perception under challenging nighttime conditions, leading to degraded tracking performance. To overcome the above limitation, this work proposes a dual prompt-driven feature encoding method that integrates prompt-conditioned feature adaptation and context-aware prompt evolution to promote domain-invariant feature encoding. Specifically, the pyramid illumination prompter is proposed to extract multi-scale frequency-aware illumination prompts. The dynamic viewpoint prompter modulates deformable convolution offsets to accommodate viewpoint variations, enabling the tracker to learn view-invariant features. Extensive experiments validate the effectiveness of the proposed dual prompt-driven tracker (DPTracker) in tackling nighttime UAV tracking. Ablation studies highlight the contribution of each component in DPTracker. Real-world tests under diverse nighttime UAV tracking scenarios further demonstrate the robustness and practical utility. The code and demo videos are available at https://github.com/yiheng-wang-duke/DPTracker.

Abstract:
In multi-robot systems operating under uncertainty, maintaining safe inter-robot distances while avoiding collisions with obstacles is crucial. Although chance-constrained methods have been widely adopted to handle such uncertainties, existing approaches often exhibit conservatism due to their reliance on linearized integration regions. To address this limitation, this paper introduces a novel probabilistic Mahalanobis distance constraint that enables tighter reformulations of collision avoidance constraints both between robots and between robots and obstacles. These constraints are integrated into a Model Predictive Path Integral (MPPI) control framework for efficient trajectory optimization. The effectiveness of the proposed method is validated through comprehensive simulations comparing it against state-of-the-art approaches, as well as through real-world experiments conducted across various scenarios.
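
To make the idea of a Mahalanobis-distance test concrete, here is a minimal sketch for two point robots in the plane with independent Gaussian position errors. It checks whether the zero relative position (a collision) lies outside a chosen confidence ellipse of the relative-position distribution. The function names, the point-robot assumption, and the 2D setting are illustrative assumptions; this is not the constraint formulation proposed in the paper.

```python
import math


def mahalanobis_sq_2d(delta, cov):
    """Squared Mahalanobis distance of a 2D offset under a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    dx, dy = delta
    # delta^T cov^{-1} delta, using the closed-form inverse of a 2x2 matrix.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def chi2_threshold_2dof(confidence):
    """Squared radius enclosing `confidence` of a 2D Gaussian's mass.

    For 2 degrees of freedom the chi-square quantile has the closed form
    -2 * ln(1 - confidence).
    """
    return -2.0 * math.log(1.0 - confidence)


def outside_confidence_ellipse(p_i, cov_i, p_j, cov_j, confidence=0.95):
    """True if zero relative position lies outside the confidence ellipse."""
    delta = (p_i[0] - p_j[0], p_i[1] - p_j[1])
    # Independent errors: covariances of the relative position add.
    cov = [[cov_i[r][k] + cov_j[r][k] for k in range(2)] for r in range(2)]
    return mahalanobis_sq_2d(delta, cov) >= chi2_threshold_2dof(confidence)
```

Because the test uses the full covariance rather than a linearized half-space, the admissible region follows the shape of the uncertainty ellipse.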

Abstract:
Modeling the dynamics of micro-mobility vehicles (MMV) is becoming increasingly important for training autonomous vehicle systems and building urban traffic simulations. However, mainstream tools rely on variants of the Kinematic Bicycle Model (KBM) or mode-specific physics that miss tire slip, load transfer, and rider/vehicle lean. To our knowledge, no unified, physics-based model captures these dynamics across the full range of common MMVs and wheel layouts. We propose the "Generalized Micro-mobility Model" (GM3), a tire-level formulation based on the tire brush representation that supports arbitrary wheel configurations, including single/double track and multi-wheel platforms. We introduce an interactive model-agnostic evaluation and visualization framework that decouples vehicle/layout specification from dynamics to compare the GM3 with the KBM and other models, consisting of fixed step RK4 integration, human-in-the-loop and scripted control, real-time trajectory traces, and logging for analysis. We also empirically validate the GM3 on the Stanford Drone Dataset's deathCircle (roundabout) scene.
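
For readers unfamiliar with the baseline, the following minimal sketch shows a rear-axle Kinematic Bicycle Model advanced with one fixed-step RK4 update. The state layout `(x, y, heading, speed)`, the control layout `(acceleration, steering angle)`, and the function names are assumptions chosen for illustration; the sketch does not represent the GM3 or the paper's evaluation framework.

```python
import math


def kbm_derivative(state, control, wheelbase):
    """Time derivative of a rear-axle kinematic bicycle model.

    state: (x, y, heading, speed); control: (acceleration, steering angle).
    """
    _, _, theta, v = state
    accel, steer = control
    return (
        v * math.cos(theta),
        v * math.sin(theta),
        v * math.tan(steer) / wheelbase,
        accel,
    )


def rk4_step(f, state, control, dt, *args):
    """One classical fourth-order Runge-Kutta step with a held control."""

    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = f(state, control, *args)
    k2 = f(shifted(state, k1, dt / 2.0), control, *args)
    k3 = f(shifted(state, k2, dt / 2.0), control, *args)
    k4 = f(shifted(state, k3, dt), control, *args)
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )
```

A purely kinematic model of this kind has no notion of tire slip, load transfer, or lean, which is the gap the abstract describes.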

Abstract:
Manipulation in confined and cluttered environments remains a significant challenge due to partial observability and complex configuration spaces. Effective manipulation in such environments requires an intelligent exploration strategy to safely understand the scene and search for the target. In this paper, we propose COMPASS, a multi-stage exploration and manipulation framework featuring a manipulation-aware sampling-based planner. First, we reduce collision risks with a near-field awareness scan to build a local collision map. Additionally, we employ a multi-objective utility function to find viewpoints that are both informative and conducive to subsequent manipulation. Moreover, we perform a constrained manipulation optimization strategy to generate manipulation poses that respect obstacle constraints. To systematically evaluate the method's performance under these difficulties, we propose a benchmark of confined-space exploration and manipulation containing four levels of challenging scenarios. Compared to exploration methods designed for other robots and only considering information gain, our framework increases manipulation success rate by 24.25% in simulations. Real-world experiments demonstrate our method's capability for active sensing and manipulation in confined environments.

Abstract:
Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation as the environment changes dynamically. However, most prior works update their world representation only at discrete milestones, such as waypoints or the end of an action step. Such sparse updates leave robots with limited awareness between updates, causing missed objects, delayed error detection, and slower replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that separates strategic planning from continuous environmental monitoring. BINDER combines a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a Video-LLM for continuous monitoring). The DRM handles strategic planning through structured 3D scene updates and guides the IRM's focus, while the IRM processes video streams to update memory, proactively adjust actions, and trigger replanning when needed. This bidirectional coordination ensures continuous awareness without costly updates, enabling reliable and robust operation under dynamic conditions. We evaluate BINDER in three real-world environments where objects are moved during execution and show that it achieves substantially higher success rates and efficiency than state-of-the-art baselines, confirming its effectiveness for real-world deployment.

Abstract:
As robots transition from performing repetitive tasks to collaborating with humans, understanding human intent becomes crucial to effective interaction. Anticipation enables robots to predict human actions, while proactivity allows them to take initiative and guide human behavior toward optimal outcomes. Although research has largely focused on how robots infer and respond to human intentions, less attention has been paid to how robots communicate their own intent. This paper introduces visual proactivity, a novel, simple yet effective approach that enables robots to communicate their intentions through visual feedback, influencing human behavior and enhancing transparency and fluency. We develop and evaluate proactive robotic behaviors in a human-to-robot handover scenario, where a user study validates human perception of reactive, anticipatory, and proactive behaviors. The results demonstrate that effective visual proactivity fosters better alignment and coordination, paving the way for more intuitive human-robot collaboration.

Abstract:
Indoor 3D occupancy mapping, crucial for robotic perception, struggles with occlusions and reappearing surfaces in continuous observations. Existing methods either fuse frames without discernment, causing occlusion-induced errors to persist and contaminate global representations, or recompute scenes from scratch, sacrificing efficiency and stability. To address these challenges, we propose MemOcc, a novel memory-augmented framework for continuous occupancy mapping using read-write-retrieve operations. MemOcc employs a hierarchical memory design with cooperative short- and long-term tiers. Its Short-Term Memory Cache module uses visibility-gated writes and confidence maps to stabilize voxel predictions and filter occlusion noise, while the Long-Term Memory Bank stores scene priors for rapid retrieval, accelerating convergence in revisited regions. As a plug-and-play module, MemOcc integrates seamlessly with existing 2D-to-3D pipelines without altering backbones or training. Experiments on indoor benchmarks demonstrate MemOcc reduces error propagation by 25% and improves mapping speed over state-of-the-art methods, achieving robust, real-time performance. By selectively retaining reliable evidence and enabling efficient retrieval, MemOcc paves the way for scalable indoor perception in robotics and augmented reality.

Abstract:
Long-distance teleoperation will enable forthcoming scientific and commercial developments on the lunar surface such as in-situ resource utilisation. However, the large distances involved in these applications introduce multi-second signal delays, which may impair user performance and lead to reduced trust in the system. This work presents a user study of 26 participants exploring the impact of open-loop model-mediated teleoperation (MMT) in providing real-time feedback alongside a delayed video stream of the remote regolith simulant sample collection task. In this system, an imperfect but computationally efficient model was employed to visuo-haptically render the simulant. Three conditions were examined: MMT with visual feedback, MMT with visuo-haptic feedback, and direct teleoperation with delayed visual feedback. Users reported greater trust scores in the visual and visuo-haptic MMT conditions (+13%, +12%, respectively) compared with delayed direct teleoperation. In addition, they demonstrated more trusting behaviour in the MMT conditions by reducing the duration of wait periods. Performance metrics were also improved in the MMT conditions (faster completion time), although no significant differences were observed between the two MMT feedback types. These results suggest that, despite using an approximate representation of a complex environment, MMT is a valuable tool for improving performance and developing trust in delayed teleoperation systems.

Abstract:
This work presents Kinodynamic Adaptive Robot Coordination (K-ARC), a novel algorithm for multi-robot kinodynamic planning. Our experimental results show the capability of K-ARC to plan for up to 32 planar mobile robots, while achieving up to an order of magnitude of speed-up compared to previous methods in various scenarios. K-ARC is able to achieve this due to its two main properties. First, K-ARC constructs its solution iteratively by planning in segments, where initial kinodynamic paths are found through optimization-based approaches and the inter-robot conflicts are resolved through sampling-based approaches. Interleaving sampling-based and optimization-based approaches allows K-ARC to leverage the strengths of each in the sections of the planning process to which it is better suited, whereas previous methods tend to emphasize one over the other. Second, K-ARC builds on a previously proposed multi-robot motion planning framework, Adaptive Robot Coordination (ARC), and inherits its strength of focusing on coordination between robots only when needed, saving computational effort. We show how the combination of these two properties allows K-ARC to achieve overall better performance in our simulated experiments with increasing numbers of robots, increasing problem difficulty, and increasing complexity of robot dynamics.

Abstract:
Underwater acoustic communication, characterized by limited bandwidth, high latency, and low reliability, poses significant challenges for data exchange in bathymetric collaborative simultaneous localization and mapping (CSLAM). In this article, we introduce a novel vector quantization (VQ) method called ID(O) for mapping data compression in bathymetric CSLAM. ID(O) encodes the map into an index map (I), a central depth map (D), and an orientation map (O). To accommodate strict communication constraints, orientations can be partially or fully excluded from transmission, and we propose a method to estimate these orientations during map restoration. Moreover, we integrate ID(O) within a feature-based bathymetric CSLAM framework named TTT CSLAM. Extensive experiments on two large-scale sea trial datasets demonstrate that ID(O) achieves about 40% higher restoration accuracy than the baseline method using principal component analysis. TTT CSLAM with ID(O) can match that with lossless compression regarding mapping accuracy and efficiency, and it is robust against 40% packet loss and large dead reckoning drift errors across diverse environments. To the best of the authors' knowledge, ID(O) is the first VQ method for bathymetric data compression, and TTT CSLAM with ID(O) is the first bathymetric CSLAM tested within an underwater communication network employed by acoustic modems.

Abstract:
Humans possess a large reachable space in the 3D world, enabling interactions with objects at varying heights and distances. However, realizing such large-space reaching on humanoids is a complex whole-body control (WBC) problem. Learning from scratch often leads to optimization difficulty and poor sim2real transferability. To address these challenges, we present Real-world-Ready Skill Space (R2S2), a structural skill prior that helps autonomous whole-body-control task execution in an efficient manner while maintaining sim2real transferability. Inheriting knowledge from a set of real-world-ready primitive skills to ease multi-skill learning, R2S2 further expands the capability of primitive skills and learns a unified structural skill representation. By sampling from R2S2, we unleash humanoid reaching potential in many real-world tasks. As a beneficial side effect, R2S2 can also support humanoid whole-body teleoperation with a large reachable space. We validate the generalizability of R2S2 in various challenging goal-reaching tasks across different robot platforms, simulation and real world.

Abstract:
Language-guided robotic grasping in cluttered environments presents significant challenges due to severe occlusions and complex scene structures, which often hinder accurate target localization. Existing approaches typically suffer from limited observational capabilities, resulting in suboptimal exploration of the target object. In this paper, we propose a novel Active-Perceptive Language-Oriented Grasp Policy (APeG) for heavily cluttered scenes. APeG develops an active perception scheme in the grasp pipeline via an occlusion-aware, semantic-guided viewpoint optimization strategy, enabling efficient exploration of cluttered scenes. In addition, a grasp-wise Reinforcement Learning (RL) policy is proposed to select robust grasp poses. Extensive real-world experiments validate the effectiveness of APeG, demonstrating significant improvements in both task success rate and operational efficiency over existing baselines, highlighting its potential for practical deployment in language-conditioned robotic manipulation.

Abstract:
This work explores the indirect herding control problem for a single pursuer agent regulating a single target agent to a goal location. To accommodate the constraints of sensing hardware, an event-triggered inter-agent influence model between the pursuer agent and target agent is considered. Motivated by fielded sensing systems, we present an event-triggered controller and trigger mechanism that satisfies a user-selected minimum inter-event time. The combined pursuer-target system is presented as a switched system that alternates between stable and unstable modes. A dwell-time analysis is completed to develop a closed-form solution for the maximum time the pursuer agent can allow the target agent to evolve in the unstable mode before requiring a control input update. The presented trigger function is designed to produce inter-event times that are upper-bounded by the maximum dwell time. The effectiveness of the proposed approach is demonstrated through both simulated and experimental studies, where a pursuer agent successfully regulates a target agent to a desired goal location.

Abstract:
Parallel evaluation of robotic system environments is becoming increasingly popular in modern robotics applications for machine learning and stochastic control. At the same time, the field of model-based control has matured enough to provide solutions that cover the needs of sophisticated robotics platforms. However, few works address the parallelization of such solvers to be combined with the above approaches and accelerate research in robot planning and control. We present preliminary results towards a novel implementation of a batched SQP solver for equality-constrained optimal control. After linearizing the dynamics in the SQP step, we employ a state-control equality constrained LQR solver. The additional equality constraints yield a structured system at each stage that can be solved via a Riccati-recursion-based block elimination. We evaluate our approach on an inverse-dynamics-based optimal control problem, in contrast to the forward-dynamics formulations typical of related works. Our results demonstrate computational efficiency and structural advantages for massively parallel environments. Our implementation, available here, is developed in PyTorch, taking advantage of the library's batched linear algebra suite for parallelization.
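
As background for the Riccati-recursion step, the sketch below runs the standard backward pass for a scalar, unconstrained, finite-horizon LQR problem. It is a deliberately simplified assumption-laden example: it is not batched, it does not use PyTorch, and it omits the state-control equality constraints and block elimination that the abstract describes. The function name and argument layout are hypothetical.

```python
def lqr_backward_pass(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for the scalar system x[k+1] = a*x[k] + b*u[k].

    Stage cost is q*x^2 + r*u^2 and terminal cost is q_final*x^2.
    Returns the feedback gains (u[k] = -gains[k] * x[k]) and the
    cost-to-go coefficient at the first time step.
    """
    p = q_final
    gains = []
    for _ in range(horizon):
        # Gain that minimizes the stage cost plus the cost-to-go.
        k = (b * p * a) / (r + b * p * b)
        # Riccati update of the quadratic cost-to-go coefficient.
        p = q + a * p * a - a * p * b * k
        gains.append(k)
    gains.reverse()  # gains[0] now applies at the first time step
    return gains, p
```

In a batched setting the same recursion would be applied to many problem instances at once, with the scalar operations replaced by batched matrix operations.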

Abstract:
Following their success in natural language processing and computer vision, foundation models that are pre-trained on large-scale multi-task datasets have also shown great potential in robotics. However, most existing robot foundation models rely solely on 2D image observations, ignoring 3D geometric information, which is essential for robots to perceive and reason about the 3D world. In this paper, we introduce FP3, a large-scale 3D foundation policy model for robotic manipulation. FP3 builds on a scalable diffusion transformer architecture and is pre-trained on 60k trajectories with point cloud observations. With the model design and diverse pre-training data, FP3 can be efficiently fine-tuned for downstream tasks while exhibiting strong generalization capabilities. Experiments on real robots demonstrate that with only 80 demonstrations, FP3 is able to learn a new task with a success rate of over 90% in novel environments with unseen objects, significantly surpassing existing robot foundation models.

Abstract:
In this work, we investigate how a state-of-the-art grasp planner based on deep reinforcement learning performs when applied to a soft-rigid gripper in a decluttering task. The gripper, called Soft ScoopGripper, is endowed with a rigid scoop-shaped part that facilitates the interaction with the environment and with objects. We hypothesize that the clever design of such a gripper can facilitate the learning process, reducing the number of required training steps and eliminating the need for learning non-prehensile actions, such as pushing. To validate our hypothesis, we conducted experiments in both simulated and real-world environments, comparing the selected gripper with a rigid parallel-jaw gripper and a four-fingered soft gripper. Results show that the Soft ScoopGripper learns to effectively declutter scenes using a single action (grasping) instead of two (pushing and grasping). This is because the scoop-shaped add-on allows non-prehensile motions to be performed during the grasp action.

Abstract:
This paper presents a soft robot finger capable of adaptive-twist deformation to grasp objects by wrapping them. For a soft hand to grasp and pick up one object from among multiple densely packed objects, a soft finger requires the adaptive-twist deformation function in both in-plane and out-of-plane directions. The function allows the finger to be inserted deeply into a limited gap among objects. Once inserted, the soft finger requires appropriate control of the grasping force normal to the contact surface, thereby maintaining the twisted deformation. In this paper, we refer to this type of grasping as grasping by wrapping. To achieve these two functions with a single actuation source, we propose a variable stiffness mechanism that adaptively changes its stiffness as the pressure increases. We conduct a finite element analysis (FEA) on the proposed mechanism and determine its design parameter based on the FEA result. Using the developed soft finger, we report basic experimental results and demonstrations on grasping various objects.

Abstract:
We developed a simple peg-in-hole strategy that uses flexible joints and peg rotations. Even when the circular peg in the peg-in-hole assembly contains position and orientation errors, it can be inserted in a passive and robust manner. Additionally, using force-torque sensors to estimate the contact position allows the correction of the orientation of the peg and its insertion into the hole if the initial attempt fails. We conducted horizontal and vertical peg-in-hole experiments with random position and orientation errors to demonstrate the effectiveness of the developed method. This method does not rely on high-frequency sensors or servos, which enables a quick and low-cost peg-in-hole assembly that tolerates position and orientation errors and works in different insertion directions.

Abstract:
This paper introduces H-MaP, a hybrid sequential manipulation planner that addresses complex tasks requiring both sequential actions and dynamic contact mode switches. Our approach reduces configuration space dimensionality by decoupling object trajectory planning from manipulation planning through object-based waypoint generation, informed contact sampling, and optimization-based motion planning. This architecture enables handling of challenging scenarios involving tool use, auxiliary object manipulation, and bimanual coordination. Experimental results across seven diverse tasks demonstrate H-MaP's superior performance compared to existing methods, particularly in highly constrained environments where traditional approaches fail due to local minima or scalability issues. The planner's effectiveness is validated through both simulation and real-robot experiments. https://sites.google.com/view/h-map/

Abstract:
In daily domestic settings, frequently used objects like cups often have unfixed positions and multiple instances within the same category, and their carriers frequently change as well. As a result, it becomes challenging for a robot to efficiently navigate to a specific instance. To tackle this challenge, the robot must capture and update scene changes and plans continuously. However, current object navigation approaches primarily focus on the semantic level and lack the ability to dynamically update scene representation. In contrast, this paper captures the relationships between frequently used objects and their static carriers. It constructs an open-vocabulary Carrier-Relationship Scene Graph (CRSG) and updates the carrying status during robot navigation to reflect the dynamic changes of the scene. Based on the CRSG, we further propose an instance navigation strategy that models the navigation process as a Markov Decision Process. At each step, decisions are informed by the Large Language Model's commonsense knowledge and visual-language feature similarity. We designed a series of long-sequence navigation tasks for frequently used everyday items in the Habitat simulator. The results demonstrate that by updating the CRSG, the robot can efficiently navigate to moved targets. Additionally, we conducted extensive experiments on a real robot, demonstrating the effectiveness of our method and exploring its limitations. The project page can be found here: https://OpenIN-nav.github.io.

Abstract:
Implicit Neural Representation (INR)-based SLAM has a critical issue: because the neural network's weights themselves represent the map, all keyframes must be stored in memory for post-training whenever a remapping is needed. To address this, previous INR-based SLAM work proposed methods to modify INR-based maps without changing the neural network's weights. However, these approaches suffer from low memory efficiency and increased space complexity. In this paper, we introduce a remapping method for INR-based maps that does not require post-training the neural network's weights and incurs a low space cost. The problem of function modification, such as updating a map defined as a neural network function, can be viewed as transforming the function's domain. Leveraging function domain transformation, we propose a method to update INR-based maps by identifying the transformation function between the post-optimization and pre-optimization domains. Additionally, to prevent cases where the transformation between the post-optimization and pre-optimization domains does not form a one-to-many relationship, we introduce a temporal domain and propose a method to find the spatial coordinate transformation function accordingly. Evaluations on INR-based techniques demonstrate that our proposed method effectively updates maps while requiring significantly less memory compared to existing remapping approaches.

Abstract:
We present a benchmarking study of vision-based robotic grasping algorithms with distinct approaches, and provide a comparative analysis. In particular, we compare two machine-learning-based and two analytical algorithms using an existing benchmarking protocol from the literature and determine the algorithms' strengths and weaknesses under different experimental conditions. These conditions include variations in lighting, background textures, cameras with different noise levels, and grippers. We also run analogous experiments in simulations and with real robots and present the discrepancies. Some experiments are also run in two different laboratories using the same protocols to further analyze the repeatability of our results. We believe that this study, comprising 5040 experiments, provides important insights into the role and challenges of systematic experimentation in robotic manipulation, and guides the development of new algorithms by considering the factors that could impact the performance. The experiment recordings and our benchmarking software are publicly available.

Abstract:
We propose combining preference and rating query types into a mixed-type query selection to learn reward functions for robotic decision making to improve scientific data collection. Mixed-type query selection allows the scientist operating a robot to specify the robot's tradeoffs and goals in terms of both ratings, giving a score to one robot plan, and preferences, selecting one plan over another. While previous methods have used active learning to allow the user to specify tradeoffs between objectives using ratings and preferences individually, our proposed method considers using multiple query types. We assume a user responds to these queries with some noise on their true preferences. Online estimation of error model parameters is difficult; therefore, we show results with both a tuned known error model and a heuristic mixed-type query selection method. When the error model is known, we show performance increases using our mixed-type query selection versus using only ratings or only preferences. In the more realistic case with an unknown error model, we show our heuristic performs better than the worst-case single query type in all cases we tested.

Abstract:
Multi-label image classification is a significant challenge in computer vision due to the presence of multiple interconnected objects in a single image. Traditional convolutional neural networks (CNNs) often fail to capture semantic dependencies between labels, limiting performance in complex scenes. To address this issue, we propose a novel framework that combines a Knowledge-Guided Graph Convolutional Network (KGGCN) with a Darknet53 backbone to improve label dependency modeling. Our method fuses external semantic information from ConceptNet5, which allows the model to learn contextual relationships between labels. We evaluate this approach on two benchmark datasets, VOC 2007 and COCO, and obtain state-of-the-art results. KGGCN achieves a mean Average Precision (mAP) of 96.24% on VOC 2007 and 85.25% on COCO, outperforming existing methods in most categories. Moreover, ablation studies further highlight the benefits of external knowledge integration, which contributes to higher mAP scores. Finally, our proposed method KGGCN demonstrates the effectiveness of combining deep visual features with structured semantic knowledge for multi-label image classification.

Abstract:
Barometric tactile sensors present a cheap and customizable method for adding tactile sensing to robotic platforms. These sensors consist of commercially available MEMS barometers embedded in an elastomer. However, as the sensing surface and elastomer volume increase in complexity, time-dependent material dynamics reduce sensing accuracy. We present a collection of inference and usage recommendations towards mitigating these dynamics and improving sensor force and localization resolution. Using two custom, curved, barometric tactile sensors as case studies, we demonstrate that a new data collection regime alone can improve normal force predictions by 30.4% compared to prior work. We further introduce a Binned-RNN inference architecture and demonstrate its efficacy through select ablations. Our model is small enough to run on the sensor's integrated microcontroller at 100 Hz and achieves a minimum spatial resolution of 0.86 mm on an ellipsoid tactile sensor. Finally, we demonstrate the robustness of these sensing capabilities through freeform contact and controlled object rolling.

Abstract:
There has been a growing interest in autonomous systems designed to operate in adverse conditions (e.g. smoke, dust), where the visible light spectrum fails. In this context, Ultra-wideband (UWB) radar is capable of penetrating through such challenging environmental conditions due to the lower frequency components within its broad bandwidth. Therefore, UWB radar has emerged as a potential sensing technology for Simultaneous Localization and Mapping (SLAM) in vision-denied environments where optical sensors (e.g. LiDAR, Camera) are prone to failure. Existing approaches involving UWB radar as the primary exteroceptive sensor generally extract features in the environment, which are later initialized as landmarks in a map. However, these methods are constrained by the number of distinguishable features in the environment. Hence, this paper proposes a novel method incorporating UWB Angle of Arrival (AOA) measurements into UWB radar-based SLAM systems to improve the accuracy and scalability of SLAM in feature-deficient environments. The AOA measurements are obtained using UWB anchor-tag units which are dynamically deployed by the robot in featureless areas during mapping of the environment. This paper thoroughly discusses prevailing constraints associated with UWB AOA measurement units and presents solutions to overcome them. Our experimental results show that integrating UWB AOA units with UWB radar enables SLAM in vision-denied feature-deficient environments.

Abstract:
A central challenge in lifelong imitation learning (LIL) is enabling agents to acquire new skills from expert demonstrations while retaining knowledge of previously learned tasks. Achieving this requires preserving the low-dimensional manifolds and geometric structures that underlie task representations across sequential learning. However, existing distillation methods, which rely on L2-norm feature matching in the raw feature space, are highly sensitive to noise and high-dimensional variations, often failing to preserve the intrinsic task manifolds. To overcome these limitations, we introduce SPREAD, a geometry-preserving framework that leverages singular value decomposition (SVD) to align the representations of policies from consecutive tasks within low-rank subspaces. This subspace alignment preserves the intrinsic low-dimensional geometry of multimodal features, thereby facilitating stable knowledge transfer, enhancing robustness, and improving generalization across tasks. In addition, we propose a confidence-guided policy distillation strategy that applies a Kullback-Leibler divergence loss restricted to the top-M most confident action samples, emphasizing reliable action modes and improving optimization stability. Empirical results on the LIBERO benchmark demonstrate that SPREAD significantly improves knowledge transfer across tasks, mitigates catastrophic forgetting, and achieves superior overall performance compared to state-of-the-art LIL methods.

Abstract:
Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.

Abstract:
We present ROBO (Riemannian Overlapping Block Optimization), a distributed and parallel approach to multi-robot pose graph optimization (PGO) based on the idea of overlapping domain decomposition. ROBO offers a middle ground between centralized and fully distributed solvers, where the amount of pose information shared between robots at each optimization iteration can be set according to the available communication resources. Sharing additional pose information between neighboring robots effectively creates overlapping optimization blocks in the underlying pose graph, which substantially reduces the number of iterations required to converge. Through extensive experiments on benchmark PGO datasets, we demonstrate the applicability and feasibility of ROBO in different initialization scenarios, using various cost functions, and under different communication regimes. We also analyze the tradeoff between the increased communication and local computation required by ROBO's overlapping blocks and the resulting faster convergence. We show that overlaps with an average inter-robot data cost of only 36 Kb per iteration can converge 3.1x faster in terms of iterations than state-of-the-art distributed PGO approaches. Furthermore, we develop an asynchronous variant of ROBO that is robust to network delays and suitable for real-world robotic applications.

Abstract:
Operating household appliances by reading and understanding user manuals remains a fundamental and challenging problem in robotics. Recent works leverage large language models (LLMs) and vision-language models (VLMs) to interpret manuals, improving appliance operation success. However, these approaches fail when manuals are unavailable or incomplete. In this paper, we introduce an autonomous assistant for robotic appliance operation, built upon an LLM/VLM-powered multi-agent collaborative framework. Our system can read, comprehend, and summarize manuals, autonomously infer operational logic, and execute actions on appliances with a robotic arm. Importantly, for unseen appliances without manuals, it can acquire operational knowledge from generalized manuals and on-demand web search. Extensive evaluations on over one thousand tasks show that our framework substantially outperforms baselines and achieves robust performance in simulation and real-world experiments.

Abstract:
Predicting lane change maneuvers is essential for ensuring safe autonomous driving, especially in complex urban environments. Building upon prior multi-modal and graph-based approaches, this work introduces a novel transformer-based architecture for multi-horizon lane change prediction that jointly estimates the lane change maneuver and the lane change phase. The proposed model integrates visual information from surround-view cameras, semantic masks for free space and lane markings, interaction-aware graph representations, and ego-vehicle state signals, within a unified transformer framework to capture spatial-temporal dependencies. In addition, a multi-level uncertainty estimation branch quantifies confidence at the level of modality, fusion, and prediction, to enhance interpretability and reliability. Experiments are conducted on WylonSet++, an extended in-house dataset collected using an instrumented test vehicle, annotated for lane change behavior analysis and maneuver phase transitions. The dataset comprises synchronized front-facing camera images, left and right surround-view camera images, together with vehicle state data. The dataset contains approximately 600 lane change sequences, providing the foundation for this study. Extensive evaluations demonstrate strong performance in anticipating lane change maneuvers and phase progression across short- and long-term prediction horizons in diverse real-world traffic scenarios.

Abstract:
We present AnyThermal, a thermal backbone that captures robust task-agnostic thermal features suitable for a variety of tasks such as cross-modal place recognition, thermal segmentation, and monocular depth estimation using thermal images. Existing thermal backbones that follow task-specific training from small-scale data result in utility limited to a specific environment and task. Unlike prior methods, AnyThermal can be used for a wide range of environments (indoor, aerial, off-road, urban) and tasks, all without task-specific training. Our key insight is to distill the feature representations from visual foundation models such as DINOv2 into a thermal encoder using thermal data from these multiple environments. To bridge the diversity gap of the existing RGB-Thermal datasets, we introduce the TartanRGBT platform, the first open-source data collection platform with synced RGB-Thermal image acquisition. We use this payload to collect the TartanRGBT dataset, a diverse and balanced dataset collected in 4 environments. We demonstrate the efficacy of AnyThermal and TartanRGBT, achieving state-of-the-art results with improvements of up to 36% across diverse environments and downstream tasks on existing datasets.

Abstract:
Creating flexible 3D scenes from a single image is vital when direct 3D data acquisition is costly or impractical. We introduce NavCrafter, a novel framework that explores 3D scenes from a single image by synthesizing novel-view video sequences with camera controllability and temporal-spatial consistency. NavCrafter leverages video diffusion models to capture rich 3D priors and adopts a geometry-aware expansion strategy to progressively extend scene coverage. To enable controllable multi-view synthesis, we introduce a multi-stage camera control mechanism that conditions diffusion models with diverse trajectories via dual-branch camera injection and attention modulation. We further propose a collision-aware camera trajectory planner and an enhanced 3D Gaussian Splatting (3DGS) pipeline with depth-aligned supervision, structural regularization and refinement. Extensive experiments demonstrate that NavCrafter achieves state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improves 3D reconstruction fidelity.

Abstract:
Navigation Foundation Models (NFMs) trained on large, cross-embodied datasets have demonstrated powerful generalizability on various scenarios. Adopting in-domain fine-tuning upon an NFM efficiently calibrates the visuomotor policy, promising further improvement even in a novel scenario. However, the fine-tuned models still suffer from poor obstacle avoidance or fail to properly reach the provided goals. Furthermore, such model updates on a small subset of data typically erode the pretrained prior, compromising the pretraining generalization. Consequently, fine-tuning instead deteriorates the model's capability for robust and accurate navigation. In this work, we present a novel fine-tuning method that leverages large-scale pretraining while efficiently learning in novel setups, such as environments or camera configurations. In particular, inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pretrained backbone using zero-initialized residual pathways, thereby learning geometric cues. This design enables the model to efficiently acquire in-domain geometry while preserving pretrained knowledge across various behaviors. Despite its simplicity, our comprehensive evaluation of real-world navigation suggests that our proposal effectively enables robust long-horizon navigation with minimal collisions and human intervention. Additionally, our offline analysis shows that the proposed strategy maintains or further improves action prediction capability beyond the fine-tuned dataset, providing a key insight into continual learning towards general navigation. The project page: https://toyotafrc.github.io/DCLING-Proj/
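
A minimal sketch of a zero-initialized residual pathway, in the spirit of ControlNet, is given below. The single-layer `tanh` backbone, the layer sizes, and all variable names are assumptions chosen for illustration; the point is only that the combined model reproduces the pretrained output exactly before any fine-tuning step.

```python
import numpy as np

rng = np.random.default_rng(0)

W_frozen = rng.normal(size=(8, 4))  # stand-in for pretrained backbone weights (kept fixed)
W_copy = W_frozen.copy()            # trainable copy, initialized from the backbone
W_zero = np.zeros((4, 4))           # zero-initialized projection on the residual path

def forward(x, W_copy, W_zero):
    base = np.tanh(x @ W_frozen)    # frozen pretrained features
    branch = np.tanh(x @ W_copy)    # features from the trainable copy
    return base + branch @ W_zero   # residual is exactly zero while W_zero is zero

x = rng.normal(size=(2, 8))
pretrained_out = np.tanh(x @ W_frozen)
initial_out = forward(x, W_copy, W_zero)
```

Because the residual starts at zero, fine-tuning begins from the pretrained behavior and only departs from it as `W_zero` and `W_copy` receive gradient updates.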

Abstract:
Humanoid robot soccer players face fundamental challenges in achieving stable motion execution and ball trajectory control, particularly under balance constraints during single-leg support phases. In this paper, we introduce MAKP (Multi-mode Accurate Kicking Policy), a novel motion generation-based end-to-end kicking paradigm that enables humanoid robots to perform accurate ball kicking while executing diverse kicking motions. MAKP uniquely integrates a diffusion-based motion generator to produce varied kicking trajectories and employs a three-stage learning strategy to address the inherent trade-off between motion similarity and kicking performance. Stage I focuses on stable motion tracking and single-leg balance maintenance, while Stage II optimizes ball kicking capabilities. In Stage III, we introduce a Multi-Critic mechanism combined with curriculum learning to further enhance the balance between kicking accuracy, motion similarity and robot stability. Real-world experiments on the Booster T1 platform validate the effectiveness of our approach.

Abstract:
With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism to determine when to ask clarification questions, as they implicitly rely on their learned representations. CLUE addresses this gap by converting the VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text-to-image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA fine-tuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive dataset for IVG, and a mixed ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning. Similarly, the ambiguity detector outperforms prior baselines. Overall, CLUE turns the internal cross-modal attention of a VLM into an explicit, spatially grounded signal for deciding when to ask. The data and code are publicly available at: mouadabrini.github.io/clue/

Abstract:
Predicting crowd intentions and trajectories is critical for a range of real-world applications, including social robotics and autonomous driving. Accurately modeling such behavior remains challenging due to the complexity of pairwise spatial-temporal interactions and the heterogeneous influence of groupwise dynamics. To address these challenges, we propose Hyper-STTN, a Hypergraph-augmented Spatial-Temporal Transformer Network for crowd trajectory prediction. Hyper-STTN constructs crowd hypergraphs with multiscale group sizes to model groupwise correlations, captured through spectral hypergraph convolution based on hypergraph random walk. In parallel, a spatial-temporal transformer is employed to learn pedestrians' pairwise latent interactions across multimodal dimensions. Finally, the heterogeneous groupwise and pairwise features are fused and aligned via a multimodal transformer. Extensive experiments on public pedestrian motion datasets demonstrate that Hyper-STTN consistently outperforms state-of-the-art baselines and ablation models. The project website is available at https://sites.google.com/view/hypersttn.
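
To make the hypergraph random walk concrete, the sketch below builds the transition matrix in one common formulation, P = D_v^{-1} H W D_e^{-1} H^T, where H is the vertex-by-hyperedge incidence matrix. The toy grouping of four pedestrians and the uniform weights are assumptions for illustration; the abstract does not state which variant the paper uses.

```python
import numpy as np

def hypergraph_transition(H, w):
    """Random-walk transition matrix P = D_v^{-1} H W D_e^{-1} H^T.

    H: (n_vertices, n_hyperedges) binary incidence matrix.
    w: (n_hyperedges,) positive hyperedge weights.
    """
    d_v = H @ w            # weighted vertex degrees
    d_e = H.sum(axis=0)    # hyperedge sizes
    return (H * (w / d_e)) @ H.T / d_v[:, None]

# Four pedestrians; group 0 = {0, 1, 2}, group 1 = {2, 3} (hypothetical grouping).
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]], dtype=float)
w = np.ones(2)
P = hypergraph_transition(H, w)
```

Each row of `P` sums to one, and pedestrians 0 and 3 have zero transition probability because they share no group, while pedestrian 2 bridges both groups.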

Abstract:
A significant bottleneck in humanoid policy learning is the acquisition of large-scale, diverse datasets, as collecting reliable real-world data remains both difficult and cost-prohibitive. To address this limitation, we introduce HumanoidExo, a novel system that transfers human motion to whole-body humanoid data. HumanoidExo offers a high-efficiency solution that minimizes the embodiment gap between the human demonstrator and the robot, thereby tackling the scarcity of whole-body humanoid data. By facilitating the collection of more voluminous and diverse datasets, our approach significantly enhances the performance of humanoid robots in dynamic, real-world scenarios. We evaluated our method across three challenging real-world tasks: table-top manipulation, manipulation integrated with stand-squat motions, and whole-body manipulation. Our results empirically demonstrate that HumanoidExo is a crucial addition to real-robot data, as it enables the humanoid policy to generalize to novel environments, learn complex whole-body control from only five real-robot demonstrations, and even acquire new skills (i.e., walking) solely from HumanoidExo data.

Abstract:
Visual localization is critical for AR navigation, AI-driven audio guidance, and mobile robot localization. However, traditional SLAM methods that rely on pre-built 3D maps suffer from high costs, privacy concerns, and sensitivity to environmental changes. Recent floorplan-based localization methods attempt to address these challenges by using 2D floorplans, eliminating the need for 3D map construction. Still, existing approaches are often impractical for real-world applications, as they are limited to specific layouts and fail to generalize beyond their training domains. We propose a novel approach that learns to semantically match visual cues from a camera image to a floorplan image with texts and symbols, inspired by the human ability to directly localize oneself using a complex floorplan image. To achieve this, we train a single, unified model on a diverse dataset of 1.2M images and 740K floorplans that we curated, which includes a new collection of semantically rich, real-world floorplans. This allows our model to generalize effectively to previously unseen areas and demonstrates potential towards zero-shot capabilities. Without making assumptions about camera poses or floorplan structures, our end-to-end model significantly outperforms existing methods and exhibits strong robustness to floorplan rotations, lighting changes, and different camera intrinsics, while effectively leveraging semantic cues like text.

Abstract:
Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives, including Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), to keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE) and approaching the supervised upper bound. These findings indicate that, for fused visual-tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, which enables more capable robotic perception.
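
The 30% figure follows from the two MAE values quoted above, taking the next-best method as the reference. A short check, with variable names chosen here for illustration:

```python
sarl_mae = 0.3955
next_best_mae = 0.5682

# Relative improvement measured against the next-best SSL method.
relative_improvement = (next_best_mae - sarl_mae) / next_best_mae
print(f"{relative_improvement:.1%}")  # prints 30.4%
```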

Abstract:
Contact feedback is essential for contact-rich robotic manipulation, as it allows the robot to detect subtle interaction changes and adjust its actions accordingly. Six-axis force-torque sensors are commonly used to obtain contact feedback, but their high cost and fragility have discouraged many researchers from adopting them in contact-rich tasks. To offer a more cost-efficient and easily accessible source of contact feedback, we present ShapeForce, a low-cost, plug-and-play soft wrist that provides force-like signals for contact-rich robotic manipulation. Inspired by how humans rely on relative force changes in contact rather than precise force magnitudes, ShapeForce converts external force and torque into measurable deformations of its compliant core, which are then estimated via marker-based pose tracking and converted into force-like signals. Our design eliminates the need for calibration or specialized electronics to obtain exact values, and instead focuses on capturing force and torque changes sufficient for enabling contact-rich manipulation. Extensive experiments across diverse contact-rich tasks and manipulation policies demonstrate that ShapeForce delivers performance comparable to six-axis force-torque sensors at an extremely low cost. More details of this project can be found at our project page: https://shapeforce.github.io/.

Abstract:
Modern autonomous Cyber-Physical Systems (CPSs), such as self-driving cars, face increasingly complex demands, and yet are expected to act reliably. The black-box nature often characterizing such systems, especially those relying on neural components, makes it impossible to fully verify the system behavior prior to deployment. Unfortunately, unexpected failures (cases when the system does not comply with its specification) are inevitable and may have catastrophic implications. To improve trust in the system and facilitate future mitigation after a failure occurs, it is important to try to derive an explanation for the unexpected system behavior. This paper introduces the novel concept of leveraging the framework of actual causality for CPS failure explanation. Up until now, this framework was only used to derive explanations in the context of simple systems, such as image classifiers. This paper addresses the theoretical gaps and provides the guidance needed to allow for correct explanation derivation in the CPS domain. Beyond the theoretical contribution, the paper presents two novel, practical, system-agnostic explanation derivation algorithms, allowing users to prioritize either explanation optimality or derivation efficiency. The approach is demonstrated and evaluated in the context of a neural-network-controlled autonomous car, designed to avoid collisions.

Abstract:
Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from offline data. In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency and safety concerns. These challenges are compounded for high-degree-of-freedom (DoF) systems that must learn from sparse rewards over long horizons. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-degree-of-freedom (DoF) systems in both simulation and the real world. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results demonstrate state-of-the-art performance in various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world.
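
The residual structure described above can be sketched as a base action plus a small, bounded correction. Everything in the block is a stand-in: `base_policy` is a toy function rather than a trained BC model, `residual_policy` is a hypothetical linear head whose parameters would be learned with off-policy RL, and `residual_scale` is an assumed hyperparameter.

```python
import numpy as np

def base_policy(obs):
    """Stand-in for a frozen BC policy, treated as a black box."""
    return np.clip(-0.5 * obs, -1.0, 1.0)

def residual_policy(obs, theta):
    """Hypothetical lightweight residual head; theta would be trained with off-policy RL."""
    return np.tanh(obs @ theta)

def act(obs, theta, residual_scale=0.1):
    # Per-step action = base action + small bounded correction.
    action = base_policy(obs) + residual_scale * residual_policy(obs, theta)
    return np.clip(action, -1.0, 1.0)

obs = np.array([0.2, -0.4, 1.0])
untrained = act(obs, np.zeros((3, 3)))  # zero residual: identical to the base policy
corrected = act(obs, np.ones((3, 3)))   # non-zero residual: bounded deviation
```

Bounding the correction with `tanh` and a small scale keeps early exploration close to the base behavior, which is one common motivation for residual RL on real hardware.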

Abstract:
Collision detection is a fundamental problem in robotics, but handling collisions between non-convex objects remains challenging. A common approach for representing non-convex geometry is a signed distance function (SDF). Voxel-based SDF (VoxelSDF) enables fast distance queries but suffers from discretization artifacts and high memory costs. Neural implicit SDF (NeuralSDF) provides a continuous and memory-efficient representation with generalization, yet its slow query speed has limited its use in collision detection. To overcome these limitations, this paper proposes a novel amortized NeuralSDF-mesh collision detection framework. NeuralSDF-mesh collisions are formulated as a constrained optimization problem at the triangle level, and the Karush-Kuhn-Tucker conditions are derived to enable the amortization. A learning-based amortized optimization directly predicts collisions in a single forward pass, eliminating iterative optimization procedures. The amortized model adopts an auto-decoder architecture, extending the advantages of NeuralSDF in memory efficiency and category-level generalization to collision detection. Experiments demonstrate substantial speedups over baseline methods while maintaining comparable contact quality and reduced memory usage. The proposed approach also exhibits category-level generalization to unseen objects and can be applied to various robotic simulation scenarios.
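
One natural reading of a triangle-level constrained problem is to minimize the SDF over the triangle, parameterized by barycentric coordinates constrained to the simplex; a negative minimum indicates penetration. The sketch below assumes that reading, uses an analytic sphere SDF as a stand-in for a NeuralSDF, and replaces both the iterative solver and the amortized predictor with brute-force sampling. It is not the paper's method.

```python
import numpy as np

def sphere_sdf(points, radius=1.0):
    """Analytic signed distance to an origin-centered sphere (stand-in for a NeuralSDF)."""
    return np.linalg.norm(points, axis=-1) - radius

def min_sdf_on_triangle(vertices, sdf, resolution=32):
    """Approximate the minimum of sdf over a triangle by sampling barycentric coordinates."""
    i, j = np.meshgrid(np.arange(resolution + 1), np.arange(resolution + 1), indexing="ij")
    mask = i + j <= resolution                # keep samples inside the simplex
    l1 = i[mask] / resolution
    l2 = j[mask] / resolution
    l3 = 1.0 - l1 - l2
    bary = np.stack([l1, l2, l3], axis=-1)    # (n_samples, 3), rows sum to 1
    points = bary @ vertices                  # (n_samples, 3) points on the triangle
    values = sdf(points)
    k = int(np.argmin(values))
    return float(values[k]), points[k]

far = np.array([[-1.0, -1.0, 2.0], [1.0, -1.0, 2.0], [0.0, 1.0, 2.0]])
near = far.copy()
near[:, 2] = 0.5                              # same triangle, lowered into the sphere

d_far, _ = min_sdf_on_triangle(far, sphere_sdf)
d_near, contact_point = min_sdf_on_triangle(near, sphere_sdf)
```

Here `d_far` is positive (no contact) and `d_near` is negative (penetration); an amortized model would aim to predict the minimizer directly instead of searching for it.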

Abstract:
Robot bimanual handovers (transferring an object between two arms) require careful coordination of timing, motion, and obstacle avoidance. Efficient, human-like object transfer between cooperating robots demands both spatial and tight temporal coordination. Existing approaches treat these requirements in isolation or rely on pre-computed trajectories that fail when obstacles or disturbances appear, degrading performance, producing segmented behavior, and introducing desynchronization. This paper introduces a dynamical systems framework that transitions each arm from independent asynchronous motion to coupled synchronous coordination. In this context, coupling denotes both the spatial coordination of the arms and their temporal synchronization. The framework's coordination and synchrony are robust to obstacles and disturbances along the arms' paths. Experiments on an upper torso dual-arm platform and on traditional manipulators show seamless handovers that remain stable despite obstructions, always preserving spatial coordination and temporal synchrony.

Abstract:
We propose MADP, a novel diffusion-model-based approach for collaboration in decentralized robot swarms. MADP leverages diffusion models to generate samples from complex and high-dimensional action distributions that capture the interdependencies between agents' actions. Each robot conditions policy sampling on a fused representation of its own observations and perceptual embeddings received from peers. To evaluate this approach, we task a team of holonomic robots piloted by MADP to address coverage control, a canonical multi-agent navigation problem. The policy is trained via imitation learning from a clairvoyant expert on the coverage control problem, with the diffusion process parameterized by a spatial transformer architecture to enable decentralized inference. We evaluate the system under varying numbers, locations, and variances of importance density functions, capturing the robustness demands of real-world coverage tasks. Experiments demonstrate that our model inherits valuable properties from diffusion models, generalizing across agent densities and environments, and consistently outperforming state-of-the-art baselines.

Abstract:
In emergency response scenarios, rapid acquisition of critical disaster information supports effective decision-making. Traditional geometric coverage-based path planning often struggles to balance efficiency and information value. To address this, we propose a Disaster-Aware Informative Path Planning (DAIPP) method, which integrates a Siamese UNet-based building damage recognition model and formulates a novel information value function that considers recognition results, model uncertainty, and flight cost. We design an improved Frontier-based path planning algorithm, named the Selective Frontier Algorithm (SFA), which enhances the selection of candidate points to achieve the prioritized exploration of critical regions. To validate its effectiveness, the proposed method is compared with coverage path planning, random planning, and Monte Carlo tree search (MCTS). Experiments on the xView2 dataset demonstrate that the proposed method outperforms baselines in terms of information coverage, semantic target hit rate, and weighted information coverage, providing strong support for efficient disaster perception in emergency response.

Abstract:
Service robots operate in household environments shared with humans, pets, and everyday objects, where they are highly susceptible to failures such as software crashes, hardware degradation, or unpredictable interactions. While roboticists strive to minimize failures, some remain inevitable, making it critical to mitigate their potential consequences for safe and reliable deployment. This paper introduces a novel safety formulation that evaluates both the probability of impactful interactions between robots and surrounding entities during failures, and the severity of their outcomes. By quantifying the impact of failures on different entities, our approach enables robots to make informed planning decisions that balance safety with task efficiency. To support systematic evaluation, we also present FailBench, a MuJoCo-based simulation framework for studying robot-environment interactions under diverse failure modes, including sensing issues and actuator malfunctions. Together, our safety formulation and FailBench provide a foundation for developing safer and more robust motion plans and learned policies in real-world household environments.
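
A safety term that combines probability and severity can be sketched as an expected impact, summed over nearby entities and traded off against task time. The additive cost, the `safety_weight` value, and the two example plans are all hypothetical; the abstract does not give the paper's actual formulation.

```python
def expected_impact(entities):
    """Sum of P(impactful interaction | failure) * severity over nearby entities."""
    return sum(e["p_impact"] * e["severity"] for e in entities)

def plan_cost(plan, safety_weight=5.0):
    # Assumed trade-off: task time plus weighted expected impact under failure.
    return plan["duration_s"] + safety_weight * expected_impact(plan["entities"])

plans = [
    {"name": "direct", "duration_s": 10.0,
     "entities": [{"kind": "pet", "p_impact": 0.30, "severity": 0.9}]},
    {"name": "detour", "duration_s": 11.0,
     "entities": [{"kind": "vase", "p_impact": 0.05, "severity": 0.4}]},
]

best = min(plans, key=plan_cost)
```

With these illustrative numbers the slightly longer detour wins, because the direct route passes close to an entity with both a high impact probability and a high severity.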

Abstract:
Reinforcement learning (RL) is a promising approach for robotic manipulation, but it can suffer from low sample efficiency and requires extensive exploration of large state-action spaces. Recent methods leverage the commonsense knowledge and reasoning abilities of large language models (LLMs) to guide exploration toward more meaningful states. However, LLMs can produce plans that are semantically plausible yet physically infeasible, yielding unreliable behavior. We introduce LLM-TALE, a framework that uses LLMs' planning to directly steer RL exploration. LLM-TALE integrates planning at both the task level and the affordance level, improving learning efficiency by directing agents toward semantically meaningful actions. Unlike prior approaches that assume optimal LLM-generated plans or rewards, LLM-TALE corrects suboptimality online and explores multimodal affordance-level plans without human supervision. We evaluate LLM-TALE on pick-and-place tasks in standard RL benchmarks, observing improvements in both sample efficiency and success rates over strong baselines. Real-robot experiments indicate promising zero-shot sim-to-real transfer. Code and supplementary material are available at https://llm-tale.github.io.

Abstract:
Labelling vision datasets, especially for segmentation tasks, is a laborious and costly process that stymies novel developments in agricultural robotics. In this paper, we present DropClick, a click-guided segmentation tool that simplifies the annotation process. Our system utilises single-click inputs on objects to generate pseudo-labels, which can replace manual annotations. DropClick stands out as it is a semi-automated approach and does not require a click for every object in the scene. It can therefore further reduce the required amount of user input drastically. We evaluate our method on two challenging agricultural robotic datasets, SB20 and BUP20 for plant and fruit segmentation, respectively. DropClick is first trained on a small subset of just 5 images from the original training data. This DropClick model can then be deployed as a one-click segmentation system and achieves comparable or higher performance than other one-click methods achieving an mIoU of 70.0 and 72.6 points, for SB20 and BUP20 respectively. DropClick then excels at maintaining high performance when clicks are not given (e.g. dropped); when 50% of the clicks are missing it still maintains an mIoU of 68.9 and 71.3 points, for SB20 and BUP20 respectively. We validate DropClick as a pseudo-labelling approach by taking its outputs to train a Mask2Former instance-based segmentation model in a semi-supervised manner. In this process, partially removing user input from DropClick yields similar high performance when compared to providing all clicks, at 70.1 vs 70.7 points AP50 for SB20 and no difference for BUP20 at 77.0 for both models; at the same time saving 46.3% of total input for SB20 and 31.9% for BUP20.

Abstract:
This paper addresses the challenge of active perception within autonomous navigation in complex, unknown environments. Revisiting the foundational principles of active perception, we introduce an end-to-end reinforcement learning framework in which a robot must not only reach a goal while avoiding obstacles, but also actively control its onboard camera to enhance situational awareness. The policy receives observations comprising the robot state, the current depth frame, and a particularly local geometry representation built from a short history of depth readings. To couple collision-free motion planning with information-driven active camera control, we augment the navigation reward with a voxel-based information metric. This enables an aerial robot to learn a robust policy that balances goal-directed motion with exploratory sensing. Extensive evaluation demonstrates that our strategy achieves safer flight compared to using fixed, non-actuated camera baselines while also inducing intrinsic exploratory behaviors.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) deliver striking photorealism, and extending it to large scenes opens new opportunities for semantic reasoning and prediction in applications such as autonomous driving. Today's state-of-the-art systems for large scenes primarily originate from LiDAR-based pipelines that utilize long-range depth sensing. However, they require costly high-channel sensors whose dense point clouds strain memory and computation, limiting scalability, fleet deployment, and optimization speed. We present MOGS, a monocular 3DGS framework that replaces active LiDAR depth with object-anchored, metrized dense depth derived from sparse visual-inertial (VI) structure-from-motion (SfM) cues. Our key idea is to exploit image semantics to hypothesize per-object shape priors, anchor them with sparse but metrically reliable SfM points, and propagate the resulting metric constraints across each object to produce dense depth. To address two key challenges, i.e., insufficient SfM coverage within objects and cross-object geometric inconsistency, MOGS introduces 1) a multi-scale shape consensus module that adaptively merges small segments into coarse objects best supported by SfM and fits them with parametric shape models, and 2) a cross-object depth refinement module that optimizes per-pixel depth under a combinatorial objective combining geometric consistency, prior anchoring, and edge-aware smoothness. Experiments on public datasets show that, with a low-cost VI sensor suite, MOGS reduces training time by up to 30.4% and memory consumption by 19.8%, while achieving high-quality rendering competitive with costly LiDAR-based approaches in large scenes. The source code is publicly available at https://github.com/ClarenceZSK/MOGS/.

Abstract:
Over the past several decades, robotic musicianship researchers have mainly focused on Western music with only limited efforts addressing musical styles from other regions, such as South Indian Classical music (a.k.a. Carnatic music) - a music form popular in the southern part of India. In this work, we present Hathaani v2, a robotic system capable of performing Carnatic music on the violin. The robot is designed to translate pitch information into left-hand finger placement and amplitude information into bowing changes and dynamics, based on any monophonic audio recording. The left-hand mechanism is capable of reaching arbitrary finger positions along the strings, allowing the robot to play gamakas - continuous pitch ornamentations that are fundamental to Carnatic music. The differential bowing mechanism provides both pressure and angle modulation, while maintaining mechanical rigidity and allowing visual engagement for the audience. We assessed the system's ability to perform gamakas through expert listening studies involving ten professional musicians on intonation, timbre, quality of bowing, hand coordination, gamaka authenticity, and clarity. The proposed robot outperforms the baseline on all of the evaluated parameters, achieving average scores exceeding 4 on a 5-point Likert scale (0.5 increments). This work has the potential to transform education and production of Carnatic music by offering programmatic solutions that support complex gamakas. Compared to software-based emulations, this physical violin-playing robot offers an accurate and expressive medium for conveying the nuances of Carnatic music performance.

Abstract:
This paper addresses the Dynamic UGV-UAV Cooperative Path Planning (DUCPP) problem involving one unmanned ground vehicle (UGV) assisted by one or more unmanned aerial vehicles (UAVs) operating on an uncertain road network with potentially impassable edges. DUCPP is particularly relevant for scenarios such as disaster response, emergency supply transport, and rescue operations, where a UGV must reach a specified destination in the presence of partially unknown road conditions. To enable the UGV to travel safely and efficiently to its destination, the UAV(s) dynamically inspect edges in the environment to identify and prune damaged or impassable edges from consideration. We present multiple strategies, including a bidirectional approach, to optimize UGV-UAV cooperation for finding a safe path in an uncertain road network. Furthermore, we explore the impact of using multiple UAVs on reducing the UGV's travel time, and evaluate the associated computation time. The proposed strategies are implemented and evaluated on 100 urban road networks. The results demonstrate that the bidirectional strategy achieves the best performance in most instances, and using multiple UAVs further reduces UGV travel time at the expense of increased computation time. This paper presents a robust framework for DUCPP to achieve efficient UGV-UAV cooperation for path planning and inspection, offering practical solutions for navigation in challenging and uncertain conditions.

Abstract:
With the rapid development of the low-altitude economy, accurate detection and localization of UAVs have become increasingly important. Conventional radar and visual detection methods have low accuracy, whereas current radar-camera fusion methods are computationally intensive. To overcome these issues, we propose a novel 3D UAV detection approach based on sparse radar-camera fusion, called SRCF-UAV, to achieve high-precision, low-complexity UAV detection in diverse scenarios. Specifically, we first propose an improved query initialization method that incorporates locations from 2D image proposals and radar point clouds. Then, we propose a query update method that sparsely fuses radar and image queries based on features, velocity, and spatial distance. Furthermore, we develop a radar-camera multimodal data collection platform based on real-time kinematic positioning (RTK) and collect a dataset of centimeter-level precision, comprising over 20,000 UAV instances that cover various scenarios, UAV models, and lighting conditions. Finally, extensive experiments on this dataset demonstrate that the proposed approach can achieve an average precision of up to 91.65% and an inference latency as low as 17 ms, validating its effectiveness and efficiency. The dataset and code will be publicly available to support further research.

Abstract:
Simulating realistic environments for robots is widely recognized as a critical challenge in robot learning, particularly in terms of rendering and physical simulation. This challenge becomes even more pronounced in navigation tasks, where trajectories often extend across multiple rooms or even entire floors. In this work, we present NavGSim, a Gaussian Splatting-based simulator designed to generate high-fidelity, large-scale navigation environments. Built upon a hierarchical 3D Gaussian Splatting framework, NavGSim enables photorealistic rendering in expansive scenes spanning hundreds of square meters. To simulate navigation collisions, we introduce a Gaussian Splatting-based slice technique that directly extracts navigable areas from reconstructed Gaussians. Additionally, for ease of use, we provide comprehensive NavGSim APIs supporting multi-GPU development, including tools for custom scene reconstruction, robot configuration, policy training, and evaluation. To evaluate NavGSim's effectiveness, we train a Vision-Language-Action (VLA) model using trajectories collected from NavGSim and assess its performance in both simulated and real-world environments. Our results demonstrate that NavGSim significantly enhances the VLA model's scene understanding, enabling the policy to handle diverse navigation queries effectively.

Abstract:
Air-dispersed sensor networks deployed from aerial robotic systems (e.g., UAVs) provide a low-cost approach to wide-area environmental monitoring. However, existing methods often rely on active actuators for mid-air shape or trajectory control, increasing both power consumption and system cost. Here, we introduce a passive elastic-folding hinge mechanism that transforms sensors from a flat, stackable form into a three-dimensional structure upon release. Hinges are fabricated by laminating commercial sheet materials with rigid printed circuit boards (PCBs) and programming fold angles through a single oven-heating step, enabling scalable production without specialized equipment. Our geometric model links laminate geometry, hinge mechanics, and resulting fold angle, providing a predictive design methodology for target configurations. Laboratory tests confirmed fold angles between 10 deg and 100 deg, with a standard deviation of 4 deg and high repeatability. Field trials further demonstrated reliable data collection and LoRa transmission during dispersion, while the Horizontal Wind Model (HWM)-based trajectory simulations indicated strong potential for wide-area sensing exceeding 10 km.

Abstract:
Cooperative autonomous driving, which extends vehicle autonomy by enabling real-time collaboration between vehicles and smart roadside infrastructure, remains a challenging yet essential problem. However, none of the existing testbeds employ smart infrastructure equipped with sensing, edge computing, and communication capabilities. To address this gap, we design and implement a 1:15-scale miniature testbed, CIVAT, for validating cooperative autonomous driving, consisting of a scaled urban map, autonomous vehicles with onboard sensors, and smart infrastructure. The proposed testbed integrates V2V and V2I communication with the publish-subscribe pattern through a shared Wi-Fi and ROS2 framework, enabling information exchange between vehicles and infrastructure to realize cooperative driving functionality. As a case study, we validate the system through infrastructure-based perception and intersection management experiments.

Abstract:
End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is a crucial part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) for high-level traffic scene semantic understanding, it remains challenging to effectively translate these conceptual semantic understandings into low-level motion control commands and achieve cross-scene driving generalization and consensus. We propose Sce2DriveX, a human-like chain-of-thought (CoT) driving reasoning MLLM framework, designed to achieve progressive learning from multi-view scene understanding to behavior analysis, motion planning, and vehicle control in the driving process. Sce2DriveX utilizes multimodal joint learning of local scene videos and global Bird's Eye View (BEV) maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its 3D dynamic/static scene perception and reasoning capabilities and achieving cross-scene generalization. Meanwhile, it reconstructs the implicit cognitive chain inherent in human driving, further enhancing the consensus between autonomous driving and human thought. To improve model performance, we construct the first comprehensive Visual Question Answering (VQA) driving instruction dataset, which is tailored for 3D spatial understanding and long-axis task reasoning, and introduce a task-oriented three-stage training pipeline to support supervised fine-tuning. Extensive experiments demonstrate that Sce2DriveX achieves state-of-the-art performance across tasks from scene understanding to end-to-end driving, as well as robust generalization in handling diverse driving scenes on the CARLA Bench2Drive benchmark.

Abstract:
Learning from Demonstration (LfD) is a well-studied field shown to provide robots with fundamental motion skills for a variety of domains. Significant research into various branches of LfD (e.g., learned dynamical systems and movement primitives) can generally be classified into those that learn "time-dependent" or "time-independent" systems. Each paradigm provides fundamental benefits and drawbacks: time-independent methods cannot learn overlapping trajectories, while time-dependence can result in undesirable behavior under perturbation. In this paper, we introduce Cluster Alignment for Learned Motions (CALM), an LfD framework dependent upon an alignment with a representative "mean" trajectory of demonstrated motions rather than pure time- or state-dependence. We also discuss the convergence properties of CALM and introduce an alignment technique able to handle the sudden shifts in alignment possible under perturbation. We show how CALM mitigates the drawbacks of time-dependent and time-independent techniques on 2D datasets and implement our system on a 7-DoF robot learning tasks in three domains.

Abstract:
Efficient learning from demonstration for long-horizon tasks remains an open challenge in robotics. While significant effort has been directed toward learning trajectories, a recent resurgence of object-centric approaches has demonstrated improved sample efficiency, enabling transferable robotic skills. Such approaches model tasks as a sequence of object poses over time. In this work, we propose a scheme for transferring observed object arrangements to novel object instances by learning these arrangements on canonical class frames. We then employ this scheme to enable a simple yet effective approach for training models from as few as five demonstrations to predict arrangements of a wide range of objects including tableware, cutlery, furniture, and desk spaces. We propose a method for optimizing the learned models to enable efficient learning of tasks such as setting a table or tidying up an office with intra-category transfer, even in the presence of distractors. We present extensive experimental results in simulation and on a real robotic system for table setting which, based on human evaluations, scored 73.3% compared to a human baseline. We will make the code and trained models publicly available upon acceptance.

Abstract:
In this letter, we present a novel dual-task, closed-loop, visual servoing-based active vision framework in an eye-in-hand configuration. The proposed active vision framework continuously drives the camera motion by coupling continuous Next-Best-View (NBV) planning and visual servo control within a unified formulation, is NBV-objective-agnostic, and enables real-time, closed-loop exploration of objects. We demonstrate how this approach can be applied to the 3D reconstruction of static volumetric objects. The approach is validated in the real world with a diverse set of relevant objects and we observe that the visual servo scheme produces smooth exploration trajectories that keep the camera focused on the object. We also show that our gradient-based continuous NBV-strategy is highly competitive with baseline strategies that leverage global viewpoint sampling and results in efficient exploration with strong object coverage.

Abstract:
Accurate global localization remains a fundamental challenge in autonomous vehicle navigation, especially in previously unexplored areas lacking prior map information. Traditional methods typically rely on high-definition (HD) maps generated through prior traversals or utilize auxiliary sensors such as a global positioning system (GPS). However, these approaches are often limited by high costs, scalability issues, and decreased reliability in environments where GPS is unavailable. Moreover, prior methods require that both query and reference data originate from the same sensor modality, restricting their ability to generalize across different sensor types. To address these limitations, we propose a novel cross-modal localization framework that enables Light Detection and Ranging (LiDAR)-equipped vehicles to estimate their global pose by leveraging publicly available Street View images. The proposed method leverages a shared embedding space, learned via a weight-sharing Vision Transformer (ViT) encoder, to align heterogeneous sensor modalities, specifically LiDAR intensity images and geo-tagged Street View. The shared embedding space enables cross-modal matching for global localization via place recognition, eliminating the need for prior map construction or sensor calibration. Further, to compensate for heading discrepancies between the two modalities, the framework introduces an equirectangular perspective-n-point (PnP) solver based on patch-level feature correspondences. Our proposed method enables 3-degree-of-freedom (DoF) global localization from a single LiDAR scan and a publicly available Street View. Experiments demonstrate that the proposed method achieves high recall and accurate heading estimation, offering a scalable solution for global localization without r

Abstract:
LiDAR point clouds are essential for autonomous vehicles, but motion distortions from dynamic objects degrade the data quality. While previous work has considered distortions caused by ego motion, distortions caused by other moving objects remain largely overlooked, leading to errors in object shape and position. This distortion is particularly pronounced in high-speed environments such as highways and in multi-LiDAR configurations, a common setup for heavy vehicles. To address this challenge, we introduce HiMo, a pipeline that repurposes scene flow estimation for non-ego motion compensation, correcting the representation of dynamic objects in point clouds. During the development of HiMo, we observed that existing self-supervised scene flow estimators often produce degenerate or inconsistent estimates under high-speed distortion. We further propose SeFlow++, a real-time scene flow estimator that achieves state-of-the-art performance on both scene flow and motion compensation. Since well-established motion distortion metrics are absent in the literature, we introduce two evaluation metrics: compensation accuracy at a point level and shape similarity of objects. We validate HiMo through extensive experiments on Argoverse 2, ZOD, and a newly collected real-world dataset featuring highway driving and multi-LiDAR-equipped heavy vehicles. Our findings show that HiMo improves the geometric consistency and visual fidelity of dynamic objects in LiDAR point clouds, benefiting downstream tasks such as semantic segmentation and 3D detection. See https://kin-zhang.github.io/HiMo for more details.

Abstract:
Aerial Visual Place Recognition (VPR) is critical for Unmanned Aerial Vehicle (UAV) localization, especially in environments with unstable or unavailable GPS signals. While neural network-based VPR methods have become mainstream, they face significant challenges on UAV platforms. Traditional CNN-based VPR models are highly sensitive to image rotation, degrading their performance in aerial-domain environments. Meanwhile, Transformer-based models have high computational complexity, making them less suitable for resource-constrained UAVs. In this letter, we propose a lightweight, rotation-invariant aerial VPR method. Our approach combines a rotation-equivariant backbone network with a rotation-invariant aggregation layer to ensure descriptor consistency across different orientations. Additionally, we propose an unsupervised training strategy that constructs higher-dimensional descriptors to optimize the model, while maintaining the lower descriptor dimensionality during application. Experimental results show that our method outperforms state-of-the-art methods across multiple aerial VPR datasets. The code will be released at https://github.com/cbbhuxx/UltraVPR.

Abstract:
Intra-class terrain differences such as water content directly influence a vehicle's ability to traverse terrain, yet RGB vision systems may fail to distinguish these properties. We argue that expanding the evaluation of a terrain's spectral content beyond red-green-blue channels to the near infrared spectrum provides useful information for such intra-class identification and route planning. However, accurate analysis of this spectral information is highly dependent on ambient illumination. We demonstrate a system architecture to collect and register multi-wavelength, hyperspectral images from a mobile robot and describe an approach to reflectance calibrate cameras under varying illumination conditions. To showcase the practical applications of our system, HYPER DRIVE, we demonstrate the ability to calculate vegetative health indices and soil moisture content from data collected from an off-road mobile robot with greater consistency than at-imager radiance.

Abstract:
We consider the problem of real-time liquid-level estimation for closed-loop robotic pouring. To this end, we propose a fast-slow architecture where a Vision-Language Model handles high-level task reasoning and a sensor-driven fast system provides low-latency feedback. As a first instantiation of the fast system, we present RadarEye, a mmWave radar signal processing pipeline that tracks liquid level during pouring. RadarEye combines (i) AoA-ToF beamforming for liquid surface localization with (ii) a physics-informed tracker that suppresses multipath interference. In real-robot experiments, RadarEye achieves 0.35 cm median error at 0.62 ms per-update latency, outperforming vision and ultrasound baselines.

Abstract:
Estimating the stiffness of Deformable Linear Objects (DLOs) is crucial for robust manipulation. Inferring this hidden property depends heavily on the physical interaction strategy. Through a 1D CNN-based analysis of predefined probing modes, we first demonstrate that boundary constraints and grasp locations drastically alter stiffness identifiability. While fixed-end setups yield highly informative responses, they are rarely practical in unconstrained tasks. Consequently, we move beyond manual heuristics and reframe DLO parameter identification as an active perception problem. We propose a Reinforcement Learning (RL) framework that autonomously learns informative interaction strategies for free cables. By coupling a Proximal Policy Optimization (PPO) agent with a trajectory-aware estimator, the system dynamically excites the DLO to extract stiffness from diverse, stochastic manipulation sequences. Achieving a Mean Absolute Error (MAE) of 0.0192, our approach provides a robust, active paradigm that overcomes the limitations of static probing in unconstrained environments.

Abstract:
Signal Temporal Logic robustness is a common objective for optimal robot control, but its dependence on history limits the robot's decision-making capabilities when used in model predictive control approaches. In this work, we introduce Signal Temporal Logic robustness-to-go, a new quantitative semantics for the logic that isolates the contributions of suffix trajectories. We prove its relationship to formula progression for Metric Temporal Logic, and show that the robustness-to-go depends only on the suffix trajectory and progressed formula. We implement robustness-to-go as the objective in a model predictive control algorithm and use formula progression to efficiently evaluate it online. We test the algorithm in simulation and compare it to model predictive control using other robustness measures. Our experiments show that using robustness-to-go improves performance compared to using traditional robustness.
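
To make the history dependence mentioned above concrete, the following is a minimal sketch of the traditional (space) robustness of two simple formulas over a finite discrete-time trace. It is not the robustness-to-go semantics introduced in the paper; the function names, the atomic predicate `x > threshold`, and the prefix/suffix split are assumptions made only for illustration.

```python
import numpy as np

def rho_always(signal, threshold):
    """Traditional robustness of G(x > threshold): the worst margin over the trace."""
    return float(np.min(np.asarray(signal, dtype=np.float64) - threshold))

def rho_eventually(signal, threshold):
    """Traditional robustness of F(x > threshold): the best margin over the trace."""
    return float(np.max(np.asarray(signal, dtype=np.float64) - threshold))

def rho_always_split(prefix, suffix, threshold):
    """Robustness of G(x > threshold) on prefix + suffix, computed piecewise.

    The min over the whole trace equals the min of the two partial results,
    so a poor margin in the already-executed prefix caps the value no matter
    what the controller does in the suffix.
    """
    return min(rho_always(prefix, threshold), rho_always(suffix, threshold))
```

For example, with an executed prefix `[2.0, 0.5]`, a threshold of `1.0`, and any candidate suffix, `rho_always_split` can never exceed the prefix margin, which is why a measure that isolates the suffix contribution is useful as a model predictive control objective.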

Abstract:
Autonomous racing has attracted significant attention recently, presenting challenges in selecting an optimal controller that operates within the onboard system's computational limits and meets operational constraints such as limited track time and high costs. This paper introduces a Linear Parameter-Varying Model Predictive Controller (LPV-MPC) for lateral control. Implemented on an IAC AV-24, the controller achieved stable performance at speeds exceeding 160 mph (71.5 m/s). We detail the controller design, the methodology for extracting model parameters, and key system-level and implementation considerations. Additionally, we report results from our final race run, providing a comprehensive analysis of both vehicle dynamics and controller performance. A Python implementation of the framework is available at: https://tinyurl.com/LPV-MPC-acados.

Abstract:
For high-performance autonomous manipulation of a payload by a mobile manipulator team, or for collaborative manipulation with a human, robots should be able to discover where other robots are attached to the payload, as well as the payload's mass and inertial properties. In this paper, we describe a method for the robots to autonomously discover this information. The robots cooperatively manipulate the payload, and the twist, twist derivative, and wrench data at their grasp frames are used to estimate the transformation matrices between the grasp frames, the location of the payload's center of mass, and the payload's inertia matrix. The method is validated experimentally with a team of three mobile cobots, or mocobots.

Abstract:
We consider the nonprehensile object transportation task known as the waiter's problem, in which a robot must move an object on a tray from one location to another, when the transported object has uncertain inertial parameters. In contrast to existing approaches that completely ignore uncertainty in the inertia matrix or that only consider small parameter errors, we are interested in pushing the limits of the amount of inertial parameter uncertainty that can be handled. We first show how constraints that are robust to inertial parameter uncertainty can be incorporated into an optimization-based motion planning framework to transport objects while moving quickly. Next, we develop necessary conditions for the inertial parameters to be realizable on a bounding shape based on moment relaxations, allowing us to verify whether a trajectory will violate the constraints for any realizable inertial parameters. Finally, we demonstrate our approach on a mobile manipulator in simulations and real hardware experiments: with our proposed robust constraints, the robot consistently transports a 56 cm tall object with substantial inertial parameter uncertainty in the real world, whereas the baseline approaches drop the object in transit.

Abstract:
This letter introduces a novel semantics-aware inspection planning policy derived through deep reinforcement learning. Reflecting the fact that within autonomous informative path planning missions in unknown environments, it is often only a sparse set of objects of interest that need to be inspected, the method contributes an end-to-end policy that simultaneously performs semantic object visual inspection combined with collision-free navigation. Assuming access only to the instantaneous depth map, the associated segmentation image, the ego-centric local occupancy, and the history of past positions in the robot's neighborhood, the method demonstrates robust generalizability and successful crossing of the sim2real gap. Beyond simulations and extensive comparison studies, the approach is verified in experimental evaluations onboard a flying robot deployed in novel environments with previously unseen semantics and overall geometric configurations.

Abstract:
We introduce a planner designed to guide robot manipulators in stably placing objects within complex scenes. Our proposed method reverses the traditional approach to object placement: our planner selects contact points first and then determines a placement pose that solicits the selected points. This is instead of sampling poses, identifying contact points, and evaluating pose quality. Our algorithm facilitates stability-aware object placement planning, imposing no restrictions on object shape, convexity, or mass density homogeneity, while avoiding combinatorial computational complexity. Our proposed stability heuristic enables our planner to find a solution about 20 times faster when compared to the same algorithm not making use of the heuristic and eight times faster than a state-of-the-art method using the traditional sample-and-evaluate approach. The proposed planner is also more successful in finding stable placements than the five other benchmarked algorithms. Derived from first principles and validated in ten real robot experiments, our approach provides a general and scalable solution to the problem of rigid object placement planning.

Abstract:
The perception of deformable linear objects (DLOs) poses significant challenges in robotic manipulation. Crossovers, mergings, and bifurcations of multiple DLOs complicate the identification of individual DLO physical instances. Furthermore, DLOs are often too large to be captured by a single camera, requiring the stitching of multiple overlapping views. This paper presents CVF-DLO, a cross-visual-field route estimation framework of branched DLOs (BDLOs) laid along physical surfaces such as wire harnesses, based on images from multiple viewpoints and pose-aware cameras. CVF-DLO is applicable to various perception tasks involving DLO-like structures, such as verifying connection accuracy and route consistency in cables and pipes. We propose a DLO instance segmentation method that demonstrates superior performance in handling crossings and bifurcations. The extracted DLO paths are projected onto the designed cable-laying surfaces using the camera pose and scene model. Finally, DLO routes are retrieved by searching within the spatial path domain formed by intersecting visual fields. To validate our method on wiring harnesses and intersections, we use two public DLO datasets and introduce a new BDLO dataset to benchmark against state-of-the-art DLO instance segmentation methods. Additionally, we present a cabin wiring harness dataset to evaluate the performance of the cross-visual-field route estimation. We have released all our source code and datasets (with ground truth) at https://github.com/ForNe-tech/CVF-DLO.

Abstract:
Online mapping and end-to-end (E2E) planning in autonomous driving are still largely sensor-centric, leaving rich map priors (HD/SD vector maps, rasterized SD maps, and satellite imagery) underused due to heterogeneity, pose drift, and inconsistent availability at test time. We present UMPE, a Unified Map Prior Encoder that can ingest any subset of four priors and fuse them with BEV features for both mapping and planning. UMPE has two branches. The vector encoder pre-aligns HD/SD polylines with a frame-wise SE(2) correction, encodes points via multi-frequency sinusoidal features, and produces polyline tokens with confidence scores. BEV queries then apply cross-attention with confidence bias, followed by normalized channel-wise gating to avoid length imbalance and to softly down-weight uncertain sources. The raster encoder shares a ResNet-18 backbone conditioned by FiLM (scaling/shift at every stage), performs SE(2) micro-alignment, and injects priors through zero-initialized residual fusion so the network starts from a do-no-harm baseline and learns to add only useful prior evidence. A vector-then-raster fusion order reflects the inductive bias of geometry first, appearance second. On nuScenes mapping, UMPE lifts MapTRv2 from 61.5 to 67.4 mAP (+5.9) and MapQR from 66.4 to 71.7 mAP (+5.3). On Argoverse2, UMPE adds +4.1 mAP over strong baselines. UMPE is compositional: when trained with all priors, it outperforms single-prior models even when only one prior is available at test time, demonstrating powerset robustness. For E2E planning (VAD backbone, nuScenes), UMPE reduces trajectory error from 0.72 to 0.42 m L2 (avg. -0.30 m) and collision rate from 0.22% to 0.12% (-0.10%), surpassing recent prior-injection methods. These results show that a unified, alignment-aware treatment of heterogeneous map priors yields better mapping and better planning.
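
Two of the building blocks named above, multi-frequency sinusoidal point features and zero-initialized residual fusion, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the frequency schedule (powers of two times pi), the plain linear projection, and all function names and shapes are assumptions chosen only to show the idea.

```python
import numpy as np

def sinusoidal_features(points, num_freqs=4):
    """Encode 2D points with sines and cosines at several frequencies.

    points: array-like of shape (N, 2)
    returns: array of shape (N, 2 * 2 * num_freqs)
    The frequency schedule pi * 2^k is an assumption for illustration.
    """
    points = np.asarray(points, dtype=np.float64)
    freqs = np.pi * 2.0 ** np.arange(num_freqs)      # (F,)
    angles = points[:, :, None] * freqs               # (N, 2, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (N, 2, 2F)
    return feats.reshape(points.shape[0], -1)

def zero_init_residual_fusion(bev, prior, weight):
    """Add a linearly projected prior to BEV features as a residual.

    bev: (N, C), prior: (N, D), weight: (D, C)
    If weight is all zeros (its assumed initial value), the output equals bev,
    so fusion starts from the sensor-only baseline.
    """
    return bev + prior @ weight
```

With `weight = np.zeros((D, C))` the fused output is exactly the BEV input; training would then move the weights away from zero only where the prior helps, which is the do-no-harm behavior the abstract describes.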

Abstract:
Enabling legged robots to perform non-prehensile loco-manipulation is crucial for enhancing their versatility. However, learning behaviors such as whole-body object pushing often necessitates sophisticated planning strategies or extensive task-specific reward shaping. In this work, we present CAIMAN, a practical reinforcement learning framework that encourages the agent to gain control over other entities in the environment. CAIMAN leverages causal action influence as an intrinsic motivation objective, allowing legged robots to efficiently acquire object pushing skills even under sparse task rewards. We employ a hierarchical control strategy, combining a low-level locomotion module with a high-level policy that generates task-relevant velocity commands and is trained to maximize the intrinsic reward. To estimate causal action influence, we learn the dynamics of the environment by integrating a kinematic prior with data collected during training. We empirically demonstrate CAIMAN's superior sample efficiency and adaptability to diverse scenarios in simulation, as well as its successful transfer to real-world systems without further fine-tuning. A video demo is available at https://www.youtube.com/watch?v=dNyvT04Cqaw.

Abstract:
Robust robotic manipulation in the real world requires coping with incomplete or unreliable sensory input. While vision provides rich information, it often fails in the presence of occlusions, clutter, or poor lighting. In such cases, touch offers a robust alternative, enabling object localisation through contact alone. We present a touch-only global localisation method that operates in continuous state space with a particle belief. Sparse contact/no-contact signals are turned into informative likelihoods via a proximity-aware measurement model, and contact-aware resampling mitigates particle starvation. An information-gathering controller selects actions that maximise expected information gain using a non-parametric entropy estimator sensitive to both observation updates and dynamics. On real hardware, the system reliably localises and then grasps from broad, multi-modal initial beliefs with mode separations up to 0.4 m, far beyond the narrow uncertainty ranges assumed in related work. Information-aware localisation-actions speed up belief convergence and boost grasp success; and ablations in simulation confirm the benefits of the measurement and resampling components.

Abstract:
This paper addresses the challenge of developing a realistic urban-driving simulator to accurately model agent behaviors, a crucial component for self-driving car development. Most previous simulators focus on the plausibility of sensor data synthesis, whereas the plausibility of driving behaviors is poorly explored. To tackle this problem, we propose a hierarchical architecture, which comprises (i) a high-level intention simulation summarizing driving scenarios and (ii) a low-level policy trained by reinforcement algorithms to refine plans. Unlike existing simulators, our approach captures diverse behaviors, even sub-optimal ones, vital for robust policy training and evaluation. We also highlight the importance of interactive simulations over static scenarios for realistic policy development. Extensive experiments demonstrate that our approach significantly improves long-term behavior prediction and closed-loop simulation, enhancing the realism and diversity of urban-driving simulations. The videos of this work are available on our project page: https://sites.google.com/ucsd.edu/h-sim/home.

Abstract:
Coordinated multi-arm manipulation requires satisfying multiple simultaneous geometric constraints across high-dimensional configuration spaces, which poses a significant challenge for traditional planning and control methods. In this work, we propose Adaptive Diffusion Constrained Sampling (ADCS), a generative framework that flexibly integrates both equality (e.g., relative and absolute pose constraints) and structured inequality constraints (e.g., proximity to object surfaces) into an energy-based diffusion model. Equality constraints are modeled using dedicated energy networks trained on pose differences in the Lie algebra space, while inequality constraints are represented via Signed Distance Functions (SDFs) and encoded into learned constraint embeddings, allowing the model to reason about complex spatial regions. A key innovation of our method is a Transformer-based architecture that learns to weigh constraint-specific energy functions at inference time, enabling flexible and context-aware constraint integration. Moreover, we adopt a two-stage batch-wise sampling strategy that improves precision and sample diversity by combining Langevin dynamics with resampling and density-aware re-weighting. Experimental results on dual-arm manipulation tasks show that ADCS significantly improves sample diversity and generalization in settings demanding precise coordination and adaptive constraint handling.
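
As a rough illustration of sampling from a weighted sum of constraint energies with Langevin dynamics, the sketch below uses two one-dimensional quadratic energies. The quadratic form, the fixed weights (the abstract describes weights produced by a learned Transformer), and the step size are all assumptions chosen to keep the example small; the resampling and re-weighting stage is not shown.

```python
import math
import random


def grad_energy(x, targets, weights):
    """Gradient of E(x) = sum_i w_i * 0.5 * (x - c_i)^2 (toy constraint energies)."""
    return sum(w * (x - c) for c, w in zip(targets, weights))


def weighted_minimum(targets, weights):
    """Closed-form minimiser of the toy composite energy."""
    return sum(w * c for c, w in zip(targets, weights)) / sum(weights)


def langevin_sample(targets, weights, rng, steps=500, eps=0.01, x0=0.0):
    """Unadjusted Langevin dynamics: gradient step on E plus Gaussian noise."""
    x = x0
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        x += -eps * grad_energy(x, targets, weights) + math.sqrt(2.0 * eps) * noise
    return x


rng = random.Random(0)
targets = [1.0, 3.0]   # hypothetical constraint optima
weights = [3.0, 1.0]   # fixed here; learned per context in the abstract
samples = [langevin_sample(targets, weights, rng) for _ in range(200)]
mean = sum(samples) / len(samples)
```

Raising the weight on one energy pulls the samples towards that constraint's optimum, which is the effect a context-dependent weighting would exploit.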

Abstract:
Effective energy management is essential for maximizing information gathering tasks with networked mobile robots, particularly for large-scale, energy-intensive tasks such as agricultural monitoring and wildfire mapping. This paper presents a novel framework that integrates robots' energy profiles with confidence bounds of their assigned regions to optimize sampling targets. Designed for persistent, long-term deployments, the framework employs Gaussian Process Regression (GPR) to maximize data acquisition and accurately reconstruct unknown spatial distributions (e.g., algae outbreaks or humidity maps). The method enables seamless transitions among exploration (mapping uncertain regions at high energy), exploitation (refining maps at moderate energy levels), and recharging (navigating to charging stations at low energy), thereby achieving energy-balanced informative path planning. Experiments demonstrate the effectiveness of the approach against state-of-the-art methods in generating energy-efficient and distinct paths for heterogeneous robots, delivering up to 32% energy savings while maintaining high reconstruction accuracy. Hardware experiments closely matched the performance in simulation.
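
The three energy-dependent modes named in this abstract can be written as a simple selection rule. The function name and both thresholds below are hypothetical; the abstract does not state how the transitions are decided or how they interact with the GPR confidence bounds.

```python
def select_mode(battery_fraction, explore_above=0.6, recharge_below=0.2):
    """Pick a behaviour mode from the remaining battery fraction in [0, 1].

    Thresholds are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= battery_fraction <= 1.0:
        raise ValueError("battery_fraction must lie in [0, 1]")
    if battery_fraction < recharge_below:
        return "recharge"   # low energy: navigate to a charging station
    if battery_fraction >= explore_above:
        return "explore"    # high energy: map uncertain regions
    return "exploit"        # moderate energy: refine the existing map


modes = [select_mode(b) for b in (0.9, 0.4, 0.1)]
```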

Abstract:
The ability to connect visual observations with human language is increasingly valuable for embodied agents in tasks such as navigation and semantic mapping. The existing visual-language maps (VLMaps) approach enables this connection but typically depends on depth images to project semantic features into 3D space, which limits scalability due to sensor cost and deployment constraints. In this work, we introduce SC-VLMaps, a depth-free visual-language mapping framework that constructs semantic maps using only monocular RGB input. SC-VLMaps leverages a scene coordinate regression (SCR) network to predict dense 3D coordinates from images, bypassing the need for depth supervision and enabling implicit geometry reconstruction. The predicted coordinates are fused into a voxel grid and augmented with language-aligned features from a frozen visual-language encoder, producing maps that are both geometrically coherent and semantically enriched. By employing a multi-scene training strategy, SC-VLMaps generalizes from indoor datasets (7Scenes) to challenging outdoor benchmarks (Cambridge Landmarks). Experiments show that SC-VLMaps achieves denser, more compact maps with stronger semantic alignment than VLMaps, while requiring only monocular RGB images.

Abstract:
Imitation learning (IL) enables efficient skill acquisition from demonstrations but often struggles with long-horizon tasks and high-precision control due to compounding errors. Residual policy learning offers a promising, model-agnostic solution by refining a base policy through closed-loop corrections. However, existing approaches primarily focus on local corrections to the base policy, lacking a global understanding of state evolution, which limits robustness and generalization to unseen scenarios. To address this, we propose incorporating global dynamics modeling to guide residual policy updates. Specifically, we leverage Koopman operator theory to impose linear time-invariant structure in a learned latent space, enabling reliable state transitions and improved extrapolation for long-horizon prediction and unseen environments. We introduce KORR (Koopman-guided Online Residual Refinement), a simple yet effective framework that conditions residual corrections on Koopman-predicted latent states, enabling globally informed and stable action refinement. We evaluate KORR on long-horizon, fine-grained robotic furniture assembly tasks under various perturbations. Results demonstrate consistent gains in performance, robustness, and generalization over strong baselines. Our findings further highlight the potential of Koopman-based modeling to bridge modern learning methods with classical control theory.
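
The linear time-invariant latent structure mentioned in this abstract means that latent states evolve as z_{t+1} = K z_t for a fixed matrix K. The sketch below fits such a K from a trajectory by ordinary least squares and rolls it forward. It uses a hand-picked 2D latent space and noise-free data as assumptions; it does not include the learned encoder or the residual policy that KORR conditions on the predicted latents.

```python
import numpy as np


def fit_koopman(latents):
    """Least-squares fit of K such that latents[t + 1] ~= K @ latents[t]."""
    z, z_next = latents[:-1], latents[1:]
    # lstsq solves z @ A = z_next for A, so K is the transpose of A.
    a, *_ = np.linalg.lstsq(z, z_next, rcond=None)
    return a.T


def rollout(k, z0, horizon):
    """Predict latent states for `horizon` steps by repeated application of K."""
    zs = [z0]
    for _ in range(horizon):
        zs.append(k @ zs[-1])
    return np.stack(zs)


# Hypothetical ground-truth latent dynamics used only to generate data.
k_true = np.array([[0.9, 0.1], [-0.1, 0.9]])
trajectory = rollout(k_true, np.array([1.0, 0.0]), horizon=20)

k_hat = fit_koopman(trajectory)
prediction = rollout(k_hat, trajectory[0], horizon=20)
```

Because the model is linear, a multi-step prediction is just a matrix power applied to the initial latent state, which is what makes long-horizon rollouts cheap.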

Abstract:
Cross-View Geo-Localization (CVGL) localizes a query image via retrieval from georeferenced satellite imagery, yet severe viewpoint variation remains a central challenge. Recent advances often rely on heavy backbones or add-on modules that achieve high accuracy but are impractical on resource-constrained UAVs. To balance accuracy and efficiency, we introduce Cross-Distill, a knowledge-distillation framework for CVGL. Cross-Distill performs Cross-Similarity Ranking Distillation by constructing a teacher-student interaction matrix to enforce ranking consistency and enhance discrimination. Building on this, it introduces Viewpoint Decoupling, which partitions ranking relations into intra-view, intra-to-cross-view, and cross-to-cross-view, enabling precise modeling of cross-view dependencies and improving class compactness and separability. Cross-Distill further employs Multi-Manifold Feature Distillation that jointly enforces angular consistency on the spherical manifold, preserves local distances in Euclidean space, and leverages hyperbolic distance as a negatively curved metric to strengthen teacher-student alignment. Experiments on University-1652 and SUES-200 show that the distilled student achieves significant gains with low complexity (31.43M parameters, 13.09 GFLOPs), and an inference time of only 62.02 ms per image on an RK3588. For instance, on University-1652 UAV→SAT retrieval, R@1 improves from 75.97% to 94.43% and AP from 79.24% to 95.33%.

Abstract:
Assistive robotic devices, like soft lower-limb exoskeletons or exosuits, are widely spreading with the promise of helping people in everyday life. To make such systems adaptive to the variety of users wearing them, it is desirable to endow exosuits with advanced perception systems. However, exosuits have little sensory equipment because they need to be light and easy to wear. This paper presents a perception module based on machine learning that aims at estimating 3 walking modes (i.e., ascending or descending stairs and walking on level ground) of users wearing an exosuit. We tackle this perception problem using only inertial data from two sensors. Our approach provides an estimate for both future and past timesteps that supports control and enables a self-labeling procedure for online model adaptation. Indeed, we show that our estimate can label data acquired online and refine the model for new users. A thorough analysis carried out on real-life datasets shows the effectiveness of our user-tailored perception module. Finally, we integrate our system with the exosuit in a closed-loop controller, validating its performance in an online single-subject experiment.

Abstract:
Vision-Language-Action (VLA) models such as OpenVLA, Octo, and π0 have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of supervised data. Reinforcement learning (RL) therefore provides a promising path for further improving and fine-tuning VLAs through online interaction. However, conventional policy gradient methods are computationally infeasible in the context of flow-matching based models due to the intractability of the importance sampling process, which requires explicit computation of policy ratios. To overcome this limitation, we propose the Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective. Furthermore, FPO achieves stable and scalable online reinforcement fine-tuning of the π0 model by integrating structure-aware credit assignment to enhance gradient efficiency, clipped surrogate objectives to stabilize optimization, multi-step latent exploration to encourage diverse policy updates, and a Q-ensemble mechanism to provide robust value estimation. We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL, and π0-FAST baselines, observing consistent improvements over the imitation prior and strong alternatives with stable learning under sparse rewards. In addition, ablation studies and analyses of the latent space dynamics further highlight the contributions of individual components within FPO, validating the effectiveness of the proposed computational modules and illustrating the stable convergence of the conditional flow-matching objective during online reinforcement learning.

Abstract:
Semantic segmentation networks, which are essential for robotic perception, often suffer from performance degradation when the visual distribution of the deployment environment differs from that of the source dataset on which they were trained. Unsupervised Domain Adaptation (UDA) addresses this challenge by adapting the network to the robot's target environment without external supervision, leveraging the large amounts of data a robot might naturally collect during long-term operation. In such settings, UDA methods can exploit multi-view consistency across the environment's map to fine-tune the model in an unsupervised fashion and mitigate domain shift. However, these approaches remain sensitive to cross-view instance-level inconsistencies. In this work, we propose a method that starts from a volumetric 3D map to generate multi-view consistent pseudo-labels. We then refine these labels using the zero-shot instance segmentation capabilities of a foundation model, enforcing instance-level coherence. The refined annotations serve as supervision for self-supervised fine-tuning, enabling the robot to adapt its perception system at deployment time. Experiments on real-world data demonstrate that our approach consistently improves performance over state-of-the-art UDA baselines based on multi-view consistency, without requiring any ground-truth labels in the target domain.

Abstract:
Reliable onboard perception is critical for quadruped robots navigating dynamic environments, where obstacles can emerge from any direction under strict reaction time constraints. Single-sensor systems face inherent limitations: LiDAR provides omnidirectional coverage but lacks rich texture information, while cameras capture high-resolution detail but suffer from restricted field of view. We introduce APREBot (Active Perception System for Reflexive Evasion Robot), a novel framework that integrates reflexive evasion with active hierarchical perception. APREBot strategically combines LiDAR-based omnidirectional scanning with camera-based active focusing, achieving comprehensive environmental awareness essential for agile obstacle avoidance in quadruped robots. We validate APREBot through extensive Sim2Real experiments on a quadruped platform, evaluating diverse obstacle types, trajectories, and approach directions. Our results demonstrate substantial improvements over strong baselines in both safety metrics and operational efficiency, highlighting APREBot's potential for dependable autonomy in safety-critical scenarios. Paper homepage: https://aprebot-2026.github.io/.

Abstract:
Robotic throwing enables fast and efficient object placement beyond the robot's immediate workspace, but reliable throwing in cluttered environments remains underexplored. Existing approaches, such as TossingBot, learn throwing strategies from visual input but assume obstacle-free settings. In this paper, we address the problem of throwing objects into a target basket while avoiding obstacles placed randomly in the scene. We introduce a potential field state representation that compactly encodes both basket attraction and obstacle repulsion on a fixed-size grid, enabling reinforcement learning (RL) policies to generalize across arbitrary numbers and configurations of obstacles. The policy is initialized from kinesthetic demonstrations and optimized in simulation using three state-of-the-art RL algorithms (SAC, DDPG, TD3). Among these, SAC achieves the most consistent performance across scenarios. We compare the potential field representation (PFR) against explicit state encodings and demonstrate that it achieves higher success rates and better scalability to unseen obstacle configurations. Real-robot experiments with unseen throwable objects confirm robust sim-to-real transfer, achieving up to 90% success in cluttered scenes. These results demonstrate that PFR provides a practical and robust representation for safe and efficient robotic throwing in unstructured environments. A video showcasing our experiments has been attached to the paper as supplementary material.
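
A fixed-size potential field grid of the kind described in this abstract might look like the sketch below. The grid resolution, the linear attraction term, the clipped repulsion term, and all gain values are assumptions for illustration; the abstract does not give the actual field equations.

```python
import numpy as np


def potential_field(basket, obstacles, size=16, k_att=1.0, k_rep=0.5, radius=0.25):
    """Encode a scene on a size x size grid over the unit square.

    Low values attract (near the basket); obstacles add a local repulsive bump.
    The output shape does not depend on how many obstacles are present.
    """
    xs = np.linspace(0.0, 1.0, size)
    gx, gy = np.meshgrid(xs, xs, indexing="ij")
    # Attraction: potential grows with distance from the basket.
    field = k_att * np.hypot(gx - basket[0], gy - basket[1])
    for ox, oy in obstacles:
        d = np.hypot(gx - ox, gy - oy)
        # Repulsion: positive only within `radius` of the obstacle centre.
        field += k_rep * np.clip(1.0 - d / radius, 0.0, None)
    return field


basket = (1.0, 0.2)
empty_scene = potential_field(basket, obstacles=[])
cluttered_scene = potential_field(
    basket, obstacles=[(0.5, 0.5), (0.3, 0.8), (0.7, 0.1)]
)
best_cell = np.unravel_index(empty_scene.argmin(), empty_scene.shape)
```

Since both scenes map to arrays of the same shape, a policy network with a fixed input size can consume either one, which is the property that lets the representation scale to arbitrary obstacle counts.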

Abstract:
Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.

Abstract:
Many automated manufacturing processes rely on industrial robot arms to move process-specific tools along workpiece surfaces. In applications like grinding, sanding, spray painting, or inspection, they need to cover a workpiece fully while keeping their tools perpendicular to its surface. While there are approaches to generate trajectories for these applications, there are no sufficient methods for analyzing the feasibility of full surface coverage. This work proposes a sampling-based approach for continuous coverage estimation that explores reachable surface regions in the configuration space. We define an extended ambient configuration space that allows for the representation of tool position and orientation constraints. A continuation-based approach is used to explore it using two different sampling strategies. A thorough evaluation across different kinematics and environments analyzes their runtime and efficiency. This validates our ability to accurately and efficiently calculate surface coverage for complex surfaces in complicated environments.

Abstract:
Robot manipulation research still suffers from significant data scarcity: even the largest robot datasets are orders of magnitude smaller and less diverse than those that fueled recent breakthroughs in language and vision. We introduce Masquerade, a method that edits in-the-wild egocentric human videos to bridge the visual embodiment gap between humans and robots and then learns a robot policy with these edited videos. Our pipeline turns each human video into "robotized" demonstrations by (i) estimating 3D hand poses, (ii) inpainting the human arms, and (iii) overlaying a rendered bimanual robot that tracks the recovered end-effector trajectories. We pre-train a visual encoder to predict future 2-D robot keypoints on 675K frames of these edited clips. We continue that auxiliary loss while fine-tuning a diffusion-policy head on only 50 robot demonstrations per task. This yields policies that generalize significantly better than prior work. On three long-horizon, bimanual kitchen tasks evaluated in three unseen scenes each, Masquerade outperforms baselines by 5-6×. Ablations show that both the robot overlay and co-training are indispensable, and performance scales logarithmically with the amount of edited human video. These results demonstrate that explicitly closing the visual embodiment gap unlocks a vast, readily available source of data from human videos that can be used to improve robot policies.

Abstract:
Autonomous vehicles (AVs) require extensive testing in simulation, but test case generation for driving scenarios is laborious. The desired scenarios are often out-of-distribution and have precise requirements on interactions with the AV policy under test. Manually programming scenarios allows for precise controllability but is difficult to scale. On the other hand, statistical models can leverage compute and data, but struggle with precise controllability when out-of-distribution. We cast scenario orchestration as a constraint-solving problem and present a language-in, simulation-out scenario orchestrator for closed-loop testing of AVs. Our approach leverages foundation model reasoning to translate general, natural language descriptions into a set of constraints as a scenario representation. This then allows us to leverage off-the-shelf solvers to solve for actor behaviors which meet precise testing intentions in closed-loop. Under a benchmark of carefully crafted and diverse scenario descriptions, our approach greatly outperforms our baselines in orchestration success rate. We further show that our closed-loop approach is especially important for scenarios which require ego-reactive specifications.

Abstract:
The completion, extension, and generation of 3D semantic scenes are an interrelated set of capabilities that are useful for robotic navigation and exploration. Existing approaches seek to decouple these problems and solve them one-off. Additionally, these approaches are often domain-specific, requiring separate models for different data distributions, e.g., indoor vs. outdoor scenes. To unify these techniques and provide cross-domain compatibility, we develop a single framework that can perform scene completion, extension, and generation in both indoor and outdoor scenes, which we term Octree Latent Semantic Diffusion. Our approach operates directly on an efficient dual octree graph latent representation: a hierarchical, sparse, and memory-efficient occupancy structure. This technique disentangles synthesis into two stages: (i) structure diffusion, which predicts binary split signals to construct a coarse occupancy octree, and (ii) latent semantic diffusion, which generates semantic embeddings decoded by a graph VAE into voxel-level semantic labels. To perform semantic scene completion or extension, our model leverages inference-time latent inpainting or outpainting, respectively. These inference-time methods use partial LiDAR scans or maps to condition generation, without the need for retraining or finetuning. We demonstrate high-quality structure, coherent semantics, and robust completion from single LiDAR scans, as well as zero-shot generalization to out-of-distribution LiDAR data. These results indicate that completion-through-generation in a dual octree graph latent space is a practical and scalable alternative to regression-based pipelines for real-world robotic perception tasks.

Abstract:
Robotic disinfection can relieve human operators from repetitive, labor-intensive tasks while reducing the risk of pathogen transmission in public spaces. Recent advances in learning-based methods further enhance these systems by enabling robust dynamic task planning and the interpretation of ambiguous instructions. However, disinfection task planning remains a four-dimensional (interaction, logic, spatial, and temporal) problem that requires expert knowledge, and robust task planning for autonomous disinfection in dynamic environments remains challenging. This paper proposes a novel framework that integrates the Generative Adversarial Trimodel (GAT) method with an embodied framework to solve the four-dimensional problem in dynamic environments. The GAT method injects expert knowledge and iteratively refines neural network-generated plans against an analytical model (AM), driving dual convergence and reducing logic, spatial, and temporal errors. By combining the embodied framework and the GAT method into a GAT-enhanced embodied framework, the robot system autonomously perceives objects of unknown shape and pose, plans long-horizon task sequences, and executes disinfection operations. Experimental results show a higher success rate and lower average task time and rule violation rates compared with non-GAT methods, demonstrating improved robustness and efficiency in dynamic environments.

Abstract:
Vision-based tactile sensors (VBTS) recover high-resolution contact geometry but typically rely on opaque elastomer layers that prevent visual transparency, while RGB-D cameras provide global depth perception yet degrade significantly at close range. To address this limitation, we present TransTac, a transparent ultraviolet (UV)-encoded binocular VBTS that integrates visual observation and marker-based tactile reconstruction within a single compact device. The system employs a transparent elastomer embedded with UV-reflective markers and a prior-guided Delaunay stereo matching algorithm for robust sparse triangulation. To reliably detect densely distributed semitransparent markers, we develop a lightweight detector that enables stable localization under contact and deformation. The proposed prior-guided Delaunay matching improves correspondence robustness by approximately 21% compared with global assignment baselines while maintaining high reconstruction accuracy. In semantic evaluation, TransTac achieves up to 83.3% zero-shot recognition accuracy on tactile images, exceeding opaque tactile baselines by approximately 50 percentage points. Embedding analysis further reveals substantially stronger cross-modal alignment with natural images, with class-center similarity increasing from around 0.2 to over 0.77. Controlled near-distance experiments quantify the degradation of RGB-D depth reliability and demonstrate extended geometric coverage enabled by visuo-tactile integration. Finally, a compact prototype is implemented with an approximate hardware cost of 70. Code and hardware design are publicly available at https://github.com/87361/TransTac.

Abstract:
The growing interest in exploring other planets calls for innovative robotic systems capable of deploying to and traversing challenging space environments. While wheeled rovers have traditionally fulfilled this role, they face limitations, including configuration dependence (e.g., requiring an upright orientation), susceptibility to impacts, and difficulty overcoming obstacles larger than their wheel radius. Tensegrity-based robotics presents a promising alternative for future rovers. These lightweight, compliant structures offer compactibility, adjustable stiffness, and the ability to absorb impacts without damage. Moreover, their unique form factor naturally protects scientific payloads. Recent research has explored tensegrity robots for rolling-based locomotion, with increasing interest in leveraging their structures for jumping-based movement. However, achieving hardware capable of high jumps greater than the robot's body length (BL) and directional jumping control for steerable jumping remains a challenge. This work introduces a tensegrity robot that utilizes structural deformation for jumping locomotion. Through first-principles analyses, simulations, laboratory experiments, and field tests in a planetary analog environment, we demonstrate a robot capable of vertical jumps of 1.18 m (1.93 BLs), directional jumps covering horizontal distances up to 0.59 m (0.97 BLs), and surviving falls from heights of 21.5 m (35.2 BLs).

Abstract:
Cooperative autonomous robotic systems have significant potential for executing complex multi-task missions across space, air, ground, and maritime domains. But they commonly operate in remote, dynamic and hazardous environments, requiring rapid in-mission adaptation without relying on fragile or slow communication links to centralized compute. Fast, onboard replanning algorithms are therefore essential to enhance resilience for these systems, but do not yet exist. Reinforcement Learning (RL) shows strong promise for efficiently solving mission planning tasks formulated as Travelling Salesperson Problems (TSPs), but existing methods: 1) are unsuitable for replanning, where agents do not start at a single location; 2) do not allow cooperation between agents; 3) are unable to model tasks with variable durations; or 4) lack practical considerations for onboard deployment. Here we address this gap by defining the Cooperative Mission Replanning Problem as a novel adaptation of multiple TSP, and develop a new encoder/decoder-based RL model to solve it effectively and efficiently. Using a simple example of cooperative drones, we show our replanner consistently (90% of the time) maintains performance within 10% of the state-of-the-art LKH3 heuristic solver, whilst running 85-370 times faster on a Raspberry Pi. This work paves the way for increased resilience in autonomous multi-agent systems.

Abstract:
Dynamic SLAM methods jointly estimate the static and dynamic scene components. However, existing approaches, while accurate, are computationally expensive and unsuitable for online applications. In this work, we present a novel factor-graph formulation and system architecture for Dynamic SLAM that inherently supports incremental optimisation and online estimation. This represents the first formulation explicitly designed to leverage incremental inference methods in the dynamic setting. On multiple datasets, we demonstrate that our method achieves camera pose and object motion accuracy equal to or better than the state of the art. We further analyse the structural properties of our approach to demonstrate its scalability and provide insight regarding the challenges of solving Dynamic SLAM incrementally. Finally, we show that our formulation leads to a problem structure well-suited to incremental solvers, and our system architecture further enhances performance, achieving a 5x speed-up over existing methods. Code is open-sourced.

Abstract:
We propose MoRe-ERL, a framework that combines Episodic Reinforcement Learning (ERL) and residual learning, which refines preplanned reference trajectories into safe, feasible, and efficient task-specific trajectories. This framework is general enough to incorporate into arbitrary ERL methods and motion generators seamlessly. MoRe-ERL identifies trajectory segments requiring modification while preserving critical task-related maneuvers. Then it generates smooth residual adjustments using B-Spline-based movement primitives to ensure adaptability to dynamic task contexts and smoothness in trajectory refinement. Experimental results demonstrate that residual learning significantly outperforms training from scratch using ERL methods, achieving superior sample efficiency and task performance. Hardware evaluations further validate the framework, showing that policies trained in simulation can be directly deployed in real-world systems, exhibiting a minimal sim-to-real gap.

Abstract:
Query-based multi-view 3D object detectors typically rely on a fixed set of learnable queries that jointly predict object categories and locations. However, encoding both semantic and geometric information within a shared query embedding leads to representational conflicts, limiting optimization. While prior works decouple prediction heads to partially address this issue, such decoupling often treats classification and localization as independent tasks, leaving the queries themselves class-agnostic and unaware of the scene's semantic context. In this paper, we present the first 3D object detection framework that constructs class-aware queries using scene-level object class predictions. Specifically, a multi-view image classifier first estimates which object classes are present in the scene, and these predictions are used to generate semantically guided queries for 3D localization within the transformer decoder. This allows our model to initialize each query with class-specific priors, in contrast to conventional uniform query initialization. As a result, queries attend more effectively to relevant regions and objects throughout decoding. Experiments on the nuScenes benchmark show that our method improves mAP by 2.1 points and NDS by 0.9 points over a strong DETR-based baseline. An oracle study further reveals that classification accuracy is a key bottleneck in existing DETR-style detectors, highlighting the benefit of early semantic guidance.

Abstract:
Underwater environments pose significant challenges for visual Simultaneous Localization and Mapping (SLAM) systems due to limited visibility, inadequate illumination, and sporadic loss of structural features in images. Addressing these challenges, this paper introduces a novel, tightly-coupled Acoustic-Visual-Inertial SLAM approach, termed AQUA-SLAM, to fuse a Doppler Velocity Log (DVL), a stereo camera, and an Inertial Measurement Unit (IMU) within a graph optimization framework. Moreover, we propose an efficient sensor calibration technique, encompassing multi-sensor extrinsic calibration (among the DVL, camera and IMU) and DVL transducer misalignment calibration, with a fast linear approximation procedure for real-time online execution. The proposed methods are extensively evaluated in a tank environment with ground truth, and validated for offshore applications in the North Sea. The results demonstrate that our method surpasses current state-of-the-art underwater and visual-inertial SLAM systems in terms of localization accuracy and robustness. The proposed system will be made open-source for the community.

Abstract:
Microplastics that accumulate at the air-water interface pose urgent ecological and health risks. However, existing sampling and collection methods based on surface trawls are hindered by hydrodynamic resistance. We present the first in-motion characterization of an interfacial pump mounted on a small uncrewed surface vehicle (USV) to actively draw surface water into an onboard filter. Experiments combining thruster-driven forward motion with the undulating pump show that low thruster output and moderate pumping frequency maximize particles captured per unit energy, balancing the ram effect of forward speed with the lateral suction of the pump. Scaled towing tests reveal that the pontoon cross-section strongly influences intake flow, indicating that streamlined profiles can further boost filtration efficiency. Finally, flow visualization confirms that the pump's ability to generate far-field suction without bulk mixing (previously demonstrated only in static tests) persists while the USV is in motion. These results establish interfacial pumping as a promising bio-inspired strategy for manual and autonomous microplastics collection, and highlight design parameters that can guide future development of distributed, high-coverage sampling platforms.

Abstract:
Multi-Robot Systems (MRS) in GPS-denied environments such as indoor spaces, subterranean areas, and urban canyons face the dual challenge of localizing themselves while performing informative path planning (IPP) to model unknown spatial fields. Current IPP methods rely heavily on GPS for localization, limiting their applicability in GPS-denied settings, while existing approaches addressing observation uncertainty fail to account for localization uncertainty that degrades mapping accuracy. This paper presents Anchor-Oriented IPP (AO-IPP), a framework that coordinates robot teams through relative positioning using Access Points and uncertainty-driven transitions between three phases: anchor point localization, informative sampling for field estimation, and spatial coverage optimization. Each robot maintains dual Gaussian Process models with transitions driven by uncertainty levels rather than fixed time schedules. Extensive simulations and real-world experiments demonstrate that AO-IPP achieves performance comparable to GPS-based IPP algorithms while outperforming existing methods in balancing IPP and coverage objectives by up to 54%. The approach exhibits sublinear regret bounds and enables autonomous coordination in challenging environments previously inaccessible to traditional IPP methods, providing a robust solution for environmental monitoring, exploration, and mapping applications requiring both accurate field estimation and comprehensive spatial coverage.

Abstract:
Stable and accurate tracking is essential for marine robotics, yet Global Navigation Satellite System (GNSS) signals vanish immediately below the sea surface. Traditional alternatives suffer from error accumulation, high computational demands, or infrastructure dependence. In this work, we present a multi-drone GNSS-based tracking system for surface and near-surface marine robots. Our approach combines efficient visual detection, lightweight multi-object tracking, GNSS-based triangulation, and a confidence-weighted Extended Kalman Filter (EKF) to provide stable GNSS estimation in real time. We further introduce a cross-drone tracking ID alignment algorithm that enforces global consistency across views, enabling robust multi-robot tracking with cooperative aerial coverage. We validate our system in diversified complex settings to show the accuracy and robustness of the proposed algorithm.
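
As an illustration of the confidence-weighted filtering idea above, the following is a minimal sketch of a single scalar Kalman measurement update in which the measurement variance is inflated when detection confidence is low. It is a linear, one-dimensional simplification and not the paper's EKF; the function name and the weighting rule `base_r / confidence` are assumptions made for this example.

```python
def confidence_weighted_update(x, p, z, base_r, confidence):
    """One scalar Kalman measurement update with confidence-scaled noise.

    x, p       : prior state estimate and its variance
    z          : measurement (e.g., one triangulated coordinate)
    base_r     : nominal measurement variance
    confidence : detection confidence in (0, 1]; lower means noisier
    """
    if not 0.0 < confidence <= 1.0:
        raise ValueError("confidence must be in (0, 1]")
    r = base_r / confidence      # hypothetical weighting rule (assumption)
    k = p / (p + r)              # Kalman gain
    x_new = x + k * (z - x)      # pull the estimate toward the measurement
    p_new = (1.0 - k) * p        # posterior variance shrinks with the gain
    return x_new, p_new
```

With this rule, a low-confidence detection moves the estimate less and leaves more uncertainty in the state than a high-confidence one.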

Abstract:
In this paper, we present an in-row and under-canopy autonomous navigation system for cornfields, called the Purdue Agricultural Navigation System or P-AgNav. Our navigation framework is primarily based on range view images from a 3D light detection and ranging (LiDAR) sensor. P-AgNav is designed for an autonomous robot to navigate in the corn rows with collision avoidance and to switch between rows without GNSS assistance or pre-defined waypoints. The system enables robots, which are intended to monitor crops or conduct physical sampling, to autonomously navigate multiple crop rows with minimal human intervention, thereby increasing crop management efficiency. The capabilities of P-AgNav have been validated through experiments in both simulation and real cornfield environments.

Abstract:
Electromagnetic Navigation Systems can be used to remotely guide medical devices such as magnetic catheters or guidewires, holding potential in a variety of minimally invasive surgical applications. This paper introduces a method to simultaneously actuate and localize a tethered magnetic device with embedded sensor pickup coils using a single system. Six-degree-of-freedom localization is achieved by driving the electromagnets of the Electromagnetic Navigation System with mutually orthogonal pulse-width-modulated voltages of different frequencies. The method is demonstrated using a human-scale system composed of three electromagnets to actuate and localize a magnetic catheter prototype with pickup coils embedded at its tip. In this case, the pose is estimated at a rate of 77 Hz, with a typical mean accuracy below 2 mm in position and 2 degrees in orientation.
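
To show why mutually orthogonal drive signals make the contribution of each electromagnet separable, the sketch below mixes three square-wave carriers and recovers their gains by correlation. It is an idealized illustration that ignores coil dynamics and noise; the carrier frequencies (1, 2, and 4 cycles per window), the window length, and the gains are assumptions chosen for the example, not values from the paper.

```python
import math

def square_wave(cycles, n_samples):
    """+/-1 square wave with `cycles` periods over the window, sampled mid-bin."""
    return [1.0 if math.sin(2 * math.pi * cycles * (n + 0.5) / n_samples) >= 0 else -1.0
            for n in range(n_samples)]

def demodulate(signal, carriers):
    """Recover each carrier's gain by correlating the signal with that carrier."""
    n = len(signal)
    return [sum(s * c for s, c in zip(signal, carrier)) / n for carrier in carriers]

N = 64
# Square waves at 1, 2 and 4 cycles per window are mutually orthogonal.
carriers = [square_wave(c, N) for c in (1, 2, 4)]
true_gains = [0.5, -1.25, 2.0]   # hypothetical coupling coefficients
# One pickup signal is the superposition of all three scaled carriers.
pickup = [sum(g * c[n] for g, c in zip(true_gains, carriers)) for n in range(N)]
recovered = demodulate(pickup, carriers)
```

Doubling frequencies are used here because square waves at, for example, 1 and 3 cycles per window are not orthogonal: the slower wave contains a third harmonic.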

Abstract:
Legged locomotion demands controllers that are both robust and adaptable, while remaining compatible with task and safety considerations. However, model-free reinforcement learning (RL) methods often yield a fixed policy that can be difficult to adapt to new behaviors at test time. In contrast, Model Predictive Control (MPC) provides a natural approach to flexible behavior synthesis by incorporating different objectives and constraints directly into its optimization process. However, classical MPC relies on accurate dynamics models, which are often difficult to obtain in complex environments and typically require simplifying assumptions. We present Diffusion-MPC, which leverages a learned generative diffusion model as an approximate dynamics prior for planning, enabling flexible test-time adaptation through reward- and constraint-based optimization. Diffusion-MPC jointly predicts future states and actions; at each reverse step, we incorporate reward planning and impose feasibility projection, yielding trajectories that satisfy task objectives while remaining within physical limits. To obtain a planning model that adapts beyond imitation pretraining, we introduce an interactive training algorithm for diffusion-based planners: we execute our reward-and-constraint planner in the environment, then filter and reweight the collected trajectories by their realized returns before updating the denoiser. Our design enables strong test-time adaptability, allowing the planner to adjust to new reward specifications without retraining. We validate Diffusion-MPC in the real world, demonstrating strong locomotion and flexible adaptation.

Abstract:
Inertial odometry (IO) is an attractive approach for consumer-grade localization. However, existing data-driven IO methods often suffer from significant drift under complex nonlinear motion patterns (e.g., turns), as they struggle to capture the nonlinear relationships between Inertial Measurement Unit (IMU) signals and motion states. To address this issue, we propose a lightweight IO model, StarIO. Specifically, we first apply the Star Operation to project IMU signals into a high-dimensional implicit nonlinear feature space, enabling effective extraction of the complex nonlinear motion characteristics that typically cause drift. We then capture contextual dependencies across both the temporal and channel dimensions to enhance trajectory estimation over long sequences. In addition, we introduce a multi-scale gated unit that fuses fine-grained local motion dynamics with contextual information to achieve a comprehensive representation of motion. Extensive experiments on six representative open-source datasets demonstrate that StarIO achieves a superior trade-off between model lightweightness and localization accuracy. For example, on the RoNIN dataset, our approach reduces the ATE by 5.21% compared to R-ResNet while using only 2.762M parameters.

Abstract:
Humans can flexibly interpret and compose different goal specifications, such as language instructions, spatial coordinates, or visual references, when navigating to a destination. In contrast, most existing robotic navigation policies are trained on a single modality, limiting their adaptability to real-world scenarios where different forms of goal specification are natural and complementary. In this work, we present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action (VLA) backbone and trains with three primary goal modalities: 2D poses, egocentric images, and natural language, as well as their combinations, through a randomized modality fusion strategy. This design not only expands the pool of usable datasets but also encourages the policy to develop richer geometric, semantic, and visual representations. The resulting model, OmniVLA, achieves strong generalization to unseen environments, robustness to scarce modalities, and the ability to follow novel natural language instructions. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks. We believe OmniVLA provides a step toward broadly generalizable and flexible navigation policies, and a scalable path for building omni-modal robotic foundation models.

Abstract:
End-to-end (E2E) autonomous driving has emerged as a promising paradigm with the pervasive power of model architectures and the availability of large-scale driving datasets. Despite tremendous efforts in recent research, most E2E driving frameworks rely on rather general driving commands, such as "Go Straight" or "Turn Left", which fail to encapsulate the complexities of nuanced driving behaviors and lead to possible semantic ambiguities. Furthermore, such commands are not adequately translated into specific goal locations, which severely limits the planner's capacity to make informed, long-term decisions. This limitation hinders the integration of near-term trajectory planning with long-term goal achievement. To tackle these challenges, we propose the Goal-Driven Planner (GDP), accommodating an appealing plug-and-play feature, which particularly leverages explicit goal points and incorporates two complementary learning objectives: (i) predicting a scene-aware long-term route to the goal, and (ii) refining the near-term trajectory through interaction with the long-term routing. Extensive experiments conducted on the nuScenes and NAVSIM datasets showcase the effectiveness of GDP. When integrated into off-the-shelf E2E autonomous driving frameworks like UniAD, VAD-Tiny, and DiffusionDrive, GDP decreases L2 errors and collision rates, and also improves closed-loop metrics in the open-loop evaluation. Essentially, these results highlight the strong generalization capability of GDP and its promising practical significance in enhancing planning reliability and safety in real-world autonomous driving systems.

Abstract:
In this work, we present CollabVLA, a self-reflective vision-language-action framework that transforms a standard visuomotor policy into a collaborative assistant. CollabVLA tackles key limitations of prior VLAs, including domain overfitting, non-interpretable reasoning, and the high latency of auxiliary world models, by integrating VLM-based reflective reasoning with diffusion-based action generation under a mixture-of-experts design. Through a two-stage training recipe of action grounding and reflection tuning, it supports explicit self-reflection and proactively solicits human guidance when confronted with uncertainty or repeated failure. It cuts normalized Time by ~2× and Dream counts by ~4× vs. explicit-reasoning agents, achieving higher success rates, improved interpretability, and balanced low latency compared with existing methods. This work takes a pioneering step toward shifting VLAs from opaque controllers to genuinely assistive agents capable of reasoning, acting, and collaborating with humans.

Abstract:
Adversarial Imitation Learning (AIL) is a prominent paradigm in imitation learning that enables policy acquisition from expert demonstrations without relying on manually crafted reward functions. Although AIL has achieved promising results in certain scenarios, many existing methods suffer from mode collapse and training instability when expert demonstrations are limited. Given that agent-environment interactions are often abundant, we focus on effectively leveraging such interaction data to address the above challenges. In this paper, we propose a novel adversarial imitation learning framework called Exploration-Driven Adversarial Imitation Learning (EDAIL). First, we introduce exploratory policies that augment the discriminator's training data with high-confidence state-action pairs generated by the agent, thereby improving coverage of the solution space under sparse expert data. Second, we design an asymmetric surrogate reward function that shifts the reward-penalty boundary to mitigate discriminator bias caused by class imbalance, enabling more reliable policy optimization. We evaluate our method on six simulated tasks, including robotic manipulation, locomotion, and navigation, using only 1% and 10% of the datasets employed in prior baselines as expert demonstrations. Experimental results show that our method outperforms the baselines, demonstrating both the effectiveness and robustness of our method. In particular, it achieves a success rate of 94% on the FetchPush task using only 1% of expert demonstrations, representing an absolute improvement of 19 points over the state-of-the-art method. Our code will be available at https://github.com/lipengcheng-nudt/EDAIL.

Affiliations: SKL-MAIS, Institute of Automation, Chinese Academy of Sciences; Southeast University; Fudan University; Peking University; University of Chinese Academy of Sciences; Beijing University of Posts and Telecommunications; China University of Mining and Technology; East China University of Science and Technology; University of Science and Technology of China; Beijing Academy of Artificial Intelligence (BAAI); Imperial College London

Abstract:
In recent years, Multimodal Large Language Models (MLLMs) have demonstrated the ability to serve as high-level planners, enabling robots to follow complex human instructions. However, their effectiveness, especially in long-horizon tasks involving dual-arm humanoid robots, remains limited. This limitation arises from two main challenges: (i) the absence of simulation platforms that systematically support task evaluation and data collection for humanoid robots, and (ii) the insufficient embodiment awareness of current MLLMs, which hinders reasoning about dual-arm selection logic and body positions during planning. To address these issues, we present DualTHOR, a new dual-arm humanoid simulator with continuous transition and a contingency mechanism. Building on this platform, we propose Proprio-MLLM, a model that enhances embodiment awareness by incorporating proprioceptive information with motion-based position embedding and a cross-spatial encoder. Experiments show that, while existing MLLMs struggle in this environment, Proprio-MLLM achieves an average improvement of 19.75% in planning performance. Our work provides both an essential simulation platform and an effective model to advance embodied intelligence in humanoid robotics.

Abstract:
This paper presents SHARP (Supercomputing for High-speed Avoidance and Reactive Planning), a proof-of-concept study demonstrating how high-performance computing (HPC) can enable millisecond-scale responsiveness in robotic control. While modern robots face increasing demands for reactivity in human-robot shared workspaces, onboard processors are constrained by size, power, and cost. Offloading to HPC offers massive parallelism for trajectory planning, but its feasibility for real-time robotics remains uncertain due to network latency and jitter. We evaluate SHARP in a stress-test scenario where a 7-DOF manipulator must dodge high-speed foam projectiles. Using a hash-distributed multi-goal A* search implemented with MPI on both local and remote HPC clusters, the system achieves mean planning latencies of 22.9 ms (local) and 30.0 ms (remote, 300 km away), with avoidance success rates of 84% and 88%, respectively. These results show that when round-trip latency remains within the tens-of-milliseconds regime, HPC-side computation is no longer the bottleneck, enabling avoidance well below human reaction times. The SHARP results motivate hybrid control architectures: low-level reflexes remain onboard for safety, while bursty, high-throughput planning tasks are offloaded to HPC for scalability. By reporting per-stage timing and success rates, this study provides a reproducible template for assessing the real-time feasibility of HPC-driven robotics. Collectively, SHARP reframes HPC offloading as a viable pathway toward dependable, reactive robots in dynamic environments.
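
The core idea behind hash-distributed search can be sketched briefly: each search state is owned by exactly one process, chosen by hashing the state, and newly generated successors are routed to their owners. The sketch below assumes that common scheme and shows only the ownership and routing step, without MPI or the search itself; `owner_rank` and `route_successors` are hypothetical helper names, not part of SHARP.

```python
import zlib

def owner_rank(state, num_ranks):
    """Map a search state to the rank that owns it, via a stable hash."""
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    key = repr(state).encode("utf-8")
    return zlib.crc32(key) % num_ranks

def route_successors(successors, num_ranks):
    """Group successor states into per-rank outboxes for sending."""
    outboxes = {rank: [] for rank in range(num_ranks)}
    for state in successors:
        outboxes[owner_rank(state, num_ranks)].append(state)
    return outboxes
```

Because ownership depends only on the state, duplicate detection stays local to one process and no shared open list is needed.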

Abstract:
Precise robot manipulation is critical for fine-grained applications such as chemical and biological experiments, where even small errors (e.g., reagent spillage) can invalidate an entire task. Existing approaches often rely on pre-collected expert demonstrations and train policies via imitation learning (IL) or offline reinforcement learning (RL). However, obtaining high-quality demonstrations for precision tasks is difficult and time-consuming, while offline RL commonly suffers from distribution shifts and low data efficiency. We introduce a Role-Model Reinforcement Learning (RM-RL) framework that unifies online and offline training in real-world environments. The key idea is a role-model strategy that automatically generates labels for online training data using approximately optimal actions, eliminating the need for human demonstrations. RM-RL reformulates policy learning as supervised training, reducing instability from distribution mismatch and improving efficiency. A hybrid training scheme further leverages online role-model data for offline reuse, enhancing data efficiency through repeated sampling. Extensive experiments show that RM-RL converges faster and more stably than existing RL methods, yielding significant gains in real-world manipulation: 53% improvement in translation accuracy and 20% in rotation accuracy. Finally, we demonstrate the successful execution of a challenging task, precisely placing a cell plate onto a shelf, highlighting the framework's effectiveness where prior methods fail.

Abstract:
Locomotion robots with active or passive compliance can show robustness to uncertain scenarios, which can be promising for agricultural, research and environmental industries. However, state estimation for these robots is challenging due to the lack of rigid-body assumptions and kinematic changes from morphing. We propose a method to estimate typical rigid-body states alongside compliance-related states, such as soft robot shape in different morphologies and locomotion modes. Our neural network-based state estimator uses a history of states and a mechanism to directly influence unreliable sensors. We test our framework on the GOAT platform, a robot capable of passive compliance and active morphing for extreme outdoor terrain. The network is trained on motion capture data in a novel compliance-centric frame that accounts for morphing-related states. Our method predicts shape-related measurements within 4.2% of the robot's size, velocities within 6.3% and 2.4% of the top linear and angular speeds, respectively, and orientation within 1.5°. We also demonstrate a 300% increase in travel range during a motor malfunction when using our estimator for closed-loop autonomous outdoor operation.

Abstract:
Point cloud registration, which aligns multiple datasets into a unified coordinate system, is critical for mobile applications such as 3D SLAM and autonomous driving. Among existing methods, Iterative Closest Point (ICP) remains a widely used method for rigid registration due to its robustness and simplicity. However, its performance on mobile platforms is hindered by iterative computations and limited memory resources. This paper proposes a high-performance ICP registration framework implemented on FPGA. Building upon an efficient GPU-based method named VAN-ICP, our FPGA-based ICP accelerator achieves greater memory efficiency and faster processing speed, making it ideal for resource-constrained mobile platforms. Experimental results demonstrate a speedup of over 1.5× compared to mobile GPU-based implementations and a 99% reduction in memory usage, validating the effectiveness of the proposed approach for real-world point cloud registration on edge platforms. Beyond these improvements, the proposed framework also facilitates advancements in robotic vision technologies by enabling more accurate and efficient perception under stringent hardware constraints.
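
For readers unfamiliar with ICP, the iteration it repeats is short: match each source point to its nearest target point, solve for the rigid transform that best aligns the matched pairs, apply it, and repeat. The following is a minimal 2D point-to-point sketch with brute-force matching, intended only to illustrate that loop; it is not VAN-ICP or the FPGA design, and all function names are chosen for this example.

```python
import math

def best_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping paired src -> dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy          # centred source point
        bx, by = qx - dx, qy - dy          # centred target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)         # closed-form optimal angle in 2D
    c, s = math.cos(theta), math.sin(theta)
    tx = dx - (c * sx - s * sy)            # translation aligns the centroids
    ty = dy - (s * sx + c * sy)
    return theta, tx, ty

def apply(points, theta, tx, ty):
    """Rotate points by theta, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def icp_2d(src, dst, iterations=20):
    """Point-to-point ICP with brute-force nearest-neighbour matching."""
    current = list(src)
    for _ in range(iterations):
        matched = [min(dst, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
                   for p in current]
        theta, tx, ty = best_rigid_2d(current, matched)
        current = apply(current, theta, tx, ty)
    return current
```

The nearest-neighbour search dominates the cost of each iteration, which is why accelerators typically target that step and its memory footprint.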

Abstract:
We consider the problem of adaptively controlling a fleet of robots to maintain a communication network in an adversarial environment. In particular, a network team of robots is tasked with maintaining a directed communication channel at some data rate from an independent task robot to a fixed base station, accommodating the task robot's motion and adversarial intervention in the form of an omnidirectional jammer and network team robot removals. We utilize a physically-motivated model for directed signal strength between robots in the presence of a jammer, introducing asymmetry into communication which challenges connectivity maintenance approaches. Our main contribution in this paper is the introduction of a strategy for translating this directed model into an undirected graph for which enforcing connectedness is sufficient for maintaining high-rate communication. We demonstrate the efficacy of our approach in simulation using a CBF-based controller, showing that our controller maintains a high-rate connection throughout diverse trajectories, even when more conservative controllers fail.
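
One simple way to see how a directed rate model can be reduced to an undirected connectivity question is sketched below. It keeps an undirected edge only when the rate in both directions meets the target, so any path in the resulting graph is usable in either direction. This conservative rule is an assumption made for illustration and is not claimed to be the paper's translation strategy; the controller itself is omitted.

```python
from collections import deque

def undirected_edges(rate, min_rate):
    """Keep edge {i, j} only if both directed rates meet min_rate (assumed rule)."""
    n = len(rate)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if min(rate[i][j], rate[j][i]) >= min_rate}

def is_connected(n, edges):
    """Breadth-first search connectivity check on an undirected graph."""
    neighbours = {i: set() for i in range(n)}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, queue = {0}, deque([0])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node] - seen:   # unseen neighbours only
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == n
```

Here `rate[i][j]` stands for a hypothetical achievable data rate from robot `i` to robot `j`; a jammer near one robot would lower the rates into that robot without lowering the reverse direction, which is the asymmetry the abstract describes.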

Abstract:
Chronic wounds, such as diabetic, pressure, and venous ulcers, affect over 6.5 million patients in the United States alone and generate an annual cost exceeding $25 billion. Despite this burden, chronic wound care remains a routine yet manual process performed exclusively by trained clinicians due to its critical safety demands. We envision a future in which robotics and automation support wound care to lower costs and enhance patient outcomes. This paper introduces an autonomous framework for one of the most fundamental yet challenging subtasks in wound redressing: adhesive tape manipulation. Specifically, we address two critical capabilities: tape initial detachment (TID) and secure tape placement. To handle the complex adhesive dynamics of detachment, we propose a force-feedback imitation learning approach trained from human teleoperation demonstrations. For tape placement, we develop a numerical trajectory optimization method to ensure smooth adhesion and wrinkle-free application across diverse anatomical surfaces. We validate these methods through extensive experiments, demonstrating reliable performance in both quantitative evaluations and integrated wound redressing pipelines. Our results establish tape manipulation as an essential step toward practical robotic wound care automation.

Abstract:
Training robot policies often requires extracting appropriate subsets of data from large and noisy datasets. For example, one might want to extract only robot demonstrations with accurate captions or only those related to cooking. We present RoboSQ, a robot data management system that enables semantic queries. RoboSQ samples temporally distributed frames and overlays projected sensor information from robot trajectories and constructs structured Visual Question Answering (VQA) prompts for Vision-Language Models (VLMs). RoboSQ efficiently handles queries by pipelining data loading, frame extraction, and VLM inference. We evaluate RoboSQ on the DROID dataset with three semantic queries: 1) failure detection, 2) calibration error detection and 3) visual complexity scoring. It filters out the failure trajectories with 78% accuracy and 86% F1 score, and identifies the trajectories with incorrect extrinsic calibration between camera frame and end effector frame at 86% accuracy and 88% F1 score. We evaluate RoboSQ by training a pick-and-place Action Chunking Transformer policy with a UR5 robot arm using mixed quality demonstration data. Data extracted by RoboSQ is closely aligned with the expert-curated data. A policy trained on RoboSQ-selected data achieves 13 successes out of 15 trials, compared to only 1 out of 15 when trained on the full mixed dataset.

Abstract:
Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.

Abstract:
Autonomous manipulation of articulated objects represents a basic skill for robots deployed in human environments. Current vision-based methods can infer hidden object kinematics, but their estimates are sometimes too imprecise to drive reliable actions, especially on previously unseen objects. Tactile methods, on the other hand, excel once contact is made, yet they require a reasonable initial guess about where and how to interact. This observation suggests a natural division of labor: vision provides global, coarse guidance, while touch delivers precise, robust execution. Building on this complementarity, we propose a systematic approach, Vi-TacMan, which uses vision to plan and touch to control. We begin by training a vision module that accurately detects holdable and movable parts. Once identified, these parts are then segmented for further processing. From these detections, the system proposes feasible grasps along with a coarse interaction direction modeled by a von Mises-Fisher (vMF) distribution. To enhance directional reasoning, we explicitly incorporate surface normals on movable regions as a geometric prior. This inductive bias clarifies the expected motion and improves generalization to unseen objects, yielding significant gains over baseline methods (all p-values less than 0.0001). Finally, seeded with the vision-derived grasp and motion direction, a tactile-informed controller establishes and maintains stable interactions, enabling reliable execution of the manipulation. Real-world experiments on diverse objects further confirm reliable manipulation without explicit kinematic models. These findings establish a paradigm for multi-modal robotic perception that could advance autonomous systems operating in complex, unstructured environments.
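
For reference, a von Mises-Fisher distribution over unit directions x in 3D has density f(x; mu, kappa) = kappa / (4 pi sinh(kappa)) * exp(kappa * mu . x), where mu is the mean direction and kappa the concentration. The sketch below evaluates its log-density; it illustrates the distribution only and is not the Vi-TacMan model, which predicts such parameters from images.

```python
import math

def vmf_log_density(x, mu, kappa):
    """Log-density of a von Mises-Fisher distribution on the unit sphere in R^3.

    x, mu : unit 3-vectors (direction to score, mean direction)
    kappa : concentration, kappa >= 0; larger means more peaked around mu
    """
    dot = sum(a * b for a, b in zip(x, mu))
    if kappa == 0.0:
        return -math.log(4.0 * math.pi)    # uniform over the sphere
    # log(sinh k) = k + log(1 - exp(-2k)) - log 2, which stays stable for large k
    log_sinh = kappa + math.log1p(-math.exp(-2.0 * kappa)) - math.log(2.0)
    log_norm = math.log(kappa) - math.log(4.0 * math.pi) - log_sinh
    return log_norm + kappa * dot
```

A low kappa expresses a coarse guess about the interaction direction, which fits the division of labor described above, where touch later refines the motion.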

Abstract:
A central challenge in robust quadruped locomotion, which relies solely on proprioceptive information, is how to effectively encode the history of observations. While current methods, such as regression, struggle with high-dimensional multi-time-step histories, and Temporal Convolutional Networks (TCNs) incur computational overhead, we propose a more efficient and theoretically grounded alternative. Inspired by the Generic Internal Model (GIM) from control theory, we introduce GIMloco, which maps the history of proprioceptive observations into a compact and stable internal model space through a predesigned first-order integral system with stability and orthogonality guarantees. This encoded representation drives three downstream tasks: state estimation, latent variable learning, and control policy learning. Our experiments show that GIMloco outperforms strong baselines in velocity tracking, system overshoot, and response speed. Furthermore, it can navigate more complex terrains while also demonstrating better training stability across random seeds. Crucially, our method reduces training time by two orders of magnitude compared to TCN-based approaches. Our work presents GIMloco as a robust and computationally efficient framework for locomotion based on proprioceptive information.

Abstract:
Enabling robotic assistants to navigate complex environments and locate objects described in free-form language is a critical capability for real-world deployment. While foundation models, particularly Vision-Language Models (VLMs), offer powerful semantic understanding, effectively adapting their web-scale knowledge for embodied decision-making remains a key challenge. We present FiLM-Nav (Fine-tuned Language Model for Navigation), an approach that directly fine-tunes a pre-trained VLM as the navigation policy. In contrast to methods that use foundation models primarily in a zero-shot manner or for map annotation, FiLM-Nav learns to select the next best exploration frontier by conditioning directly on raw visual trajectory history and the navigation goal. Leveraging targeted simulated embodied experience allows the VLM to ground its powerful pre-trained representations in the specific dynamics and visual patterns relevant to goal-driven navigation. Critically, fine-tuning on a diverse data mixture combining ObjectNav, OVON, ImageNav, and an auxiliary spatial reasoning task proves essential for achieving robustness and broad generalization. FiLM-Nav sets a new state-of-the-art in both SPL and success rate on HM3D ObjectNav among open-vocabulary methods, and sets a state-of-the-art SPL on the challenging HM3D-OVON benchmark, demonstrating strong generalization to unseen object categories. Our work validates that directly fine-tuning VLMs on diverse simulated embodied data is a highly effective pathway towards generalizable and efficient semantic navigation capabilities.

Abstract:
Automated inspection of steel structures using magnetic climbing robots can reduce costs and improve safety, but many such structures feature interior corners that are challenging for wheeled or tracked robots to traverse. We present the first magnetic-wheeled robot to use X-ray fluorescence for steel structure inspection, Sally, capable of overcoming all interior corner transition types, traversing small obstacles, and maneuvering in tight spaces. By re-purposing its steering and sensor deployment mechanisms, the robot is able to transition back and forth between a steel wall and an adjacent steel ceiling, steel wall, or any floor. We analyze the feasibility of these interior corner transitions and validate the results through experimental demonstrations with Sally. We also demonstrate line scanning, a continuous surface measurement technique enabled by the wheeled design that estimates the average element concentrations along a line, and show it provides greater accuracy and efficiency in both simulation and robot trials compared to the traditional grid point measurement method. Finally, we discuss lessons learned from a field test of Sally at an industrial site.

Abstract:
DUSt3R-based end-to-end scene reconstruction has recently shown promising results in dense visual SLAM. However, most existing methods only use image pairs to estimate pointmaps, overlooking spatial memory and global consistency. To this end, we introduce GRS-SLAM3R, an end-to-end SLAM framework for dense scene reconstruction and pose estimation from RGB images without any prior knowledge of the scene or camera parameters. Unlike existing DUSt3R-based frameworks, which operate on all image pairs and predict per-pair pointmaps in local coordinate frames, our method supports sequential input and incrementally estimates metric-scale point clouds in a global coordinate frame. To improve the consistency of spatial correlation, we use a latent state for spatial memory and design a transformer-based gated update module to reset and update the spatial memory, which continuously aggregates and tracks relevant 3D information across frames. Furthermore, we partition the scene into submaps, apply local alignment within each submap, and register all submaps into a common world frame using relative constraints, producing a globally consistent map. Experiments on various datasets show that our framework achieves superior reconstruction accuracy while maintaining real-time performance.

Abstract:
As autonomous vehicles (AVs) continue to gain prominence in public life, the cost of their failures becomes increasingly drastic, endangering human life. Such failures arise from AVs' inability to meet their safety specifications in the field. Recent works have aimed to improve AVs' compliance with their safety specifications through improved training and runtime enforcement. However, these methods are limited, requiring access to system internals or relying on narrow assumptions, which reduces their generality. In this work, we propose a different paradigm, Monitoring for Property Compliance (M4PC), which independently evaluates the system's compliance with the specification. The approach operates in two steps. First, it leverages scene graph abstractions and a specialized graph generator to map sensor data to driving rule preconditions to determine if an intervention is needed. Second, to correct an erroneous system output, M4PC defines a safe region within the control space defined by all relevant postconditions and minimally alters the system's output to ensure it remains within this safe region, thereby preventing property violations. We apply M4PC to improve the specification compliance of three state-of-the-art autonomous vehicles with varying architectures in the CARLA simulator. Our current implementation can improve a baseline system, while our most optimized implementation outperforms state-of-the-art techniques that require system access.

Abstract:
Parameter estimation in robotics and computer vision faces formidable challenges from both outlier contamination and nonconvex optimization landscapes. While M-estimation addresses the problem of outliers through robust loss functions, it creates severely nonconvex problems that are difficult to solve globally. Adaptive reweighting schemes provide one particularly appealing strategy for implementing M-estimation in practice: these methods solve a sequence of simpler weighted least squares (WLS) subproblems, enabling both the use of standard least squares solvers and the recovery of higher-quality estimates than simple local search. However, adaptive reweighting still crucially relies upon solving the inner WLS problems effectively, a task that remains challenging in many robotics applications due to the intrinsic nonconvexity of many common parameter spaces (e.g., rotations and poses). In this paper, we show how one can easily implement adaptively reweighted M-estimators with a certifiably correct solver for the inner WLS subproblems using only fast local optimization over smooth manifolds. Our approach exploits recent work on certifiable factor graph optimization to provide global optimality certificates for the inner WLS subproblems while seamlessly integrating into existing factor graph-based software libraries and workflows. Experimental evaluation on pose-graph optimization and landmark SLAM tasks demonstrates that our adaptively reweighted certifiable estimation approach provides higher-quality estimates than alternative local search-based methods, while scaling tractably to realistic problem sizes.
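
The adaptive reweighting loop described in this abstract can be illustrated on a toy problem. The sketch below is not the authors' certifiable solver: it assumes a linear line-fitting model (so each inner WLS subproblem is convex and solved by ordinary linear least squares) and the Huber loss, and the names `huber_weights` and `irls_line_fit` are hypothetical. It only shows the outer structure: solve a WLS problem, recompute weights from the residuals, repeat.

```python
import numpy as np


def huber_weights(residuals, delta):
    """IRLS weights for the Huber loss: 1 inside the threshold, delta/|r| outside."""
    abs_r = np.abs(residuals)
    return np.where(abs_r <= delta, 1.0, delta / np.maximum(abs_r, 1e-12))


def irls_line_fit(x, y, delta=1.0, iters=100):
    """Fit y ~ slope * x + intercept by iteratively reweighted least squares.

    Assumption: a linear model, so every inner WLS subproblem is solved
    globally by np.linalg.lstsq. On rotations or poses the inner problem
    is nonconvex, which is the difficulty the abstract addresses.
    """
    A = np.column_stack([x, np.ones_like(x)])
    w = np.ones_like(y)
    theta = np.zeros(2)
    for _ in range(iters):
        sw = np.sqrt(w)
        # Inner WLS subproblem: minimize sum_i w_i * (y_i - a_i^T theta)^2.
        theta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        # Outer step: recompute the weights from the current residuals.
        w = huber_weights(y - A @ theta, delta)
    return theta
```

On data that lie on a line except for a few gross outliers, the first iteration is plain least squares and is pulled toward the outliers; later iterations shrink their weights so the estimate moves back toward the inlier line.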

Abstract:
Contact-implicit trajectory optimization (CITO) enables the automatic discovery of contact sequences, but most methods rely on fine time discretization to capture all contact events accurately, which increases problem size and runtime while tying solution quality to grid resolution. We extend the recently proposed sequential convex programming (SCP) approach for trajectory optimization, continuous-time successive convexification (ct-SCvx), to CITO by introducing integral cross-complementarity constraints, which eliminate the risk of missing contact events between discretization nodes while preserving the flexibility of contact mode changes. The resulting framework, contact-implicit successive convexification (ci-SCvx), models full multibody dynamics in maximal coordinates, including stick-slip friction and partially elastic impacts. To handle complementarity constraints, we embed a backtracking homotopy scheme within SCP for reliable convergence. We implement this framework in a stand-alone Python package, leveraging JAX for GPU acceleration and a custom canonical-form parser for the convex subproblems of SCP to avoid the overhead of general-purpose modeling tools such as CVXPY. We demonstrate ci-SCvx on diverse legged-locomotion tasks. In particular, we validate the approach in MuJoCo with the Gymnasium HalfCheetah model against the MuJoCo MPC baseline, showing that a tracking simulation with the optimized torque profiles from ci-SCvx produces physically consistent trajectories with lower energy consumption. We also show that the resulting software achieves solve times more than an order of magnitude faster than an existing state-of-the-art SCP toolbox, thereby demonstrating a practically important contribution to scalable real-time trajectory optimization.

Abstract:
Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action-level decisions with high-fidelity pixel-level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer to form a unified imagination-and-planning loop. The driving agent predicts initial driving trajectories based on multi-modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent's planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open-loop and closed-loop conditions.

Abstract:
Current vision-based SLAM systems fail catastrophically when motion blur corrupts the visual input, as they attempt the ill-posed inverse problem of recovering sharp content from degraded observations. We present MotionGS-SLAM, which fundamentally reimagines motion blur handling through a paradigm shift: rather than removing blur artifacts, we reformulate the challenge as a well-constrained forward problem that generatively models blur formation within the rendering pipeline. By leveraging event cameras' microsecond temporal resolution and immunity to motion blur, we introduce a novel event-modulated Gaussian kernel that dynamically adapts each Gaussian's rasterization based on precise motion cues. Our dual-modulation mechanism transforms 2D Gaussian projections from isotropic dots into anisotropic, motion-aligned elliptical brush strokes (spatial modulation) while adaptively varying exposure integral sampling density based on local velocity (temporal modulation). This physics-based approach enables joint optimization of intra-exposure camera trajectories and 3D scene geometry through blur-aware photometric and event-based constraints. Extensive experiments demonstrate significant improvements over state-of-the-art methods in trajectory accuracy and map quality under severe high-motion conditions.

Abstract:
Out-Of-Distribution (OOD) detection, the task of identifying when an input falls outside the distribution seen at training time, is critical for deploying safe and reliable systems. Traditional OOD methods require retraining models whenever the in-distribution data changes. Recent work introduces unified models for OOD detection, where metrics can be constructed from an unconditional diffusion model trained on an arbitrary dataset, and the inlier distribution can be changed without retraining the diffusion model. However, these unified approaches have been largely confined to Euclidean or latent space domains. In contrast, real-world robotics systems often perceive and act through sequences of 6-degrees-of-freedom poses in the Special Euclidean Group SE(3), taking into account both translations and orientation changes over time. In this work, we extend OOD detection to trajectories in SE(3) by presenting Diffusion-based Out-of-distribution detection on SE(3) (DOSE3). DOSE3 constructs an OOD metric from the noise estimator model of a diffusion model over SE(3) to separate outlier samples from inlier distributions. We demonstrate DOSE3's strong OOD detection performance through extensive validation on multiple real-world robotics and autonomous systems datasets, covering vehicle and robot manipulator motion trajectories.

Abstract:
This paper presents an analytical framework for parameter conversion between Complete and Parametrically Continuous (CPC) and Product of Exponentials (POE) kinematic models of serial-chain mechanisms. The approach, grounded in Lie group and algebra theory, formulates and proves three key lemmas to enable exact POE-to-CPC parameter conversion. Building upon established POE-DH and CPC-DH transitions, the proposed framework facilitates flexible model selection based on application-specific needs, independent of the initial parameterization. Primarily designed for robot calibration, this method also serves as a unifying tool for comparing and analyzing calibration results across different kinematic conventions. The framework's effectiveness is demonstrated through numerical validation on the PUMA 560 robot, confirming its accuracy and practical applicability.

Abstract:
Swarmalator studies have enabled self-organized collective behaviors that emerge from dual spatial and temporal coupling, without relying on external inputs. These behaviors arise solely from attractive and repulsive interactions modulated by a few global parameters. Here, we treat the swarmalator model as a planner and study how several of the collective behaviors change in terms of space-phase organization when the agents are robots with vehicle dynamics and constraints, including omnidirectional, unicycle, and bicycle dynamics. Furthermore, we use the control barrier function method to guarantee that the collective can navigate around objects, through cluttered environments, and transport objects in between obstacles by exploiting global and local control methods. This work brings us closer to realizing large groups of robotic swarmalators, of heterogeneous dynamics, that can enable shape formation, navigation, and object manipulation in cluttered environments.
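
To make the "attractive and repulsive interactions modulated by a few global parameters" concrete, the sketch below evaluates the right-hand side of a commonly cited form of the swarmalator model, in which the coupling `J` ties spatial attraction to phase similarity and `K` sets phase synchronization strength. The abstract does not state which exact equations or normalization the authors use, so this form, the function name `swarmalator_rates`, and the default parameters are assumptions; vehicle dynamics and control barrier functions are not modeled here.

```python
import numpy as np


def swarmalator_rates(pos, phase, J=1.0, K=0.0):
    """Return (velocity, phase_rate) for N agents.

    Assumed model, with sums over j != i and a 1/N normalization:
      dx_i/dt     = (1/N) sum_j [ (1 + J cos(th_j - th_i)) / d_ij - 1 / d_ij^2 ] (x_j - x_i)
      dtheta_i/dt = (K/N) sum_j sin(th_j - th_i) / d_ij
    pos has shape (N, 2) and phase has shape (N,).
    """
    n = pos.shape[0]
    diff = pos[None, :, :] - pos[:, None, :]      # diff[i, j] = x_j - x_i
    dist = np.linalg.norm(diff, axis=-1)          # pairwise distances d_ij
    np.fill_diagonal(dist, 1.0)                   # avoid 0/0; diagonal terms vanish anyway
    dphase = phase[None, :] - phase[:, None]      # th_j - th_i

    attract = (1.0 + J * np.cos(dphase)) / dist   # phase-modulated attraction
    repel = 1.0 / dist**2                         # short-range repulsion
    velocity = ((attract - repel)[:, :, None] * diff).sum(axis=1) / n

    phase_rate = K * (np.sin(dphase) / dist).sum(axis=1) / n
    return velocity, phase_rate
```

In a planner-style setup such as the one the abstract describes, these rates would serve as reference commands that a lower-level controller for the omnidirectional, unicycle, or bicycle platform then tracks.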

Abstract:
Risk stratification is the process of segmenting patients into distinct groups of similar complexity and care needs in order to improve resource allocation. Patients are typically risk stratified using statistical or machine learning methods that generate an individual risk score for some measure of resource use. One of the main limitations of existing methods is reduced interpretability, which is often inherent to artificial intelligence techniques. In this work, we propose a novel risk stratification approach that optimizes the representation of different patient groups and generates interpretable risk profiles. We associate risk scores to patient profiles and determine the optimal combination of representative profiles for each patient group using a Mixed Integer Programming (MIP) formulation. We generate continuous ratings for patient risk scores ranging from 0 to 1 that allow for dynamic thresholding. Our method stratifies patients into several risk groups (e.g., low, medium, high risk), which is frequently more clinically significant than binary classification. We apply our approach to both public and proprietary real data in the context of accidental fall risk assessment and show that the generated risk profiles provide clinical insights that can be used for the design of targeted interventions.

Abstract:
Existing methods for novel view synthesis depend on dense input images and accurate camera poses, which significantly limits their practical application. We propose a novel framework that enables high-quality sparse-view reconstruction via 3D Gaussian Splatting (3DGS) without knowing camera poses. Our approach leverages MASt3R, a ViT-based multi-view stereo prior, to generate point clouds and coarse camera poses from uncalibrated sparse images. We use the point clouds to initialize 3DGS. Additionally, we propose several regularization techniques, including point-rendered LPIPS regularization, geometric regularization (local depth regularization and normal regularization), and semantic regularization, to improve the quality of reconstructed scenes and enhance the generalization capability of the model at unseen viewpoints. Because the camera poses output by MASt3R are inaccurate, we optimize the camera poses during both the training and testing phases. Experimental results on the Tanks and Temples and MVImgNet datasets demonstrate that our method outperforms state-of-the-art techniques in novel view synthesis and camera pose estimation under sparse-view settings. Our approach achieves higher fidelity and more photorealistic visual effects.

Abstract:
With the growing acceptance of robotics in daily life, there is a growing need for certifiably safe control policies. While simulation provides a safe training environment, policies often fail in sim-to-real transfer. We propose a data-driven certification framework for reinforcement learning based on Pick-to-Learn (P2L), a meta-algorithm that uses data preference ordering to compute probabilistic bounds on the satisfaction of application-dependent properties of interest. Our results demonstrate that using P2L maintains high performance while distinguishing between policies that appear similar under domain randomization alone. This work offers a practical method for preparing safe reinforcement learning policies by providing formal safety guarantees prior to hardware deployment.

Abstract:
Contrastive objectives such as InfoNCE align multimodal representations at the instance level but do not preserve intra-modal geometry, a limitation called the structural alignment gap. We propose UniOMA, a multimodal structural alignment method that uses a Gromov-Wasserstein (GW) barycenter regularizer to align each modality to a shared structural consensus, scaling linearly to 3+ modalities. Experiments on five robotic benchmarks (vision, force, depth, audio, tactile, proprioception) show consistent improvements in downstream tasks like regression, classification, and cross-modal retrieval.
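
For readers unfamiliar with the instance-level objective this abstract starts from, the sketch below computes a one-directional InfoNCE loss between two batches of paired embeddings. It is a generic illustration, not UniOMA and not its GW barycenter regularizer; the function name `info_nce`, the cosine-similarity scoring, and the temperature value are assumptions.

```python
import numpy as np


def info_nce(za, zb, temperature=0.1):
    """One-directional InfoNCE loss for paired embeddings.

    za and zb have shape (batch, dim); row i of za and row i of zb form the
    positive pair, and every other row of zb acts as a negative for za[i].
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature                  # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching index as the target class.
    return -np.mean(np.diag(log_prob))
```

The loss depends only on cross-modal similarities between instances; nothing in it constrains the pairwise distances within `za` or within `zb`, which is the gap the abstract refers to.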

Abstract:
Reliable Global Navigation Satellite System (GNSS) signals are increasingly denied or jammed in real-world applications, such as search and rescue operations. In such scenarios, Unmanned Aerial Vehicles (UAVs) must rely on downward-facing cameras for absolute localization against reference satellite maps. While Visual Inertial Odometry (VIO) is highly accurate locally, it inevitably accumulates drift over time. Localizing a drone image against a pre-existing satellite map (e.g., Google Earth) via homography estimation is a viable solution, but it is severely challenged by seasonal variations, construction, and vegetation changes. In this paper, we propose Sat-RoMa, an end-to-end robust dense feature matcher adapted from the state-of-the-art RoMa architecture. By utilizing a frozen, pre-trained DinoV3 encoder specifically tuned for satellite imagery, and formulating the task as matching a small drone image to a 4x larger reference map, Sat-RoMa explicitly handles scale discrepancies and temporal appearance changes. Preliminary results demonstrate that Sat-RoMa significantly outperforms baselines like LoFTR and LightGlue, achieving a 16.0% scale error compared to over 100% for existing methods, paving the way for robust GPS-denied UAV navigation.

Abstract:
We propose DistGP: a multi-robot learning method for collaborative learning of a global function using only local experience and computation. We utilise a sparse Gaussian process (GP) model with a factorisation that mirrors the multi-robot structure of the task, and admits distributed training via Gaussian belief propagation (GBP). Our loopy model outperforms Tree-Structured GPs [1] and can be trained online and in settings with dynamic connectivity. We show that such distributed, asynchronous training can reach the same performance as a centralised, batch-trained model, albeit with slower convergence. Last, we compare to DiNNO [2], a distributed neural network (NN) optimiser, and find DistGP achieves superior accuracy, is more robust to sparse communication and is better able to learn continually.

Abstract:
Underwater intervention is an important capability in several marine domains, with numerous industrial, scientific, and defense applications. However, existing perception systems used during intervention operations rely on data from optical cameras, which limits capabilities in poor visibility or lighting conditions. Prior work has examined opti-acoustic fusion methods, which use sonar data to resolve the depth ambiguity of the camera data while using camera data to resolve the elevation angle ambiguity of the sonar data. However, existing methods cannot achieve dense 3D reconstructions in real time, and few studies have reported results from applying these methods in a turbid environment. In this work, we propose the opti-acoustic fusion method Sonar-MASt3R, which uses MASt3R to extract dense correspondences from optical camera data in real time and pairs them with geometric cues from an acoustic 3D reconstruction to ensure robustness in turbid conditions. Experimental results using data recorded from an opti-acoustic eye-in-hand configuration across turbidity values ranging from <0.5 to >12 NTU highlight this method's improved robustness to turbidity relative to baseline methods.

Abstract:
Humanoid robots are envisioned to adapt demonstrated motions to diverse real-world conditions while accurately preserving motion patterns. Existing motion prior approaches achieve good adaptability with a few motions but often sacrifice imitation accuracy, whereas motion-tracking methods achieve accurate imitation yet require many training motions and a test-time target motion to adapt. To combine their strengths, we introduce AdaMimic, a novel motion tracking algorithm that enables adaptable humanoid control from a single reference motion. To reduce data dependence while ensuring adaptability, our method first creates an augmented dataset by sparsifying the single reference motion into keyframes and applying light editing with minimal physical assumptions. A policy is then initialized by tracking these sparse keyframes to generate dense intermediate motions, and adapters are subsequently trained to adjust tracking speed and refine low-level actions based on the adjustment, enabling flexible time warping that further improves imitation accuracy and adaptability. We validate the improvements of our approach both in simulation and on a real Unitree G1 humanoid robot, in multiple tasks across a wide range of adaptation conditions. Videos and code are available at https://taohuang13.github.io/adamimic.github.io/.

Abstract:
We propose a novel walking control scheme based on the dynamics of the Linear Inverted Pendulum (LIP) model. The pattern generation incorporates a model of contact forces, enabling closed-loop control of the humanoid robot's state, including the Center of Mass (CoM) position, velocity, and Zero Moment Point (ZMP). No additional control policies are required to maintain static and dynamic balance. Our approach also includes dynamic re-planning of step locations and timings, thus preserving the LIP's boundedness condition. We validated this controller on five different humanoid robots, testing its robustness through various disturbances, including sudden pushes during walking and static phases. Additionally, our controller demonstrated effective locomotion over uneven and compliant terrain. Both simulation and experimental results confirm the effectiveness and robustness of this controller.
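
As background for the boundedness condition mentioned in this abstract, the sketch below evaluates the standard textbook LIP dynamics in one horizontal dimension, x'' = (g / z_c) (x - p), in closed form. It is not the authors' controller: the constant CoM height, the ZMP held fixed over the interval, and the helper names `lip_omega`, `lip_state`, and `dcm` are all assumptions made for illustration.

```python
import math

GRAVITY = 9.81  # m/s^2


def lip_omega(com_height):
    """Natural frequency of the LIP, omega = sqrt(g / z_c)."""
    return math.sqrt(GRAVITY / com_height)


def lip_state(x0, v0, zmp, com_height, t):
    """Closed-form CoM position and velocity after time t with a constant ZMP."""
    w = lip_omega(com_height)
    c, s = math.cosh(w * t), math.sinh(w * t)
    x = zmp + (x0 - zmp) * c + (v0 / w) * s
    v = (x0 - zmp) * w * s + v0 * c
    return x, v


def dcm(x, v, com_height):
    """Divergent component of motion (capture point), xi = x + v / omega."""
    return x + v / lip_omega(com_height)
```

Under these assumptions the divergent component obeys xi(t) - p = (xi(0) - p) exp(omega t), so the CoM stays bounded only if the ZMP (through step location and timing) is chosen to keep xi from running away. Placing the ZMP exactly at `dcm(x0, v0, com_height)` makes the CoM converge to that point, which is one way to read why re-planning steps can preserve boundedness.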

Abstract:
The underactuation of conventional aerial vehicles limits their ability to independently control position and attitude, motivating the use of overactuated designs such as tilt-rotor quadrotors. Existing works on tilt-rotor quadrotors primarily focus on determining the minimum thrust-to-weight ratio required for hovering at arbitrary orientations. However, they do not address the maximum allowable attitude range within which independent control is feasible given specific thrust constraints. In this work, we investigate the feasible attitude range within which a tilt-rotor quadrotor can maintain independent control, given rotor thrust limits. First, we formulate the thrust constraints as convex functions and solve them using convex optimization techniques to identify feasible sets. To determine the maximum attitude that allows for independent control under thrust constraints, we pose a nonconvex optimization problem and employ a successive convex approximation technique to compute an optimal solution, which corresponds to the optimal solution of the original nonconvex problem. Given the maximum attitude limits, we then compute the minimum thrust required per rotor to achieve independent control. Furthermore, we determine the maximum allowable disturbance magnitude that the tilt-rotor quadrotor can handle while retaining independent control. The study results are verified through processor-in-the-loop (PIL) simulations and outdoor hardware experiments on a tilt-rotor quadrotor.

Abstract:
Robotic wrists play a crucial role in enhancing the dexterity and stability of robotic end-effectors. Existing rigid robotic wrists tend to be complex and lack flexibility, while soft robotic wrists often struggle with limited load-bearing capacity and lower accuracy. Human wrists feature multiple degrees of freedom and variable stiffness, which help human hands accomplish daily tasks. This study presents an innovative anthropomorphic soft robotic wrist, VarWrist, equipped with a fiber jamming variable stiffness module, enabling stiffness adjustment through vacuuming. VarWrist consists of three parallel bellows, utilizing a positive-negative pneumatic actuation strategy to mimic human wrist motion. In addition, the trajectory equation of the rotation center was fitted through modeling. We developed a prototype of VarWrist and assessed its performance. Results indicate that the soft wrist surpasses the motion range of human wrists, achieving flexion (81.9°), extension (78.5°), ulnar deviation (70.5°), and radial deviation (70.5°). The bending motion trajectory showed a 73% increase in similarity to human motion compared to fixed-axis rotation, with VarWrist exhibiting a significant range of variable stiffness (resting state: 206%, working state: 155%). Demonstration experiments confirm that this wrist enables a dexterous hand to complete grasping tasks that would be unattainable by the hand alone.

Abstract:
Mobile robots have revolutionized various fields, offering solutions for manipulation, environmental monitoring, and exploration. However, payload capacity remains a limitation. This paper presents a novel thrust-based robotic hopper capable of carrying payloads up to 9 times its own weight while maintaining agile mobility over less structured terrain. The 220 gram robot carries up to 2 kg while hopping--a capability that bridges the gap between high-payload ground robots and agile aerial platforms. Key advancements that enable this high-payload capacity include the integration of bidirectional thrusters, allowing for both upward and downward thrust generation to enhance energy management while hopping. Additionally, we present a refined model of dynamics that accounts for heavy payload conditions, particularly for large jumps. To address the increased computational demands, we employ a neural network compression technique, ensuring real-time onboard control. The robot's capabilities are demonstrated through a series of experiments, including leaping over a high obstacle, executing sharp turns with large steps, as well as performing simple autonomous navigation while carrying a 730 g LiDAR payload. This showcases the robot's potential for applications such as mobile sensing and mapping in challenging environments.

Abstract:
Accurately predicting how agents move in dynamic scenes is essential for safe autonomous driving. State-of-the-art motion forecasting models rely on datasets with manually annotated or post-processed trajectories. However, building these datasets is costly, generally manual, hard to scale, and difficult to reproduce. They also introduce domain gaps that limit generalization across environments. We introduce PPT (Pretraining with Pseudo-labeled Trajectories), a simple and scalable pretraining framework that uses unprocessed and diverse trajectories automatically generated from off-the-shelf 3D detectors and tracking. Unlike data annotation pipelines aiming for clean, single-label annotations, PPT is a pretraining framework embracing off-the-shelf trajectories as useful signals for learning robust representations. With optional finetuning on a small amount of labeled data, models pretrained with PPT achieve strong performance across standard benchmarks, particularly in low-data regimes, and in cross-domain, end-to-end, and multi-class settings. PPT is easy to implement and improves generalization in motion forecasting.

Abstract:
This paper presents a reinforcement learning-based quadrotor navigation method that leverages efficient differentiable simulation, novel loss functions, and privileged information to navigate around large obstacles. Prior learning-based methods perform well in scenes that exhibit narrow obstacles, but struggle when the goal location is blocked by large walls or terrain. In contrast, the proposed method utilizes time-of-arrival (ToA) maps as privileged information and a yaw alignment loss to guide the robot around large obstacles. The policy is evaluated in photo-realistic simulation environments containing large obstacles, sharp corners, and dead-ends. Our approach achieves an 86% success rate and outperforms baseline strategies by 34%. We deploy the policy onboard a custom quadrotor in outdoor cluttered environments both during the day and night. The policy is validated across 20 flights, covering 589 meters without collisions at speeds up to 4 m/s.

Abstract:
Event-based localization research and datasets are a rapidly growing area of interest, with a tenfold increase in the cumulative total number of published papers on this topic over the past 10 years. Whilst the rapid expansion in the field is exciting, it brings with it an associated challenge: a growth in the variety of required code and package dependencies as well as data formats, making comparisons difficult and cumbersome for researchers to implement reliably. To address this challenge, we present Event-LAB: a new and unified framework for running several event-based localization methodologies across multiple datasets. Event-LAB is implemented using the Pixi package and dependency manager, which enables a single command-line installation and invocation for combinations of localization methods and datasets. To demonstrate the capabilities of the framework, we implement two common event-based localization pipelines: Visual Place Recognition (VPR) and Simultaneous Localization and Mapping (SLAM). We demonstrate the ability of the framework to systematically visualize and analyze the results of multiple methods and datasets, revealing key insights such as the link between large variations in performance and the parameters that control event collection counts and window sizes for frame generation. The results and analysis demonstrate the importance of fairly comparing methodologies with consistent event image generation parameters. Our Event-LAB framework provides this ability for the research community, by contributing a streamlined workflow for easily setting up multiple conditions.

Abstract:
We propose DemoDiffusion, a simple method for enabling robots to perform manipulation tasks by imitating a single human demonstration, without requiring task-specific training or paired human-robot data. Our approach is based on two insights. First, the hand motion in a human demonstration provides a useful prior for the robot's end-effector trajectory, which we can convert into a rough open-loop robot motion trajectory via kinematic retargeting. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in context. To address this, we leverage a pre-trained generalist diffusion policy to modify the trajectory, ensuring it both follows the human motion and remains within the distribution of plausible robot actions. Unlike approaches based on online reinforcement learning or paired human-robot data, our method enables robust adaptation to new tasks and scenes with minimal effort. In real-world experiments across 8 diverse manipulation tasks, DemoDiffusion achieves an 83.8% average success rate, compared to 13.8% for the pre-trained policy and 52.5% for kinematic retargeting, succeeding even on tasks where the pre-trained generalist policy fails entirely. Project page: https://demodiffusion.github.io/

Abstract:
Surgical robotics has revolutionized medical procedures by offering enhanced precision and reduced complications. However, vitreoretinal surgery still relies heavily on manual techniques, where surgeons manage both a surgical tool and a light pipe, complicating operations and potentially affecting outcomes. To improve efficiency and outcomes while reducing workloads on surgeons, a novel vision-based robot-assisted system with advanced surgical scene understanding ability is proposed. The system automatically positions a light pipe held by a specialized surgical robot through optimization-based visual collaborative control. By identifying target areas for automatic illumination, the system allows surgeons to focus on surgical tasks and supports more complex surgeries such as three-arm procedures. In addition, the system enhances surgical safety by detecting surgical activities and dangerous areas and issuing alerts accordingly. Postoperatively, the system records tool trajectories and detected activities, providing data for surgical reports, skill evaluation, and training. Experiments demonstrate the effectiveness of the control system, visual algorithm, and overall collaborative system.

Abstract:
Robotic manipulation requires explicit or implicit knowledge of the robot's joint positions. Precise proprioception is standard in high-quality industrial robots but is often unavailable in inexpensive robots operating in unstructured environments. In this paper, we ask: to what extent can a fast, single-pass regression architecture perform visual proprioception from a single external camera image, available even in the simplest manipulation settings? We explore several latent representations, including CNNs, VAEs, ViTs, and bags of uncalibrated fiducial markers, using fine-tuning techniques adapted to the limited data available. We evaluate the achievable accuracy through experiments on an inexpensive 6-DoF robot.

Abstract:
In this work, we study Compositional Dexterous Functional Object Manipulation (CD-FOM): tasks such as aiming and actuating a spray bottle on a plant or a glue gun on wood, which require both actuating an object's internal mechanism and controlling its pose to apply the object's function to the environment. These tasks pose significant challenges for robots due to the demanding integration of semantic understanding (of the object's function, actuation mode, and application area) with intricate physical dexterity (to manage grasp stability, movement trajectory, and actuation). We introduce CoDex, a zero-demonstration framework that autonomously discovers CD-FOM manipulation strategies. CoDex uses vision-language models (VLMs) to infer semantic constraints from the task and scene. These constraints guide analytic constrained optimization to generate a short list of functional grasp candidates that can be efficiently refined with reinforcement learning to generate full grasp-move-actuate policies transferrable from simulation to the real world. We evaluate CoDex on a 7-DoF robot arm with a 16-DoF multi-fingered hand across six CD-FOM tasks involving previously unseen objects with internal mechanisms (spray bottles, hot glue guns, air dusters, flashlights, pepper grinders) and their application to unseen target objects, showcasing its ability to autonomously discover and execute complex, physically viable dexterous behaviors without human demonstrations. More information at https://robin-lab.cs.utexas.edu/CoDex/.

Abstract:
Deformable objects often appear in unstructured configurations. Tracing deformable objects helps bring them into extended states and facilitates downstream manipulation tasks. Due to the requirements for object-specific modeling or sim-to-real transfer, existing tracing methods either lack generalizability across different categories of deformable objects or struggle to complete tasks reliably in the real world. To address this, we propose a novel visual-tactile imitation learning method to achieve one-dimensional (1D) and two-dimensional (2D) deformable object tracing with a unified model. Our method is designed from both local and global perspectives based on visual and tactile sensing. Locally, we introduce a weighted loss that emphasizes actions maintaining contact near the center of the tactile image, improving fine-grained adjustment. Globally, we propose a tracing task loss that helps the policy regulate task progression. On the hardware side, to compensate for the limited features extracted from visual information, we integrate tactile sensing into a low-cost teleoperation system considering both the teleoperator and the robot. Extensive ablation and comparative experiments on diverse 1D and 2D deformable objects demonstrate the effectiveness of our approach, achieving an average success rate of 80% on seen objects and 65% on unseen objects. Demos, code and datasets are available at https://sites.google.com/view/vitac-tracing.

Abstract:
Robotic navigation in dense, cluttered environments such as agricultural canopies presents significant challenges due to physical and visual occlusion caused by leaves and branches. Traditional vision-based or model-dependent approaches often fail in these settings, where physical interaction without damaging foliage and branches is necessary to reach a target. We present a novel reactive controller that enables safe navigation for a robotic arm in a contact-rich, cluttered, deformable environment using end-effector position and real-time tactile feedback. Our proposed framework's interaction strategy is based on a trade-off between minimizing disturbance by maneuvering around obstacles and pushing through them to move towards the target. We show that over 35 trials in 3 experimental plant setups with an occluded target, the proposed controller successfully reached the target in all trials without breaking any branch, and outperformed the established control strategy for dense foliage in reliability and adaptability. This work lays the foundation for safe, adaptive interaction in cluttered, contact-rich deformable environments, enabling future agricultural tasks such as pruning and harvesting in plant canopies.

Abstract:
We demonstrate the surprising real-world effectiveness of a very simple approach to whole-body model-predictive control (MPC) of quadruped and humanoid robots: the iterative linear-quadratic regulator (iLQR) algorithm with MuJoCo dynamics and finite-difference approximated derivatives. Building upon the previous success of model-based behavior synthesis and control of locomotion and manipulation tasks with MuJoCo in simulation, we show that these policies can easily generalize to the real world with few sim-to-real considerations. Our baseline method achieves real-time MPC while leveraging whole-body dynamics collision detection on a variety of hardware experiments, including dynamic quadruped locomotion, quadruped walking on two legs, and full-sized humanoid bipedal locomotion. Additionally, our GUI system enables users to interactively update robot behavior in real time on the robot hardware, making task-specific objective parameter tuning easy and intuitive. Our code is available at: https://johnzhang3.github.io/mujoco_ilqr
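
As an illustration of the finite-difference derivatives mentioned in this abstract, the following minimal sketch linearizes a discrete-time step function around a state-control pair, which is the linearization an iLQR backward pass consumes. It is not the authors' implementation: `pendulum_step` is a hypothetical toy stand-in for a simulator step, and the function names, the step size `eps`, and the stacked `[state, control]` input layout are all assumptions made for this example.

```python
import math


def finite_difference_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x (f maps a list of floats to a list of floats)."""
    n = len(x)
    m = len(f(x))
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return jac


def pendulum_step(z, dt=0.01, g=9.81, length=1.0):
    """Hypothetical toy dynamics: z = [theta, omega, torque] -> next [theta, omega]."""
    theta, omega, torque = z
    omega_next = omega + dt * (-(g / length) * math.sin(theta) + torque)
    theta_next = theta + dt * omega_next
    return [theta_next, omega_next]


# Linearize around the hanging equilibrium with zero torque.
jac = finite_difference_jacobian(pendulum_step, [0.0, 0.0, 0.0])
A = [row[:2] for row in jac]  # derivative of the next state w.r.t. the state
B = [row[2:] for row in jac]  # derivative of the next state w.r.t. the control
```

Each Jacobian column costs two calls to the step function, so the total cost grows linearly with the combined state and control dimension.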

Abstract:
Graph-based SLAM models robot poses as vertices and relative-pose measurements (odometry and loop-closures) as edges. Odometry edges are always kept to preserve connectivity, while loop-closure edges reduce drift but cannot all be stored due to memory or computation limits. Our challenge is to decide online which closures to keep under a strict budget, when the full set of measurements cannot be stored or centralized. Prior work instead addresses an offline problem that assumes access to the complete pose graph and optimizes a log-determinant (D-optimality) surrogate. In the online regime, an additional difficulty arises because the odometry backbone grows over time and the utility of each loop-closure changes as the graph evolves. We formulate this problem as streaming submodular maximization with a time-varying log-determinant objective. We propose a one-pass preemptive greedy policy that operates with exactly k memory slots for loop-closures. We show that, under arbitrary arrival order, it achieves a uniform constant-factor guarantee on the log-determinant improvement beyond an odometry-only baseline, relative to the hindsight-optimal size-k solution. On benchmark data, the proposed method closely matches offline greedy despite the conservative bound, showing that principled streaming selection can recover most of the benefit of loop-closures while respecting resource limits.
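
To make the budgeted selection problem in this abstract concrete, here is a minimal sketch of a one-pass, swap-based selection rule under a log-determinant objective. It is a simplified illustration, not the policy proposed in the paper: it assumes scalar poses (so the information matrix is a reduced graph Laplacian), a fixed number of poses rather than a growing odometry backbone, and a hypothetical additive swap threshold `min_gain`. All function names are invented for this example, and the sketch carries no approximation guarantee.

```python
import math


def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][m] * chol[j][m] for m in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


def information_matrix(num_poses, edges):
    """Reduced weighted Laplacian of a scalar pose graph with pose 0 anchored."""
    n = num_poses - 1
    info = [[0.0] * n for _ in range(n)]
    for i, j, weight in edges:  # each edge links two distinct poses
        if i > 0:
            info[i - 1][i - 1] += weight
        if j > 0:
            info[j - 1][j - 1] += weight
        if i > 0 and j > 0:
            info[i - 1][j - 1] -= weight
            info[j - 1][i - 1] -= weight
    return info


def objective(num_poses, closures):
    """Log-determinant of the odometry chain plus the given loop closures."""
    odometry = [(i, i + 1, 1.0) for i in range(num_poses - 1)]
    return log_det(information_matrix(num_poses, odometry + list(closures)))


def streaming_select(num_poses, stream, k, min_gain=1e-9):
    """One pass over the stream, never storing more than k loop closures."""
    kept = []
    for edge in stream:
        if len(kept) < k:
            kept.append(edge)
            continue
        best_value, best_set = objective(num_poses, kept), None
        for idx in range(len(kept)):
            # Try preempting one stored closure in favor of the new arrival.
            candidate = kept[:idx] + kept[idx + 1:] + [edge]
            value = objective(num_poses, candidate)
            if value > best_value + min_gain:
                best_value, best_set = value, candidate
        if best_set is not None:
            kept = best_set
    return kept


stream = [(0, 5, 1.0), (1, 2, 1.0), (0, 3, 1.0), (2, 5, 1.0)]
selected = streaming_select(6, stream, k=2)
```

Because a swap is accepted only when it raises the objective, the stored set never gets worse as the stream is processed, but a discarded closure cannot be recovered later.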

Abstract:
Understanding object affordances is essential for enabling robots to perform purposeful and fine-grained interactions in diverse and unstructured environments. However, existing approaches either rely on retrieval, which is fragile due to sparsity and coverage gaps, or on large-scale models, which frequently mislocalize contact points and mispredict post-contact actions when applied to unseen categories, thereby hindering robust generalization. We introduce Retrieval-Augmented Affordance Prediction (RAAP), a framework that unifies affordance retrieval with alignment-based learning. By decoupling static contact localization and dynamic action direction, RAAP transfers contact points via dense correspondence and predicts action directions through a retrieval-augmented alignment model that consolidates multiple references with dual-weighted attention. Trained on compact subsets of DROID and HOI4D with as few as tens of samples per task, RAAP achieves consistent performance across unseen objects and categories, and enables zero-shot robotic manipulation in both simulation and the real world. Project website: https://github.com/SEU-VIPGroup/RAAP.

Abstract:
We introduce RoboMorph, an automated approach for generating and optimizing modular robot designs using large language models (LLMs) and evolutionary algorithms. Each robot design is represented by a structured grammar, and we use LLMs to efficiently explore this design space. Traditionally, such exploration is time-consuming and computationally intensive. Using a best-shot prompting strategy combined with reinforcement learning (RL)-based control evaluation, RoboMorph iteratively refines robot designs within an evolutionary feedback loop. Across four terrain types, RoboMorph discovers diverse, terrain-specialized morphologies, including wheeled quadrupeds and hexapods, that match or outperform designs produced by Robogrammar's graph-search method. These results demonstrate that LLMs, when coupled with evolutionary selection, can serve as effective generative operators for automated robot design. Our project page and code are available at https://robomorph.github.io.

Abstract:
Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. This work introduces normalizing flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. This design enables tractable log-likelihood computation, efficient sampling, and the ability to model rich multimodal behaviors. New theoretical guarantees are derived, including explicit KL-divergence bounds for real-valued non-volume preserving (RealNVP) policies and PAC-style sample efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluated across diverse long-horizon tasks in locomotion, ball-dribbling, and multi-step manipulation from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.

Abstract:
In the past decade, manipulator arms with non-traditional architectures once found mainly in space and painting applications have become popular as collaborative robots. Examples include ABB's YuMi and GoFa, Kinova's Link 6, and Fanuc's CRX. These cobots lack closed-form inverse kinematics solutions, making it impossible to unambiguously select one configuration among the 16 (or infinitely many) that correspond to a given end-effector pose, which may create safety risks. Moreover, they exhibit far more singularities than typical manipulators, and most of them are far more complex to describe. Nevertheless, many authors argue these manipulators can provide improved dexterity and a larger workspace. In this paper, we analyze the singularities of ABB's GoFa using Grassmann line geometry and provide straightforward, sufficient (though conservative) conditions for avoiding them. Then, while GoFa can exhibit over a dozen distinct singularities compared to only three (wrist, shoulder, and elbow) for traditional robot arms, we attempt to quantify which architecture actually possesses a greater number of singular and near-singular configurations.

Abstract:
Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel training strategy that enables the reuse of learned features across multiple TAS models. Our method sets a new state of the art on three robotic datasets: REASSEMBLE, (Im)PerfectPour, and JIGSAWS. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.

Abstract:
The ocean is warming and acidifying, increasing the risk of mass mortality events for temperature-sensitive shellfish such as oysters. This motivates the development of long-term monitoring systems. However, human labor is costly and long-duration underwater work is highly hazardous, thus favoring robotic solutions as a safer and more efficient option. To enable underwater robots to make real-time, environment-aware decisions without human intervention, we must equip them with an intelligent brain. This highlights the need for persistent, wide-area, and low-cost benthic monitoring. To this end, we present DREAM, a Vision Language Model (VLM)-guided autonomy framework for long-term underwater exploration and habitat monitoring. The results show that our framework is highly efficient in finding and exploring target objects (e.g., oysters, shipwrecks) without prior location information. In the oyster-monitoring task, our framework takes 31.5% less time than the previous baseline while covering the same number of oysters. Compared to the vanilla VLM, it uses 23% fewer steps while covering 8.88% more oysters. In shipwreck scenes, our framework successfully explores and maps the wreck without collisions, requiring 27.5% fewer steps than the vanilla model and achieving 100% coverage, while the vanilla model achieves 60.23% average coverage in our shipwreck environments.

Abstract:
Robotic manipulation of unfamiliar objects in new environments is challenging due to limited generalisation capabilities. We propose a new skill transfer framework, GIFT (Geometry-Induced Functional Transfer), which enables a robot to transfer complex object manipulation skills and constraints from a single human demonstration. Our approach addresses the challenge of skill acquisition and task execution by deriving geometric representations from demonstrations focusing on object-centric interactions. By leveraging the Functional Maps (FMC) framework, we efficiently map interaction functions between objects and their environments, allowing the robot to replicate task operations across objects of similar topologies or categories, even when they have significantly different shapes. Additionally, our method incorporates screw interpolation (ScLERP) for generating smooth, geometrically-aware robot paths to ensure the transferred skills adhere to the demonstrated task constraints. We validate the effectiveness and adaptability of our approach through extensive experiments, demonstrating successful skill transfer and task execution in diverse real-world environments without requiring additional training.

Abstract:
Soft robotic fingers require precise proprioception of both global deformation and local contact to enable safe and dexterous manipulation. Vision-based methods can reconstruct overall shape but struggle under severe occlusion, while audio-only approaches provide complementary cues but lack spatial detail. We present DeepCoFi, a lightweight multimodal proprioception framework that fuses internal camera images with acoustic spectrograms to jointly recover finger geometry and contact. The framework leverages the complementary strengths of vision and acoustics and employs a FoldingNet-based two-stage decoder that first reconstructs global bending and then refines local contact deformations. To support this integration, we introduce a soft finger design that incorporates an exoskeleton-mounted camera and microphone in a single molding step, preserving compliance while enabling multimodal sensing. Experiments on a comprehensive dataset and real-world grasping tasks show that DeepCoFi achieves robust proprioception under occlusion and generalizes effectively to unseen deformations and contact conditions. Open-source resources and project updates are available at https://ai4ce.github.io/DeepCoFi/.

Abstract:
Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain. In particular, while low-level motion tracking and trajectory-following controllers are mature, whole-body reference-guided methods are difficult to adapt to higher-level command interfaces and diverse task contexts: they require large, high-quality datasets, are brittle across speed and pose regimes, and are sensitive to robot-specific calibration. To address these limitations, we propose the Parameterized Motion Generator (PMG), a real-time motion generator grounded in an analysis of human motion structure that synthesizes reference trajectories using only a compact set of parameterized motion data together with high-dimensional control commands. Combined with an imitation-learning pipeline and an optimization-based sim-to-real motor parameter identification module, we validate the complete approach on our humanoid prototype ZERITH Z1 and show that, within a single integrated system, PMG produces natural, human-like locomotion, responds precisely to high-dimensional control inputs, including VR-based teleoperation, and enables efficient, verifiable sim-to-real transfer. Together, these results establish a practical, experimentally validated pathway toward natural and deployable humanoid control. Website: https://pmg-icra26.github.io/

Abstract:
Humans engage in alternating locomotion patterns in daily life by continuously adjusting step placement. Step placement control in powered prostheses could benefit prosthesis users by supporting speed-adaptation and improving gait stability. This paper uses a data-driven predictive step placement model and a task-space swing controller to achieve human-like step placement patterns on a powered prosthesis platform in simulation. We designed the predictive model to estimate future desired step placement from current user-prosthesis states by analyzing biological gait patterns from a motion-capture dataset. We also present a novel 3D human-prosthesis simulation for evaluating prosthesis controllers with inputs from human walking experiments. In this simulation, we demonstrate our step placement controller with 22 subject models, each with 28 steady-state and 35 non-steady-state walking conditions. Simulation results show that this speed-adaptive control framework achieves human-like step placement and Margin of Stability patterns with respect to walking speed.
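
The abstract does not define the Margin of Stability it reports. As a hedged illustration, the sketch below uses the commonly used one-dimensional definition based on the extrapolated center of mass of an inverted-pendulum model, which may differ from the exact formulation in the paper. The function name, argument names, and sign convention (positive when the extrapolated center of mass lies inside the base-of-support edge) are assumptions made for this example.

```python
import math


def margin_of_stability(com_pos, com_vel, bos_edge, leg_length, g=9.81):
    """1D margin of stability: base-of-support edge minus extrapolated center of mass.

    com_pos, bos_edge in meters along one axis; com_vel in m/s; leg_length in meters.
    """
    omega0 = math.sqrt(g / leg_length)  # inverted-pendulum eigenfrequency
    xcom = com_pos + com_vel / omega0   # extrapolated center of mass
    return bos_edge - xcom


# Example: the faster the center of mass moves toward the edge, the smaller the margin.
slow = margin_of_stability(com_pos=0.10, com_vel=0.2, bos_edge=0.50, leg_length=0.9)
fast = margin_of_stability(com_pos=0.10, com_vel=1.0, bos_edge=0.50, leg_length=0.9)
```

Under this definition the margin depends on center-of-mass velocity as well as position, which is why it is typically examined as a function of walking speed.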

Abstract:
The control of high-dimensional systems, such as soft robots, requires models that faithfully capture complex dynamics while remaining computationally tractable. This work presents a framework that integrates Graph Neural Network (GNN)-based dynamics models with structure-exploiting Model Predictive Control to enable real-time control of high-dimensional systems. By representing the system as a graph with localized interactions, the GNN preserves sparsity, while a tailored condensing algorithm eliminates state variables from the control problem, ensuring efficient computation. The complexity of our condensing algorithm scales linearly with the number of system nodes, and leverages Graphics Processing Unit (GPU) parallelization to achieve real-time performance. The proposed approach is validated in simulation and experimentally on a physical soft robotic trunk. Results show that our method scales to systems with up to 1,000 nodes at 100 Hz in closed-loop, and demonstrates real-time reference tracking on hardware with sub-centimeter accuracy, outperforming baselines by 63.6%. Finally, we show the capability of our method to achieve effective full-body obstacle avoidance.

Abstract:
Accurate visual-inertial simultaneous localization and mapping (VI SLAM) for underwater robots remains a significant challenge due to frequent visual degeneracy and insufficient inertial measurement unit (IMU) motion excitation. In this paper, we present GeVI-SLAM, a gravity-enhanced stereo VI SLAM system designed to address these issues. By leveraging the stereo camera's direct depth estimation ability, we eliminate the need to estimate scale during IMU initialization, enabling stable operation even under low-acceleration dynamics. With precise gravity initialization, we decouple the pitch and roll from the pose estimation and solve a 4 degrees of freedom (DOF) Perspective-n-Point (PnP) problem for pose tracking. This allows the use of a minimal 3-point solver, which significantly reduces computational time to reject outliers within a Random Sample Consensus (RANSAC) framework. We further propose a bias-eliminated 4-DOF PnP estimator with provable consistency, ensuring the relative pose converges to the true value as the feature number increases. To handle dynamic motion, we refine the full 6-DOF pose while jointly estimating the IMU covariance, enabling adaptive weighting of the gravity prior. Extensive experiments on simulated and real-world data demonstrate that GeVI-SLAM achieves higher accuracy and greater stability compared to state-of-the-art methods.

Abstract:
Ensuring safety of vision-based control systems remains a major challenge hindering their deployment in critical settings. Safety filters have gained increased interest as effective tools for ensuring the safety of classical control systems, but their applications in vision-based control settings have so far been limited. Pre-trained vision representations (PVRs) have been shown to be effective perception backbones for control in various robotics domains. In this paper, we are interested in examining their effectiveness when used for designing vision-based safety filters. We use them as backbones for classifiers defining failure sets, for Hamilton-Jacobi (HJ) reachability-based value functions, and for latent world models. We discuss the trade-offs between training from scratch, fine-tuning the PVRs, and freezing the PVRs when training the models they are backbones for. We also evaluate whether one of the PVRs is superior across all tasks, evaluate whether learned world models or Q-functions are better for switching decisions to safe policies, and discuss practical considerations for deploying these PVRs on resource-constrained devices. Our experiments show that compared to training representations from scratch, using PVRs as perception backbones for vision-based safety filters can reduce violation rates by 12.2%, and fine-tuning PVRs to the task can reduce them by 73.7%, while maintaining or improving task performance. Code is available at https://github.com/tabz23/Latent-Safety-Filters.

Abstract:
Diffusion policies have been shown to be very efficient at learning complex, multi-modal behaviors for robotic manipulation. However, errors in generated action sequences can compound over time, potentially leading to failure. Some approaches mitigate this by augmenting datasets with expert demonstrations or by learning predictive world models, which can be computationally expensive. We introduce Performance Predictive Guidance (PPGuide), a lightweight, classifier-based framework that steers a pre-trained diffusion policy away from failure modes at inference time. PPGuide makes use of a novel self-supervised process: it uses attention-based multiple instance learning to automatically estimate which observation-action chunks from the policy's rollouts are relevant to success or failure. We then train a performance predictor on this self-labeled data. During inference, this predictor provides a real-time gradient to guide the policy toward more robust actions. We validated our proposed PPGuide across a diverse set of tasks from the Robomimic and MimicGen benchmarks, demonstrating consistent improvements in performance.

Abstract:
Automated fabric manipulation offers great potential for reducing labor requirements in textile manufacturing and domestic services. Yet, even the basic task of separating a single fabric layer poses substantial challenges for robots. Adhesive-based end-effectors suffer from limited material compatibility and environmental adaptability, while gripper-based designs, which primarily target crease grasping and rely on passive separation, frequently demonstrate unreliability. Current vision and tactile systems fail to detect the fabric separation surface. Given these mechanical and sensing constraints, existing separation solutions lack the ability to adjust the number of layers post-grasping, relying solely on single-attempt success. In this work, we propose a novel tactile-enhanced gripper capable of human-like rubbing motion for reliable cloth separation, which integrates a magnetic sensing system to monitor the separation process. Based on these, we further develop a pipeline to realize rubbing-based separation. Extensive experiments show our gripper achieves a 96.67% separation success rate across 15 fabrics with varying weaving patterns, and the tactile system reaches 87.00% accuracy in sliding surface detection. Our work provides a novel mechanism for fabric layer separation, facilitating subsequent cloth manipulation.

Abstract:
We present a modular, wireless biosignal acquisition platform designed to enable scalable electromyography (EMG) and inertial measurement unit (IMU) sensing for wearable robotics applications. The system supports up to 64 EMG channels and integrates a 9-axis IMU, leveraging a distributed Leader-Follower board architecture. In this work, we demonstrate synchronised acquisition of 32 EMG channels together with IMU motion data in a fully wireless setup. The embedded firmware ensures low-latency, high-fidelity streaming at 1.4 kHz over a 2.4-GHz industrial, scientific and medical (ISM) band link. Benchmarking shows that the platform maintains uniformly strong performance across noise, power, footprint, bandwidth, and scalability, in contrast to existing designs that optimize only a single metric. Experimental demonstrations confirm reliable acquisition of high-density EMG and IMU signals across functional activities, highlighting the device's robustness and wearability. The proposed system provides a compact and flexible solution for intent-aware wearable technologies, with applications in assistive exosuits, rehabilitation, and human-robot interaction.

Abstract:
Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids' complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7% compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baselines. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.

Abstract:
Robotic systems often need to be configured for different dynamic execution environments, hardware, or non-functional properties, such as energy consumption. Configuration options, a.k.a. features, can be used to enable, disable, and calibrate different parts of the system, ranging from whole subsystems over components to lines of code. While configuration mechanisms are abundant in robotics, they are limited in expressiveness, and the configuration files are often distributed over the codebase in different artifacts, challenging the consistent declaration and enforcement of dependencies. In addition, robotic systems require flexibility, since features need to be activated or changed at different times in the lifecycle of the system, which can cause intricate dependencies, especially when they depend on other static or dynamic features. To prevent misconfiguration and undefined behavior, such configuration spaces need to be properly declared and managed. We present FeatX, a model-based configuration technique. It uses and extends feature models, but accounts for the specific needs in robotics. Specifically, it allows declaring features, their dependencies, as well as the allowed binding times and binding modes, while the configurator enforces correct configuration and reconfiguration, considering intricate semantics of such models. We designed the syntax and semantics of the FeatX language and implemented them in the configurator. Our prototype is implemented for ROS2 with a command-line interface (ros2cli). We evaluated it upon realistic (re-)configuration scenarios.

Abstract:
Garment manipulation has attracted increasing attention due to its critical role in home-assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real-world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instructions to execute safe and clean retrieval but also guarantee exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision-language reasoning with visual affordance perception, fully leveraging the high-level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low-level actions. To enhance the VLM's comprehensive awareness of each garment's state within a garment pile, we employ a visual segmentation model (SAM2) to segment the garment pile, giving VLM-based reasoning sufficient visual cues. A mask fine-tuning mechanism is further integrated to address scenarios where the initial segmentation results are suboptimal. In addition, a dual-arm cooperation framework is deployed to address cases involving large or long garments, as well as excessive garment sagging caused by incorrect grasping point determination, both of which are strenuous for a single arm to handle. The effectiveness of our pipeline is consistently demonstrated across diverse tasks and varying scenarios in both real-world and simulation environments. Project page: https://garmentpile2.github.io/.

Abstract:
Active scene reconstruction enables robots/UAVs to autonomously plan trajectories and reconstruct environments without costly manual data acquisition. Unlike passive methods, active reconstruction requires real-time construction of high-confidence occupancy maps for collision-free navigation. Existing approaches rely on depth sensors for occupancy map updates, increasing platform cost and weight. To advance spatial intelligence, we aim for a vision-only monocular solution. However, current monocular scene reconstruction methods operate offline and fail to deliver globally consistent dense depth at the frame rates required for robot/UAV navigation. To bridge this gap, we introduce ActMVS, the first framework for monocular active reconstruction. Our framework integrates a view factor graph construction for informed Multi-View Stereo depth prediction, along with a global depth optimization, to enable the online generation of high-quality, globally consistent dense depth maps. This enables monocular robots/UAVs to maintain reliable occupancy maps for safe trajectory planning during reconstruction. Experiments on Replica datasets demonstrate performance competitive with RGB-D methods. Our code and data are available at https://github.com/TrickyGo/ActMVS.

Abstract:
Attitude estimation methods typically rely on full vector measurements from inertial sensors such as accelerometers and magnetometers. This paper shows that reliable estimation can also be achieved using only scalar measurements, which naturally arise either as components of vector readings or as independent constraints from other sensing modalities. We propose nonlinear deterministic observers on SO(3) that incorporate gyroscope bias compensation and guarantee uniform local exponential stability under suitable observability conditions. A key feature of the framework is its robustness to partial sensing: accurate estimation is maintained even when only a subset of vector components is available. Experimental validation on the BROAD dataset confirms consistent performance across progressively reduced measurement configurations, with estimation errors remaining small even under severe information loss. To the best of our knowledge, this is the first work to establish fundamental observability results showing that two scalar measurements under suitable excitation suffice for attitude estimation, and that three are enough in the static case. These results position scalar-measurement-based observers as a practical and reliable alternative to conventional vector-based approaches.

Abstract:
Following the relative maturity of single-robot Simultaneous Localization And Mapping (SLAM) techniques, works addressing collaborative SLAM have started emerging lately. Driven by the need for robust and scalable multi-robot systems, the community has been targeting Distributed Pose Graph Optimization (DPGO), with current DPGO methods falling into two categories: optimization-based methods providing favorable convergence properties at the expense of excessive communication rounds among participants, and belief-propagation methods that exhibit better scalability and faster computation, albeit risking divergence on loopy and noisy graphs. Inspired by the need for more effective DPGO techniques, this work introduces Contractive Belief Sharing (CBS), a two-stage message-passing algorithm that combines Maximum-A-Posteriori (MAP) optimization with belief propagation under a Hellinger-distance-based damping rule. In this way, CBS ensures fast and reliable convergence while maintaining fully distributed computation and communication with neighbors only. Experiments on benchmarks show that CBS converges substantially faster and is more efficient and scalable than state-of-the-art methods while maintaining high trajectory accuracy, opening up new capabilities for collaborative SLAM.
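
The abstract does not spell out the damping rule, so the following is only a minimal sketch of the ingredients it names. It computes the closed-form Hellinger distance between two univariate Gaussian beliefs and then, as a purely illustrative assumption, uses that distance as the damping factor when blending an old and a new belief in information (natural-parameter) form. The function names and the choice of damping factor are hypothetical and are not taken from CBS.

```python
import math


def hellinger_gaussian(mu1, sigma1, mu2, sigma2):
    """Hellinger distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    # Bhattacharyya coefficient for two univariate Gaussians.
    bc = math.sqrt(2.0 * sigma1 * sigma2 / var_sum) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    # Clamp to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - bc))


def damped_belief(old, new):
    """Blend two (mean, std) beliefs in information form.

    ASSUMPTION (illustrative only): the damping factor equals the Hellinger
    distance, so a large jump between beliefs keeps more of the old belief.
    """
    d = hellinger_gaussian(old[0], old[1], new[0], new[1])
    # Convert each belief to precision and information-vector form.
    prec_old, prec_new = 1.0 / old[1] ** 2, 1.0 / new[1] ** 2
    eta_old, eta_new = old[0] * prec_old, new[0] * prec_new
    prec = (1.0 - d) * prec_new + d * prec_old
    eta = (1.0 - d) * eta_new + d * eta_old
    return eta / prec, math.sqrt(1.0 / prec)
```

Identical beliefs give a distance of zero, so the new belief is accepted unchanged; beliefs that are far apart give a distance close to one, so the update is heavily damped.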

Abstract:
Dexterous in-hand manipulation, especially involving interactions between grasped objects and external environments, remains a formidable challenge in robotics. This study tackles the complexities of in-hand manipulation under extrinsic contact through a representative three-finger handwriting task. We propose a hybrid arm-hand coordination framework that combines reinforcement learning with compliance control, offering both flexibility and robustness. Leveraging tactile sensors embedded in each finger, our tactile-driven estimation model dynamically predicts in-hand object pose and external contact, eliminating the need for fixed contact states. The proposed framework is first validated in simulation, where it successfully executes diverse writing tasks with accurate contact sensing. Sim-to-Real transfer is achieved through systematic calibration of finger joints and tactile sensors, supported by domain randomization. Real-world experiments further demonstrate the system's adaptability to writing tools with varying physical properties (such as radius, length, mass, and friction) while maintaining stability across different trajectories. This work advances robotic manipulation capabilities in unstructured environments.

Abstract:
Monocular visual odometry (VO) is accurate in controlled settings yet drifts sharply under aggressive motion and sensor noise. We offer a fundamental rethinking of VO robustness as a training-schedule problem rather than an architectural challenge, introducing a novel dual-paradigm curriculum learning framework that operates at both trajectory and loss-component levels. (i) A motion-based curriculum orders trajectories by measured motion complexity. (ii) A hierarchical component curriculum adaptively re-weights optical-flow, pose, and rotation losses via Self-Paced and in-training Reinforcement Learning (RL) schedulers. Integrated into an unmodified DPVO baseline, these strategies cut TartanAir ATE by 33% with only 31% extra training wall-time, and reach baseline accuracy 47% faster (Self-Paced). Without fine-tuning, the same models improve zero-shot performance on EuRoC (-13%), TUM-RGBD (-9%), and ICL-NUIM (-32%). We show that explicit difficulty progression or adaptive loss weighting provides a practical, zero-inference-overhead path to robust monocular VO and could extend to other geometric vision tasks.

Abstract:
Path Integral methods have demonstrated remarkable capabilities for solving non-linear stochastic optimal control problems through sampling-based optimization. However, their computational complexity grows linearly with the prediction horizon, limiting long-term reasoning, while constraints are merely enforced through handcrafted penalties. In this work, we propose a unified and efficient framework for enabling long-horizon reasoning and constraint enforcement within Model Predictive Path Integral (MPPI) control. First, we introduce a practical method to incorporate a terminal value function, learned offline via temporal-difference learning, to approximate the long-term cost-to-go. This allows for significantly shorter roll-outs while enabling infinite-horizon reasoning, thereby improving computational efficiency and motion performance. Second, we propose a discount modulation strategy that adjusts the return of sampled trajectories based on constraint violations. This provides a more interpretable and effective mechanism for enforcing constraints compared to traditional cost shaping. Our formulation retains the flexibility and sampling efficiency of MPPI while supporting structured integration of long-term objectives and constraint handling. We validate our approach on both simulated and real-world robotic locomotion tasks, demonstrating improved performance, constraint-awareness, and generalization under reduced computational budgets.
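
As a rough illustration of the first idea, the sketch below shows a generic MPPI weighting step in which each short roll-out is scored by its summed stage costs plus a terminal value estimate standing in for the cost-to-go beyond the horizon. It is a minimal sketch under assumptions, not the authors' formulation: the function names, the scalar control, and the temperature parameter are hypothetical, the terminal values are passed in as plain numbers rather than produced by a learned network, and the discount modulation for constraints is not modeled.

```python
import math


def mppi_weights(stage_costs, terminal_values, temperature=1.0):
    """Softmin weights over K roll-outs.

    stage_costs: K lists of per-step costs along each short roll-out.
    terminal_values: K estimates of the cost-to-go at each roll-out's end
    (assumed to come from a value function learned offline).
    """
    totals = [sum(costs) + v for costs, v in zip(stage_costs, terminal_values)]
    best = min(totals)  # subtract the minimum for numerical stability
    unnorm = [math.exp(-(t - best) / temperature) for t in totals]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]


def mppi_update(nominal, noise, weights):
    """Weighted average of sampled perturbations added to the nominal plan.

    nominal: H scalar controls; noise: K lists of H perturbations.
    """
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, noise))
        for t, u in enumerate(nominal)
    ]
```

A roll-out with low stage costs but a poor terminal value can be weighted the same as one with higher stage costs and a good terminal value, which is how the terminal term injects long-horizon information into a short roll-out.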

Abstract:
This paper introduces a reliable and fast method for scene representation from a single RGB frame, even with human occlusion. Our goal is to enhance vision-based spatial reasoning in dynamic environments where human presence varies over time. Once humans are detected, the method addresses two key challenges: estimating the level of visual obstruction and generating a scene descriptor with humans removed. The first is handled via a novel visual obstruction measure that prevents descriptor generation under high occlusion. The second is addressed by adapting the previously presented bubble descriptor so that surface regions corresponding to detected humans are deformed using a modified spherical interpolation method, eliminating the need for inpainting or reconstruction and enabling rapid computation. We validate our approach through extensive comparisons across multiple datasets, including two new datasets collected using both stationary and mobile robots. Results show comparable representation quality with a 14-44× reduction in computation time.

Abstract:
Robotic manipulators hold significant untapped potential for manufacturing industries, particularly when deployed in multi-robot configurations that can enhance resource utilization, increase throughput, and reduce costs. However, industrial manipulators typically operate in isolated one-robot, one-machine setups, limiting both utilization and scalability. Even mobile robot implementations generally rely on centralized architectures, creating vulnerability to single points of failure and requiring robust communication infrastructure. This paper introduces SMAPPO (Scalable Multi-Agent Proximal Policy Optimization), a scalable input-size invariant multi-agent reinforcement learning model for decentralized multi-robot management in industrial environments. MAPPO (Multi-Agent Proximal Policy Optimization) represents the current state-of-the-art approach. We optimized an existing simulator to handle complex multi-agent reinforcement learning scenarios and designed a new multi-machine tending scenario for evaluation. Our novel observation encoder enables SMAPPO to handle varying numbers of agents, machines, and storage areas with minimal or no retraining. Results demonstrate SMAPPO's superior performance compared to the state-of-the-art MAPPO across multiple conditions: full retraining (up to 61% improvement), curriculum learning (up to 45% increased productivity and up to 49% fewer collisions), zero-shot generalization to significantly different scale scenarios (up to 272% better performance without retraining), and adaptability under extremely low initial training (up to 100% increase in parts delivery).

Abstract:
Traditional cloth manipulation often employs low-DOF grippers to simplify hardware. However, this approach presents significant challenges for algorithms due to the complex and dynamic behavior of fabrics. To address these limitations, we propose a novel approach based on a human-hand-like design that integrates an underactuated grasping mechanism with a suction system. By incorporating multiple single-layer suction principles, the robotic hand achieves greater adaptability and flexibility, enabling it to perform tasks that would typically require high-DOF hands and complex control strategies. This paper outlines the design of the robot hand with a suction system and evaluates its performance in various tasks.

Abstract:
Neural Radiance Fields (NeRF) have significantly advanced photorealistic novel view synthesis. Recently, 3D Gaussian Splatting has emerged as a promising technique with faster training and rendering speeds. However, both methods rely heavily on clear images and precise camera poses, limiting performance under motion blur. To address this, we introduce Event-Informed 3D Deblur Reconstruction with Gaussian Splatting (EiGS), a novel approach leveraging event camera data to enhance 3D Gaussian Splatting, improving sharpness and clarity in scenes affected by motion blur. Our method employs an Adaptive Deviation Estimator to learn Gaussian center shifts as the inverse of complex camera jitter, enabling simulation of motion blur during training. A motion consistency loss ensures global coherence in Gaussian displacements, while Blurriness and Event Integration Losses guide the model toward precise 3D representations. Extensive experiments demonstrate superior sharpness and real-time rendering capabilities compared to existing methods, with ablation studies validating the effectiveness of our components in robust, high-quality reconstruction for challenging motion-blurred environments.

Abstract:
In this work, we propose a novel artificial vector field for robot navigation in n-dimensional path-following tasks, designed to ensure safety and convergence with a smoothed control law. Unlike previous methods based on discontinuous Euclidean distance functions, our approach uses a smooth Euclidean-like function to achieve a continuous control law formulation and a field combination to balance the objectives of avoiding obstacles and following the path. This results in a navigation method that follows a target path while preventing robots from approaching obstacles, which can be used in different applications. We provide formal proofs for safety using barrier function concepts and for path convergence via Lyapunov theory. The methodology is validated through extensive numerical simulations and real-world experiments. These include extensions of the methodology to more complex cases, such as quadcopters and multi-robot systems, underlining the method's advantages in achieving safe and reliable robot navigation.

Abstract:
Machine learning for robot manipulation promises to unlock generalization to novel tasks and environments. But how should we measure the progress of these policies towards generalization? Evaluating and quantifying generalization is the Wild West of modern robotics, with each work proposing and measuring different types of generalization in their own, often difficult to reproduce settings. In this work, our goal is (1) to outline the forms of generalization we believe are important for robot manipulation in a comprehensive and fine-grained manner, and (2) to provide reproducible guidelines for measuring these notions of generalization. We first propose STAR-Gen, a taxonomy of generalization for robot manipulation structured around visual, semantic, and behavioral generalization. Next, we instantiate STAR-Gen with two case studies on real-world benchmarking: one based on open-source models and the Bridge V2 dataset, and another based on the bimanual ALOHA 2 platform that covers more dexterous and longer horizon tasks. Our case studies reveal many interesting insights: for example, we observe that open-source vision-language-action models often struggle with semantic generalization, despite pre-training on internet-scale language datasets. We provide videos and other supplementary material at our website stargen-taxonomy.github.io.

Abstract:
Robotic grippers have been extensively developed to enable stable and efficient object manipulation across diverse applications. While soft grippers offer high adaptability and safety, their performance remains constrained by an inherent trade-off between flexibility and load-bearing capacity. This study addresses this trade-off by proposing a compact weaving gripper that exploits structurally induced interlocking. Additionally, a prediction model is developed to predict and control grasping configurations. The proposed gripper is integrated with a continuum robot, enabling operation in confined environments, and demonstrates applicability across diverse robotic platforms.

Abstract:
Collectively exploring and understanding an environment is an open challenge, particularly in dynamic settings where agents must rely on limited information that may only be intermittently available. In this paper, we focus on how agents can maximize information capture in these contexts. As agents encounter an informant with information to communicate, such as a human collaborator sharing a transient observation, they aggregate this data into a textual description of the environment using an LLM. We show that agents capture environmental information faster when sharing information with other members of the swarm. While strictly ephemeral information may never be fully captured, social learning enables agents to acquire significantly more information, demonstrating the critical importance of information sharing between agents.

Abstract:
In this paper, the design and experimental validation of a novel Spiral Chain Actuator and its application in a three-degree-of-freedom positioning platform are presented. Unlike previous spiral zipper actuators that rely on flexible bands and face structural integrity limitations under tension and moment loads, the proposed design employs rigid chain pieces that interlock during rotation. This rigid architecture enables an improved load-bearing capacity while maintaining the compact, lightweight advantages of spiral actuation. We developed a positioning platform equipped with three Spiral Chain Actuators arranged in a tetrahedral configuration and validated position control through experimental testing. The results demonstrated successful position tracking across all three translational axes, which establishes a foundation for developing a full VTT system and evaluating its performance in the future.

Abstract:
Behavior Trees (BTs) provide a structured and reactive framework for decision-making, commonly used to switch between sub-controllers based on environmental conditions. Reinforcement Learning (RL), on the other hand, can learn near-optimal controllers but sometimes struggles with sparse rewards, safe exploration, and long-horizon credit assignment. Combining BTs with RL has the potential for mutual benefit: a BT design encodes structured domain knowledge that can simplify RL training, while RL enables automatic learning of the controllers within BTs. However, naive integration of BTs and RL can lead to some controllers counteracting other controllers, possibly undoing previously achieved subgoals, thereby degrading the overall performance. To address this, we propose progress constraints, a novel mechanism where feasibility estimators constrain the allowed action set based on theoretical BT convergence results. Empirical evaluations in a 2D proof-of-concept and a high-fidelity warehouse environment demonstrate improved performance, sample efficiency, and constraint satisfaction, compared to prior methods of BT-RL integration.

Abstract:
Modern robots vary significantly in shape, size, and sensor configurations used to perceive and interact with their environments. However, most navigation policies are embodiment-specific: a policy trained on one robot typically fails to generalize to another, even with minor changes in body size or camera viewpoint. As custom hardware becomes increasingly common, there is a growing need for a single policy that generalizes across embodiments, eliminating the need to (re-)train for each specific robot. In this paper, we introduce RING (Robotic Indoor Navigation Generalist), an embodiment-agnostic policy that turns any mobile robot into an effective indoor semantic navigator. Trained entirely in simulation, RING leverages large-scale randomization over robot embodiments to enable robust generalization to many real-world platforms. To support this, we augment the AI2-THOR simulator to instantiate robots with controllable configurations, varying in body size, rotation pivot point, and camera parameters. On the visual object-goal navigation task, RING achieves strong cross-embodiment (XE) generalization: a 72.1% average success rate across 5 simulated embodiments (a 16.7% absolute improvement on the Chores-S benchmark) and 78.9% across 4 real-world platforms, including Stretch RE-1, LoCoBot, and Unitree Go1, matching or even surpassing embodiment-specific policies. We further deploy RING on the RB-Y1 wheeled humanoid in a real-world kitchen environment, showcasing its out-of-the-box potential for mobile manipulation platforms.

Abstract:
Medical ultrasound (US) has been widely used to examine vascular structure in modern clinical practice. However, the traditional US examination often faces challenges related to inter- and intra-operator variation. The robotic ultrasound system (RUSS) appears as a potential solution for such challenges because of its superiority in terms of stability and reproducibility. Given the complex anatomy of human vasculature, it is common for multiple vessels to appear in US images, or for a single vessel to bifurcate into multiple branches, complicating the examination process. To tackle this challenge, this work presents a gaze-guided RUSS for vascular applications. A gaze tracker is integrated to capture the eye movements of the human operator. The extracted gaze signal is utilized to guide the RUSS to follow the correct vessel when it bifurcates. Additionally, a gaze-guided segmentation network is proposed to enhance the segmentation robustness by exploiting the gaze information. However, gaze signals are often noisy, requiring interpretation to accurately discern the operator's true intentions. To this end, this study first proposed a stabilization module to process the raw gaze data. The inferred attention heatmap is then utilized as a region proposal to aid in segmentation and to serve as a trigger signal when the operator needs to adjust the scanning target, such as when a bifurcation appears in the current images. To ensure appropriate contact between the probe and the surface during the scanning, an automatic US confidence-based orientation correction method is developed as well. In the experiments, we demonstrated the efficiency of the proposed gaze-guided segmentation pipeline by comparing it with other segmentation methods. Besides, the performance of the proposed

Abstract:
Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on large-scale AV datasets and a state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.

Abstract:
Ego-vision-based navigation in cluttered environments is crucial for mobile systems, particularly agile quadrotors. While learning-based methods have shown promise recently, head-to-head comparisons with cutting-edge optimization-based approaches are scarce, leaving open the question of where and to what extent they truly excel. In this paper, we introduce FlightBench, the first comprehensive benchmark that implements various learning-based methods for ego-vision-based navigation and evaluates them against mainstream optimization-based baselines using a broad set of performance metrics. More importantly, we develop a suite of criteria to assess scenario difficulty and design test cases that span different levels of difficulty based on these criteria. Our results show that while learning-based methods excel in high-speed flight and faster inference, they struggle with challenging scenarios like sharp corners or view occlusion. Analytical experiments validate the correlation between our difficulty criteria and flight performance. Moreover, we verify the trend in flight performance within real-world environments through full-pipeline and hardware-in-the-loop experiments. We hope this benchmark and these criteria will drive future advancements in learning-based navigation for ego-vision quadrotors. Code and documentation are available at https://github.com/thu-uav/FlightBench.

Abstract:
The multi-contact nonlinear complementarity problem (NCP) is a naturally arising challenge in robotic simulations. Achieving high performance in terms of both accuracy and efficiency remains a significant challenge, particularly in scenarios involving intensive contacts and stiff interactions. In this article, we introduce a new class of multi-contact NCP solvers based on the theory of the Augmented Lagrangian (AL). We detail how the standard derivation of AL in convex optimization can be adapted to handle multi-contact NCP through the iteration of surrogate problem solutions and the subsequent update of primal-dual variables. Specifically, we present two tailored variations of AL for robotic simulations: the Cascaded Newton-based Augmented Lagrangian (CANAL) and the Subsystem-based Alternating Direction Method of Multipliers (SubADMM). We demonstrate how CANAL can manage multi-contact NCP in an accurate and robust manner, while SubADMM offers superior computational speed, scalability, and parallelizability for high degrees-of-freedom multibody systems with numerous contacts. Our results showcase the effectiveness of the proposed solver framework, illustrating its advantages in various robotic manipulation scenarios.

Abstract:
Navigation using a monocular camera is pivotal for autonomous operation on tiny aerial robots due to its balance of versatility, cost and accuracy. In this paper, we introduce MinNav, a navigation stack based on optical flow and its uncertainty to fly through a scene with static and dynamic obstacles and unknown-shaped gaps without any prior knowledge of the scene components and/or their locations/ordering. We further improve success rate by using the activeness of the robot to move around in an exploratory way to find obstacles and navigate. We successfully evaluate and demonstrate the proposed approach in many real-world experiments in various environments with static and dynamic obstacles and unknown-shaped gaps with an overall success rate of 70%. To the best of our knowledge, this is the first solution to tackle all the aforementioned navigation cases without prior knowledge using a monocular camera. Our approach is on par in performance with depth-based methods while requiring orders of magnitude less computation, and can readily run onboard tiny aerial robots.

Abstract:
This study introduces a robotic high-speed scooping technique, an effective solution for rapidly picking thin objects from a hard support surface. High-speed scooping involves dynamic and impactful manipulation using a two-fingered gripper. One digit dynamically penetrates beneath the object lying on a support surface while the other digit helps form a cage and subsequently secures a firm grip. This entire process is executed within a fractional-second time frame. We develop a theoretical model of manipulation for high-speed scooping and implement it using our custom direct-drive gripper designed for enhanced environment-adaptability. Extensive experiments verify the viability of our high-speed scooping approach.

Abstract:
Reconstructing dynamic scenes with multiple interacting humans and objects from sparse-view inputs is a critical yet challenging task, essential for creating high-fidelity digital twins for robotics and VR/AR. This problem, which we term Multi-Human Multi-Object (MHMO) rendering, presents two significant obstacles: achieving view-consistent representations for individual instances under severe mutual occlusion, and explicitly modeling the complex and combinatorial dependencies that arise from their interactions. To overcome these challenges, we propose MM-GS, a novel hierarchical framework built upon 3D Gaussian Splatting. Our method first employs a Per-Instance Multi-View Fusion module to establish a robust and consistent representation for each instance by aggregating visual information across all available views. Subsequently, a Scene-Level Instance Interaction module operates on a global scene graph to reason about relationships between all participants, refining their attributes to capture subtle interaction effects. Extensive experiments on challenging datasets demonstrate that our method significantly outperforms strong baselines, producing state-of-the-art results with high-fidelity details and plausible inter-instance contacts.

Abstract:
Motion planning problems for physically-coupled multi-robot systems in cluttered environments are challenging due to their high dimensionality. Existing methods combining sampling-based planners with trajectory optimization produce suboptimal results and lack theoretical guarantees. We propose Physically-coupled discontinuity-bounded Conflict-Based Search (pc-dbCBS), an anytime kinodynamic motion planner that extends discontinuity-bounded CBS to rigidly-coupled systems. Our approach proposes a tri-level conflict detection and resolution framework that includes the physical coupling between the robots. Moreover, pc-dbCBS alternates iteratively between state space representations, thereby preserving probabilistic completeness and asymptotic optimality while relying only on single-robot motion primitives. Across 25 simulated and six real-world problems involving multirotors carrying a cable-suspended payload and differential-drive robots linked by rigid rods, pc-dbCBS solves up to 90% more instances than a state-of-the-art baseline, planning trajectories up to 60% faster with significantly reduced planning time.

Abstract:
In this work, we propose an evaluation protocol for examining the performance of robotic manipulation policies in cluttered scenes. Contrary to prior works, we approach evaluation from a psychophysical perspective and therefore use a unified measure of clutter that accounts for environmental factors as well as the distractors' quantity, characteristics, and arrangement. Using this measure, we systematically construct evaluation scenarios in both hyper-realistic simulation and the real world and conduct extensive experimentation on manipulation policies, in particular vision-language-action (VLA) models. Our experiments highlight the significant impact of scene clutter, which lowers the performance of the policies by as much as 34%, and show that despite achieving similar average performance across the tasks, different VLA policies have unique vulnerabilities and a relatively low agreement on success scenarios. We further show that our clutter measure is an effective indicator of performance degradation and analyze the impact of distractors in terms of their quantity and occluding influence. Finally, we show that finetuning on enhanced data, although effective, does not equally remedy all negative impacts of clutter on performance.

Abstract:
Embodied Chain-of-Thought (ECoT) reasoning enhances vision-language-action (VLA) models by improving performance and interpretability through intermediate reasoning steps. However, its sequential autoregressive token generation introduces significant inference latency, limiting real-time deployment. We propose Fast ECoT, an inference-time acceleration method that exploits the structured and repetitive nature of ECoT to (1) cache and reuse high-level reasoning across timesteps and (2) parallelise the generation of modular reasoning steps. Additionally, we introduce an asynchronous scheduler that decouples reasoning from action decoding, further boosting responsiveness. Fast ECoT requires no model changes or additional training and easily integrates into existing VLA pipelines. Experiments in both simulation (LIBERO) and real-world robot tasks show up to a 7.5× reduction in latency with comparable or improved task success rate and reasoning faithfulness, bringing ECoT policies closer to practical real-time deployment.

Abstract:
Inertial Odometry (IO) has gained attention in quadrotor applications due to its sole reliance on inertial measurement units (IMUs), which gives it a lightweight design, low cost, and robust performance across diverse environments. However, most existing learning-based inertial odometry systems for quadrotors either use only IMU data or include additional dynamics-related inputs such as thrust, but still lack a principled formulation of the underlying physical model to be learned. This lack of interpretability hampers the model's ability to generalize and often limits its accuracy. In this work, we approach the inertial odometry learning problem from a different perspective. Inspired by the aerodynamics model and IMU measurement model, we identify the key physical quantity, rotor speed measurements, required for inertial odometry and design a transformer-based inertial odometry model. By incorporating rotor speed measurements, the proposed model improves velocity prediction accuracy by 36.9%. Furthermore, the transformer architecture more effectively exploits temporal dependencies for denoising and aerodynamic modeling, yielding an additional 22.4% accuracy gain over previous results. To support evaluation, we also provide a real-world quadrotor flight dataset capturing IMU measurements and rotor speed for high-speed motion. Finally, combined with an uncertainty-aware extended Kalman filter (EKF), our framework is validated across multiple datasets and real-time systems, demonstrating superior accuracy, generalization, and real-time performance. We share the code and data to promote further research (https://github.com/SJTU-ViSYS-team/AI-IO).

Abstract:
Unaligned Scene Change Detection aims to detect scene changes between image pairs captured at different times without assuming viewpoint alignment. To handle viewpoint variations, current methods rely solely on 2D visual cues to establish cross-image correspondence to assist change detection. However, large viewpoint changes can alter visual observations, causing appearance-based matching to drift or fail. Additionally, supervision limited to 2D change masks from small-scale SCD datasets restricts the learning of generalizable multi-view knowledge, making it difficult to reliably identify visual overlaps and handle occlusions. This lack of explicit geometric reasoning represents a critical yet overlooked limitation. In this work, we introduce geometric priors for the first time to address the core challenges of unaligned SCD, for reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection. Building on these priors, we propose a training-free framework that integrates them with the powerful representations of a visual foundation model to enable reliable change detection under viewpoint misalignment. Through extensive evaluation on the PSCD, ChangeSim, and PASLCD datasets, we demonstrate that our approach achieves superior and robust performance. Our code will be released at https://github.com/ZilingLiu/GeoSCD.

Abstract:
Fixed-wing unmanned aerial vehicles (UAVs) offer endurance and efficiency but lack low-speed agility because of their highly coupled dynamics. We present an end-to-end sensing-to-control pipeline that combines bio-inspired hardware instrumentation, physics-informed dynamics learning, and convex control allocation. Measuring oncoming flow on a small airframe is difficult because near-body aerodynamics, propeller slipstream, control-surface actuation, and the presence of gusts distort pressures and make sensor signals input-dependent. To raise the signal-to-noise ratio and gain valuable response time, we take inspiration from the narwhal's tusk: we extend our in-house developed multi-hole probes far ahead into the upstream flow and complement them with sparse yet carefully placed wing pressure sensors for local flow measurement, with systematically introduced gusts of significant magnitude. A data-driven calibration maps the probes' pressure signals to airspeed and flow angles. We then learn a control-affine model of aerodynamic forces with a soft left/right symmetry regularizer that improves identifiability under partial observability and limits confounding between wing pressures and aileron inputs. Desired wrenches (force outputs) are realized by a regularized least-squares optimizer that yields smooth, trimmed actuation. Wind-tunnel studies across multiple airspeeds and gust conditions show that adding wing pressures reduces force-estimation error by 25% to 30%, the proposed model degrades far less under distribution shift (about 12% versus 44% for an unstructured baseline), and force tracking improves with smoother inputs, including a 27% reduction in normal-force RMSE relative to a plain affine model and 34% relative to an unstructured baseline.
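The regularized least-squares allocation step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a linear effectiveness matrix `B` mapping actuator commands to forces and a Tikhonov penalty that pulls the command toward a trim value; the names `allocate`, `B`, `w_des`, `u_trim`, and `lam`, and all numeric values, are hypothetical and are not taken from the paper.

```python
import numpy as np

def allocate(B, w_des, u_trim, lam=0.1):
    """Closed-form minimizer of ||B u - w_des||^2 + lam * ||u - u_trim||^2.

    B, w_des, u_trim and lam are illustrative placeholders, not the paper's model.
    """
    n = B.shape[1]
    A = B.T @ B + lam * np.eye(n)        # normal-equation matrix (positive definite for lam > 0)
    b = B.T @ w_des + lam * u_trim       # right-hand side, biased toward the trim command
    return np.linalg.solve(A, b)

# Toy example: three actuators producing two force components.
B = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.5]])
w_des = np.array([1.0, 0.2])
u_trim = np.zeros(3)
u = allocate(B, w_des, u_trim, lam=1e-3)
print("command:", u, "achieved force:", B @ u)
```

A larger `lam` keeps the command closer to trim at the cost of a larger force-tracking residual, which is one simple way to trade tracking accuracy against actuation effort.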

Abstract:
We present HINT-3D, a human-in-the-loop test-time adaptation framework for 3D semantic segmentation. A few corrective clicks are converted into region masks by a promptable 3D interface (PointSAM). These masks supervise stability-aware updates to a pre-trained backbone at inference. We persist the updates so later scenes start from improved weights, enabling cumulative learning. The wrapper is backbone-agnostic: it requires only logits, a mask-to-index bridge, and access to a small trainable parameter set. We instantiate it on KPConv, RandLA-Net, and Point Transformer v1. On S3DIS Area-5, HINT-3D delivers strong effort-accuracy gains within a scene, consistent zero-click improvements across scenes, and reduced Expected Calibration Error (ECE), while maintaining responsiveness with head-only updates and uncertainty-gated training. We report mIoU versus saved masks, cross-scene transfer, ECE, latency, and class-specific corrections on common indoor failure modes.
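For readers unfamiliar with the Expected Calibration Error reported above, the following sketch shows the standard equal-width binned estimator. The bin count and the function name `expected_calibration_error` are assumptions made for illustration; the abstract does not state how the authors bin their predictions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)   # half-open bins (lo, hi]
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap                       # weight by fraction of samples in bin
    return ece

# Per-point max-softmax confidences and whether each predicted label was right.
conf = [0.95, 0.80, 0.60, 0.99, 0.70]
hit = [1, 1, 0, 1, 0]
print(expected_calibration_error(conf, hit))
```

A lower value means that predicted confidences track empirical accuracy more closely.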

Abstract:
Learning robust visuomotor policies for robotic manipulation remains a challenge in real-world settings, where visual distractors can significantly degrade performance and safety. In this work, we propose an effective and scalable framework, Naturalistic Inpainting for Context Enhancement (NICE). Our method minimizes the out-of-distribution (OOD) gap in imitation learning by increasing visual diversity through the construction of new experiences from existing demonstrations. By utilizing image generative frameworks and large language models, NICE performs three editing operations: object replacement, restyling, and removal of distracting (non-target) objects. These changes preserve spatial relationships without obstructing target objects and maintain action-label consistency. Unlike previous approaches, NICE requires no additional robot data collection, simulator access, or custom model training, making it readily applicable to existing robotic datasets. Using real-world scenes, we showcase the capability of our framework in producing photo-realistic scene enhancement. For downstream tasks, we use NICE data to finetune a vision-language model (VLM) for spatial affordance prediction and a vision-language-action (VLA) policy for object manipulation. Our evaluations show that NICE successfully minimizes OOD gaps, resulting in over 20% improvement in accuracy for affordance prediction in highly cluttered scenes. For manipulation tasks, success rate increases on average by 11% when testing in environments populated with distractors in different quantities. Furthermore, we show that our method improves visual robustness, lowering target confusion by 6%, and enhances safety by reducing collision rate by 7%.

Abstract:
Designing robotic manipulators often requires balancing dexterity, speed, and payload capacity. While traditional serial-link and cable-driven manipulators offer high dexterity, they struggle to concurrently achieve high speed, and often lack the strength and stiffness required for many applications. To address these limitations, we present TriCoSphere, a novel 12-degree-of-freedom, three-fingered manipulator designed to optimize all three attributes. Each 4-DoF finger employs a Coaxial Spherical Parallel Mechanism (CSPM), which positions all actuators at the base. This parallel architecture minimizes finger inertia for high-speed motion and distributes loads across multiple linkages, enhancing payload capacity. We provide a complete kinematic analysis and develop an efficient inverse kinematics solver for precise fingertip control. Experiments demonstrate that each finger can support a 4.1 kg payload and achieve a motion bandwidth of 6.5 Hz. The manipulator's grasp range and dexterity are showcased by handling objects from a 20 mm sphere to a 300 mm acrylic ball, as well as performing complex in-hand manipulation tasks. TriCoSphere is cost-effective, robust, and open-sourced to support future research.

Abstract:
A key challenge in robot manipulation lies in developing policy models with consistent spatial understanding: the ability to reason about 3D geometry, object relations, and robot state. Existing mainstream models take 2D images as input, without performing explicit 3D modeling, and thus lack spatial understanding capabilities as well as 3D and embodiment generalization. To address this, we propose SEM (Spatial Enhanced Manipulation), a diffusion-based policy framework that constructs a unified spatial representation by projecting multi-view image features and joint-centric robot states into a unified 3D space. This spatially aligned representation provides a consistent feature space for the diffusion policy to condition on during action generation. Extensive experiments demonstrate that SEM significantly improves spatial understanding, leading to robust and generalizable manipulation across diverse tasks that outperform existing baselines.

Abstract:
In this paper, we study multi-robot laser tag, a simplified yet practical shooting-game-style task. Classic modular approaches to these tasks face challenges such as limited observability and reliance on depth mapping and inter-robot communication. To overcome these issues, we present an end-to-end visuomotor policy that maps images directly to robot actions. We train a high-performing teacher policy with multi-agent reinforcement learning and distill its knowledge into a vision-based student policy. Technical designs, including a permutation-invariant feature extractor and depth-heatmap input, improve performance over standard architectures. Our policy outperforms classic methods by 16.7% in hitting accuracy and 6% in collision avoidance, and is successfully deployed on real robots. Code will be released publicly.

Abstract:
Classical methods in robot motion planning, such as sampling-based and optimization-based methods, often struggle with scalability towards higher-dimensional state spaces and complex environments. Diffusion models, known for their capability to learn complex, high-dimensional and multi-modal data distributions, provide a promising alternative when applied to motion planning problems and have already shown interesting results. However, most of the current approaches train their model for a single environment, limiting their generalization to environments not seen during training. The techniques that do train a model for multiple environments rely on a specific camera to provide the model with the necessary environmental information and therefore always require that sensor. To effectively adapt to diverse scenarios without the need for retraining, this research proposes Context-Aware Motion Planning Diffusion (CAMPD). CAMPD leverages a classifier-free denoising probabilistic diffusion model, conditioned on sensor-agnostic contextual information. An attention mechanism, integrated in the well-known U-Net architecture, conditions the model on an arbitrary number of contextual parameters. CAMPD is evaluated on a 7-DoF robot manipulator and benchmarked against state-of-the-art approaches on real-world tasks, showing its ability to generalize to unseen environments and generate high-quality, multi-modal trajectories, at a fraction of the time required by existing methods.

Abstract:
Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we study the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: (i) Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and (ii) Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding. We release code and embeddings at our project page.
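The segmentation-by-classification idea can be sketched as follows. The first function computes a cosine-similarity matrix between frame and label embeddings, which is one plausible reading of the frame-to-label matching step. The second applies a simple moving average before the per-frame argmax; this is only a stand-in for temporal consistency and is not the SMTS procedure, whose details the abstract does not give. Function names, the window size, and the toy embeddings are all hypothetical.

```python
import numpy as np

def frame_label_similarity(frame_emb, label_emb):
    """Cosine similarity between every frame embedding and every label embedding."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    l = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return f @ l.T                                   # shape: (num_frames, num_labels)

def smooth_and_segment(sim, window=3):
    """Moving-average smoothing along time, then per-frame argmax (illustrative only)."""
    kernel = np.ones(window) / window
    smoothed = np.stack(
        [np.convolve(sim[:, k], kernel, mode="same") for k in range(sim.shape[1])],
        axis=1,
    )
    return smoothed.argmax(axis=1)

# Toy data: two labels, ten frames, with one noisy frame inside the first segment.
labels = np.eye(2)
frames = np.array([[1, 0]] * 3 + [[0.4, 0.6]] + [[1, 0]] * 2 + [[0, 1]] * 4, dtype=float)
sim = frame_label_similarity(frames, labels)
print("raw     :", sim.argmax(axis=1))
print("smoothed:", smooth_and_segment(sim))
```

In practice the embeddings would come from a VLM's image and text encoders; here they are hand-written vectors so that the example runs without any model.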

Abstract:
Robot assistive navigation (RAN) is critical for enhancing the mobility and independence of the growing population of mobility-impaired individuals. However, existing systems often rely on interfaces that fail to replicate the intuitive and efficient physical communication observed between a person and a human caregiver, limiting their effectiveness. In this paper, we introduce Tac-Nav, a RAN system that leverages a cylindrical tactile skin mounted on a Stretch 3 mobile manipulator to provide a more natural and efficient interface for human navigational intent recognition. To robustly classify the tactile data, we developed the Cylindrical Kernel Support Vector Machine (CK-SVM), an algorithm that explicitly models the sensor's cylindrical geometry and is consequently robust to the natural rotational shifts present in a user's grasp. Comprehensive experiments were conducted to demonstrate the effectiveness of our classification algorithm and the overall system. Results show that CK-SVM achieved superior classification accuracy on both simulated (97.1%) and real-world (90.8%) datasets compared to four baseline models. Furthermore, a pilot study confirmed that users preferred the Tac-Nav tactile interface over conventional joystick and voice-based controls. Code and video are available at: https://sites.google.com/view/tac-nav/home.

Abstract:
Recent advancements in vision-language models (VLMs) offer potential for robot task planning, but challenges remain due to VLMs' tendency to generate incorrect action sequences. To address these limitations, we propose VeriGraph, a novel framework that integrates VLMs for robotic planning while verifying action feasibility. VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement. The system generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.

Abstract:
The increasing proliferation and evolution of robotics and its capabilities are having a significant impact on smart manufacturing and remanufacturing. Within the popular frameworks of Industry 4.0 and Industry 5.0, human-robot collaboration (HRC) has emerged to integrate the best capabilities of humans, such as problem solving, with those of robots, such as precision. These systems are continuing to scale rapidly and are beginning to introduce multi-human multi-robot collaboration (MHMRC) environments, offering a greater degree of productivity and flexibility. Both HRC and MHMRC still face underexplored challenges, such as task allocation and scheduling. In this study, we propose a nature-inspired, objective function-constrained task scheduling optimization solution for multi-human multi-robot collaborative remanufacturing. Different objective functions for the Dingo Optimization Algorithm are developed to investigate how human participants perceive task assignments and interpret the disassembly process under varying objectives in MHMRC. We conduct a real-world multi-human multi-robot collaborative remanufacturing user study in which participants disassemble an end-of-life desktop computer in a shared workspace with two robots to test and validate the proposed approach. Participants are surveyed using the NASA-TLX, along with additional questions. Experimental results demonstrate the effectiveness of the developed approach, and directions for future work are also discussed.

Abstract:
Localization for autonomous vehicles on highways remains under-explored compared to urban roads, and state-of-the-art methods for urban scenes degrade when directly applied to highways. We identify key challenges including environment change under information homogeneity, heavy occlusion, degraded GNSS signals, and stringent downstream requirements on accuracy and latency. We propose a robust localization system to address highway challenges, which uses a dual-likelihood LiDAR front end that decouples 3D geometry structure and 2D road-texture cues to handle environment changes; a Control-EKF further leverages steering and acceleration commands to reduce lag and improve closed-loop behavior. An automated offline mapping and ground-truth pipeline keeps maps fresh at high cadence for optimal localization performance. To catalyze progress, we release a public dataset covering both urban roads and highways while focusing on representative challenging highway clips, totaling 163 km; benchmarking is standardized using product-oriented accuracy metrics and certified ground truth. Compared to Apollo and Autoware, our system performs similarly on urban roads but shows superior robustness on challenging highway scenarios. The system has been validated by over one million kilometers of road testing.

Abstract:
We propose a system-level synthesis (SLS) framework for robust dynamic games with nonlinear dynamics corrupted by state-dependent additive noise, and nonlinear agent-specific and shared constraints. Each agent designs a nominal trajectory and a causal affine error feedback law to minimize its own cost while ensuring that its own constraints and the shared constraints are satisfied, even under worst-case noise realizations. Building on these nonlinear safety certificates, we define the novel notion of a robustly constrained Nash equilibrium (RCNE). We then present an Iterative Best Response (IBR)-based algorithm that iteratively refines the optimal trajectory and controller for each agent until approximate convergence to the RCNE. We evaluated our method in simulations and hardware experiments involving large numbers of robots with high-dimensional nonlinear dynamics, as well as state-dependent dynamics noise. Across all experiment settings, our method generated trajectory rollouts that robustly avoid collisions, while a baseline game-theoretic algorithm for producing open-loop motion plans failed to generate trajectories that satisfy constraints.

Abstract:
Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when training with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based (pi_0) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79× and 2.63× over ACT and pi_0, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models of VistaBot will be open-sourced to facilitate future research on view-robust robotic manipulation.

Abstract:
Direct visual servoing (DVS) uses raw pixel intensities to control robot motion, yielding high accuracy at convergence. However, the associated photometric cost function is highly nonconvex, which leads to a narrow domain of convergence due to local minima. This work addresses that issue by adapting a Gaussian homotopy framework for cost function smoothing from cross-correlation to the sum of squared differences (SSD) objective used in DVS. The result is a spatially varying, transformation-domain kernel that depends on the motion model, producing smoother cost landscapes and enlarging the convergence basin. We first apply the smoothing to an SSD cost, derive its corresponding transformation kernel for the motion model in the camera domain, and then incorporate it into a DVS control law. The method is compared against uniform image domain blurring via Photometric Gaussian Mixtures. Experiments with an eye-in-hand robotic arm setup over three degrees of freedom translation and with different initial poses show that cost smoothing significantly increases the convergence domain while preserving the accuracy of DVS.

Abstract:
It is currently challenging to deploy visuomotor robots at scale due to the potential of anomalous failures degrading performance, causing damage, or endangering human life. Bimanual manipulators are no exception; these robots have vast state spaces comprising high-dimensional images and proprioceptive signals. Explicitly defining failure modes within such state spaces is infeasible. In this work, we overcome these challenges by training a probabilistic, history-informed world model within the compressed latent space of a pretrained vision foundation model (NVIDIA's Cosmos Tokenizer). The model outputs uncertainty estimates alongside its predictions that serve as non-conformity scores within a conformal prediction framework. We use these scores to develop a runtime monitor, correlating periods of high uncertainty with anomalous failures. To test these methods, we use the simulated Push-T environment and the Bimanual Cable Manipulation dataset, the latter of which we introduce in this work. This new dataset features trajectories with multiple synchronized camera views, proprioceptive signals, and annotated failures on a challenging data center maintenance task. We benchmark our methods against baselines from the anomaly detection and out-of-distribution detection literature, and show that our approach considerably outperforms statistical techniques. Furthermore, we show that our approach outperforms the next best learning-based approach by 3.8% in terms of failure detection rate, despite requiring approximately one twentieth of the trainable parameters (due to our use of foundation models for image compression). This level of robustness is a crucial step toward safely deploying manipulator robots at scale in real-world environments where reliability is non-negotiable.
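To make the role of non-conformity scores concrete, the sketch below shows a generic split-conformal threshold applied to scalar scores. It assumes the scores are per-timestep uncertainty values collected on failure-free calibration rollouts. The function names, the calibration data, and the choice of `alpha` are hypothetical; the abstract does not specify the authors' calibration procedure.

```python
import math
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")      # too few calibration points for this alpha: never flag
    return float(scores[k - 1])

def monitor(scores, threshold):
    """Flag timesteps whose non-conformity score exceeds the calibrated threshold."""
    return np.asarray(scores, dtype=float) > threshold

# Synthetic calibration scores standing in for uncertainty on nominal rollouts.
rng = np.random.default_rng(0)
calibration = rng.gamma(shape=2.0, scale=1.0, size=200)
tau = conformal_threshold(calibration, alpha=0.05)
print("threshold:", tau)
print("flags:", monitor([0.5, 1.0, 50.0], tau))
```

Under exchangeability of calibration and test scores, a nominal timestep exceeds this threshold with probability at most `alpha`, which is what makes the flag rate interpretable.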

Abstract:
Navigation in cluttered environments often requires robots to tolerate contact with movable or deformable objects to maintain efficiency. Existing contact-tolerant motion planning (CTMP) methods rely on indirect spatial representations (e.g., prebuilt map, obstacle set), resulting in inaccuracies and a lack of adaptiveness to environmental uncertainties. To address this issue, we propose a direct contact-tolerant (DCT) planner, which integrates vision-language models (VLMs) into direct point perception and navigation and comprises two key components. The first is the VLM point cloud partitioner (VPP), which performs contact-tolerance reasoning in image space using a VLM, caches inference masks, propagates them across frames using odometry, and projects them onto the current scan to generate a contact-aware point cloud. The second is VPP-guided navigation (VGN), which formulates CTMP as a perception-to-control optimization problem under direct contact-aware point cloud constraints, which is then solved by a specialized deep neural network (DNN). We implement DCT in Isaac Sim and on a real car-like robot, demonstrating that DCT achieves robust and efficient navigation in cluttered environments with movable obstacles, outperforming representative baselines across diverse metrics. The code is available at: https://github.com/ChrisLeeUM/DCT.

Abstract:
This work presents a discrete-time trajectory optimization framework that achieves near real-time performance for robotic manipulators. This is achieved by drastically speeding up constraint gradient computations. The approach leverages the analytical and structural properties of B-splines to introduce three key speedups: exploiting gradient sparsity from local control, using a hybrid-analytical method to replace most finite differences with closed-form derivatives, and aggregating constraints per knot-span to reduce the problem size. Validated on a simulated UR5e across 64 tasks in a cluttered workspace, these cumulative speedups reduce computation time by up to 96.3% (a 26.9x speedup) relative to a finite-difference baseline, without compromising trajectory quality, success rate, or fidelity to kinematic, dynamic, and collision constraints.
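The gradient sparsity that comes from local control can be seen directly in the B-spline basis matrix. Because a trajectory sample q(t) is a linear combination of control points, the basis matrix is also the Jacobian of the samples with respect to the control points, and each row has at most degree + 1 nonzero entries. The sketch below uses a plain Cox-de Boor recursion with a hypothetical clamped cubic knot vector; it illustrates the property only and is not the paper's implementation.

```python
import numpy as np

def bspline_basis(i, p, t, knots):
    """Cox-de Boor recursion: value of the i-th degree-p basis function at parameter t."""
    if p == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    d1 = knots[i + p] - knots[i]
    if d1 > 0:
        left = (t - knots[i]) / d1 * bspline_basis(i, p - 1, t, knots)
    right = 0.0
    d2 = knots[i + p + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + p + 1] - t) / d2 * bspline_basis(i + 1, p - 1, t, knots)
    return left + right

def basis_matrix(ts, n_ctrl, p, knots):
    """Rows are sample times, columns are control points; equals d q(t) / d control points."""
    return np.array([[bspline_basis(i, p, t, knots) for i in range(n_ctrl)] for t in ts])

# Clamped cubic spline with 8 control points (illustrative values).
p, n_ctrl = 3, 8
knots = np.array([0, 0, 0, 0, 1, 2, 3, 4, 5, 5, 5, 5], dtype=float)
ts = np.linspace(0.0, 5.0, 20, endpoint=False)
N = basis_matrix(ts, n_ctrl, p, knots)
print("nonzeros per row:", np.count_nonzero(N, axis=1))
```

A constraint evaluated at one sample therefore needs derivatives with respect to only a handful of control points, regardless of how many control points the trajectory has.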

Abstract:
Depth completion from sparse LiDAR points and images is a key perception task for autonomous robots, enabling dense 3D understanding in challenging environments. However, most recent work achieves accuracy gains by greatly enlarging network size, making it unsuitable for real-time deployment on power- and compute-constrained platforms. This paper proposes an ultra-lightweight depth completion framework optimized for embedded systems. Our approach integrates a re-parameterized encoder-decoder with fewer than 5M parameters and a two-stage hybrid distillation strategy. The first stage progressively densifies sparse depth supervision, while the second preserves edge fidelity through a combination of metric and structural losses. A full TensorRT FP16 pipeline further ensures efficient deployment. Extensive experiments on KITTI Depth Completion and NYU-v2 demonstrate that our method achieves competitive accuracy while maintaining high efficiency. On a Jetson Xavier NX, the system runs at over 30 FPS with sub-33 ms latency within a 20 W power envelope, showing strong potential for real-world micro-robotic platforms. We will open-source the code to benefit the community at https://github.com/2463450186Q/JetsonCompletion.git

Abstract:
Accurate localization remains a key challenge in swarm robotics, particularly for self-reconfigurable systems that must identify relative positions to form diverse structures. Most existing approaches rely on external tracking infrastructure or high-cost sensors, which limit scalability and deployment in unstructured environments. In this paper, we propose a novel contact-driven localization method for modular robots that leverages only local communication through binary contact information (whether two robots are physically connected or not). To exploit these contact cues, we introduce a virtual-force framework in which robots iteratively refine their poses, attracting toward dock-connected neighbors and repelling from non-connected ones. The method requires no external infrastructure and relies only on minimal onboard sensing. Simulations show effective localization during the assembly of towers and cantilevers, enabling accurate, scalable, free-form self-assembly.
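The attract-and-repel idea can be sketched with a toy 2D example that estimates positions only (no orientations). It assumes a linear spring pulling connected pairs toward a docking distance and a short-range linear repulsion between non-connected pairs. The force laws, the repulsion range, the gain, and the name `refine_positions` are all assumptions made for illustration; the abstract does not describe the authors' actual force model.

```python
import numpy as np

def refine_positions(pos, contacts, dock_dist=1.0, repel_range=1.5, steps=2000, gain=0.1):
    """Iteratively move position estimates under virtual forces from binary contact data."""
    pos = np.array(pos, dtype=float)
    n = len(pos)
    for _ in range(steps):
        force = np.zeros_like(pos)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                delta = pos[j] - pos[i]
                dist = np.linalg.norm(delta) + 1e-9          # avoid division by zero
                direction = delta / dist
                if contacts[i, j]:
                    # Connected: spring toward the docking distance (attract or push apart).
                    force[i] += (dist - dock_dist) * direction
                elif dist < repel_range:
                    # Not connected but estimated too close: push away.
                    force[i] -= (repel_range - dist) * direction
        pos += gain * force
    return pos

# Three modules in a chain: 0-1 and 1-2 are docked, 0-2 are not.
contacts = np.array([[0, 1, 0],
                     [1, 0, 1],
                     [0, 1, 0]], dtype=bool)
initial = [[0.0, 0.0], [0.3, 0.1], [0.2, 0.5]]
est = refine_positions(initial, contacts)
print(np.round(est, 3))
```

Binary contacts constrain only the relative layout, so the result is defined up to a rigid transform of the whole structure.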

Abstract:
Robots often face manipulation tasks in environments where vision is inadequate due to clutter, occlusions, or poor lighting (for example, reaching a shutoff valve at the back of a sink cabinet or locating a light switch above a crowded shelf). In such settings, robots, much like humans, must rely on contact feedback to distinguish free from occupied space and navigate around obstacles. Many of these environments exhibit strong structural priors (for instance, pipes often span across sink cabinets) that can be exploited to anticipate unseen structure and avoid unnecessary collisions. We present a theoretically complete and empirically efficient framework for manipulation in the blind that integrates contact feedback with structural priors to enable robust operation in unknown environments. The framework comprises three tightly coupled components: (i) a contact detection and localization module that utilizes joint torque sensing with a contact particle filter to detect and localize contacts, (ii) an occupancy estimation module that uses the history of contact observations to build a partial occupancy map of the workspace and extrapolate it into unexplored regions with learned predictors, and (iii) a planning module that accounts for the fact that contact localization estimates and occupancy predictions can be noisy, computing paths that avoid collisions and complete tasks efficiently without eliminating feasible solutions. We evaluate the system in simulation and in the real world on a UR10e manipulator across two domestic tasks: (i) manipulating a valve under a kitchen sink surrounded by pipes and (ii) retrieving a target object from a cluttered shelf. Results show that the framework reliably solves these tasks, achieving up to a 2x reduction in task completion time compared to baselines, with ablations confirming the contribution of each module.

Abstract:
Recent advancements in whole-body control through deep reinforcement learning have enabled humanoid robots to achieve remarkable progress in challenging real-world locomotion skills. However, existing approaches often struggle with flexible transitions between distinct skills, creating safety concerns and practical limitations. To address this challenge, we introduce a hierarchical multi-skill system, Switch, enabling seamless skill transitions at any moment. Our approach comprises three key components: (1) a Skill Graph (SG) that establishes potential cross-skill transitions based on kinematic similarity within multi-skill motion data, (2) a whole-body tracking policy trained on this skill graph through deep reinforcement learning, and (3) an online skill scheduler to drive the tracking policy for robust skill execution and smooth transitions. For skill switching or significant tracking deviations, the scheduler performs online graph search to find the optimal feasible path, which ensures efficient, stable, and real-time execution of diverse locomotion skills. Comprehensive experiments demonstrate that Switch enables humanoid robots to execute agile skill transitions with high success rates while maintaining strong motion imitation performance.

Abstract:
Intelligent surgical robots have the potential to revolutionize clinical practice by enabling more precise and automated surgical procedures. However, the automation of such robots for surgical tasks remains under-explored compared to recent advancements in solving household manipulation tasks. These successes have been largely driven by (1) advanced models, such as transformers and diffusion models, and (2) large-scale data utilization. Aiming to extend these successes to the domain of surgical robotics, we propose a diffusion-based policy learning framework, called Diffusion Stabilizer Policy (ours), which enables training with imperfect, perturbed, or even failed trajectories. Our approach consists of two stages: first, we train the diffusion stabilizer policy using only clean data. Then, the policy is continuously updated using a mixture of clean and perturbed data, with filtering based on the prediction error on actions. Comprehensive experiments conducted in both simulation and the real world demonstrate the superior performance of our method under different types of perturbations. Code will be released upon acceptance.

Abstract:
Deep reinforcement learning (DRL) has achieved remarkable success in robot control. However, DRL with tactile feedback still faces challenges in contact-rich tasks involving visual occlusion or high-speed dynamics. The challenges stem from two primary sources. First, the complexity and diversity of real-world tactile sensors make them difficult to simulate and transfer to reality. Second, existing high-fidelity simulators are often too computationally intensive for large-scale DRL, forcing a trade-off between accuracy and speed. To address this, we design a high-speed tactile simulation model for tactile arrays, enabling efficient, large-scale DRL training on GPUs. We then propose the Contrastive Tactile (ConTact) framework, which leverages contrastive learning to align tactile features for sim-to-real transfer. ConTact employs a dedicated spatiotemporal encoder that explicitly models temporal changes to capture the dynamic features of contact events. We then validate it on two kinds of manipulation tasks, Single and Composite Object Tracking (SOT/COT), which rely solely on tactile information and proprioception. Moreover, policies trained with ConTact from simulation are directly deployed in the real world without finetuning, achieving zero-shot transfer.

Abstract:
Existing navigation methods are primarily designed for specific robot embodiments, limiting their generalizability across diverse robot platforms. In this paper, we introduce X-Nav, a novel framework for end-to-end cross-embodiment navigation where a single unified policy can be deployed across various embodiments for both wheeled and quadrupedal robots. X-Nav consists of two learning stages: 1) multiple expert policies are trained using deep reinforcement learning with privileged observations on a wide range of randomly generated robot embodiments; and 2) a single general policy is distilled from the expert policies via navigation action chunking with transformer (Nav-ACT). The general policy directly maps visual and proprioceptive observations to low-level control commands, enabling generalization to novel robot embodiments. Simulated experiments demonstrated that X-Nav achieved zero-shot transfer to both unseen embodiments and photorealistic environments. A scalability study showed that the performance of X-Nav improves when trained with an increasing number of randomly generated embodiments. An ablation study confirmed the design choices of X-Nav. Furthermore, real-world experiments were conducted to validate the generalizability of X-Nav in real-world environments.

Abstract:
Reliable navigation in unstructured, real-world environments remains a significant challenge for embodied agents, especially when operating across diverse terrains, weather conditions, and sensor configurations. In this paper, we introduce GeNIE (Generalizable Navigation System for In-the-Wild Environments), a robust navigation framework designed for global deployment. GeNIE integrates a generalizable traversability prediction model built on SAM2 with a novel path fusion strategy that enhances planning stability in noisy and ambiguous settings. We deployed GeNIE in the Earth Rover Challenge (ERC) at ICRA 2025, where it was evaluated across six countries spanning three continents. GeNIE took first place and achieved 79% of the maximum possible score, outperforming the second-best team by 17%, and completed the entire competition without a single human intervention. These results set a new benchmark for robust, generalizable outdoor robot navigation. We will release the codebase, pretrained model weights, and newly curated datasets to support future research in real-world navigation.

Abstract:
In future intelligent transportation systems, autonomous cooperative planning (ACP) is a promising technique to increase the effectiveness and security of multi-vehicle interactions. However, existing ACP strategies cannot fully address multiple uncertainties, e.g., perception, planning, and communication uncertainties. To address these, a novel deep reinforcement learning-based autonomous cooperative planning (DRLACP) framework is proposed to tackle various uncertainties in cooperative motion planning schemes. Specifically, the soft actor-critic (SAC) algorithm, implemented with gated recurrent units (GRUs), is adopted to learn the deterministic optimal time-varying actions under imperfect state information caused by planning, communication, and perception uncertainties. In addition, the real-time actions of autonomous vehicles (AVs) are demonstrated via the Car Learning to Act (CARLA) simulation platform. Evaluation results show that the proposed DRLACP learns and performs cooperative planning effectively and outperforms other baseline methods under different scenarios with imperfect AV state information.

Abstract:
Event cameras, as bio-inspired sensors, are asynchronously triggered with high temporal resolution compared to intensity cameras. Recent work has focused on fusing the event measurements with inertial measurements to enable ego-motion estimation in high-speed and HDR environments. However, existing methods predominantly rely on IMU preintegration designed mainly for synchronous sensors and discrete-time frameworks. In this paper, we propose GPO, a continuous-time preintegration framework that can efficiently achieve tightly-coupled fusion of fully asynchronous sensors. Concretely, we model the preintegration as two local Temporal Gaussian Process (TGP) trajectories and leverage a light-weight two-step optimization to infer the continuous preintegration pseudo-measurements. We show that the Jacobians of arbitrary queried states can be naturally propagated using our framework, which enables GPO to be involved in the asynchronous fusion. Our method realizes a linear and constant time cost for optimization and query, respectively. To further validate the proposal, we leverage GPO to design an asynchronous event-inertial odometry and compare it with other asynchronous fusion schemes. Experiments conducted on both public and self-collected datasets demonstrate that the proposed GPO offers significant advantages in terms of accuracy and efficiency, outperforming existing approaches in handling asynchronous sensor fusion. Our method will be made open source to benefit the community.

Abstract:
We aim to solve the problem of temporal-constraint learning from demonstrations to reproduce demonstration-like logic-constrained behaviors. Learning logic constraints is challenging due to the combinatorially large space of possible specifications and the ill-posed nature of non-Markovian constraints. To this end, we introduce inverse logic-constraint learning (ILCL), a novel temporal-constraint learning method formulated as a two-player zero-sum game between 1) a genetic algorithm-based temporal-logic mining (GA-TL-Mining) and 2) logic-constrained reinforcement learning (Logic-CRL). GA-TL-Mining efficiently constructs syntax trees for parameterized truncated linear temporal logic (TLTL) without predefined templates. Subsequently, Logic-CRL finds a policy that maximizes task rewards under the constructed TLTL constraints via a novel constraint redistribution scheme. Our evaluations show ILCL outperforms state-of-the-art baselines in learning and transferring TL constraints on four temporally constrained tasks. We also demonstrate successful transfer to real-world peg-in-shallow-hole tasks.

Abstract:
Compared to conventional cameras, event cameras provide a high dynamic range and low latency, offering greater robustness to rapid motion and challenging lighting conditions. Although the potential of event cameras for visual place recognition (VPR) has been established, the development of robust VPR frameworks under severe illumination changes remains an open research problem. In this paper, we introduce an ensemble-based approach to event camera place recognition that combines sequence-matched results from multiple event-to-frame reconstructions, VPR feature extractors, and temporal resolutions. Unlike previous event-based ensemble methods, limited to temporal resolution, our broader fusion strategy delivers significantly improved robustness under varied lighting conditions (afternoon, sunset, night, etc.), achieving up to 77% relative improvement in Recall@1 across day-night transitions. We evaluate our approach on two long-term driving datasets (with 8 km per traverse) without metric subsampling, thereby preserving natural variations in speed and stop duration that influence event density. We also conduct a comprehensive analysis of key design choices, including binning strategies, reconstruction methods, and feature extractors, to identify the most critical components for robust performance. Additionally, we propose a modification to the standard sequence matching framework that enhances performance at longer sequence lengths. To facilitate future research, we will release our codebase and benchmarking framework.

Abstract:
Manipulating tangled hoses, cables, or ropes can be challenging for both robots and humans. Humans often approach these perceptually demanding tasks by pushing or pulling tangled cables and observing the resulting motions. We follow a similar idea to aid robotic cable manipulation. In this letter, we integrate visual and proprioceptive perception to segment a grasped cable by moving it, even when the robot or the grasped cable sometimes perturbs neighboring cables. We formulate the cable interactive segmentation problem in such a way that our methods do not require robot arm segmentation masks. Furthermore, a novel grasp sampling method can propose new cable grasp points given a partial cable segmentation to improve the segmentation via additional cable-robot interaction. We evaluate the proposed motion correlation (MCor) method on data sequences recorded by our physical robotic setup and show that the method outperforms an earlier motion segmentation (MSeg) baseline.

Abstract:
In on-orbit servicing missions using robotic manipulators, certain challenging scenarios require the use of combined control, i.e., actuation of both the spacecraft and the manipulator, to meet mission requirements. The low frequency of the controller of the spacecraft compared to the manipulator can compromise the stability margin of the combined control. In this paper, we first design a combined control strategy to carefully decouple the high-rate manipulator control from the spacecraft's low-rate control. Second, we design a novel discrete controller accounting for the first-order effects of the servicer's low sampling rate. This is realized by augmenting a classical proportional-derivative (PD) control scheme. The operational bounds of the discrete controller are first benchmarked on a one-DoF system and further investigated for performance using a multi-DoF orbital manipulator. The results shed light on the regions of enhanced performance in terms of stability and impulse utilization as a measure of efficiency. Simulation results and hardware-in-the-loop (HIL) experiments are performed to validate the proposed method.

Abstract:
The inverse kinematics (IK) of serial manipulators admits multiple solutions, making the selection of the desired one potentially challenging depending on the application at hand. For robots with a non-cuspidal architecture, a suitable strategy is to partition the joint space into independent regions, known as Uniqueness Domains (UDs), which are separated by surfaces defined by singular configurations. Within each UD, a single IK-solution branch corresponds to a specific robot posture. In this case, when an assigned task-space trajectory requires the robot to transition between UDs, a singular configuration must be crossed. Despite the practical importance of this issue, the existing literature lacks straightforward techniques for enabling such transitions. This paper proposes a method that facilitates switching between IK-solution branches when a task-space trajectory requires crossing a singularity, ensuring continuity and differentiability of the joint variables. The proposed method is evaluated against competing ones, both in simulation and experimentally, showing significant advantages.

Abstract:
In this paper, we enable new mobility and manipulation modes for wheeled planetary exploration rovers through the use of terramechanics modeling and field experiments. Useful modes of wheel-based soil manipulation and examples of rovers driving with degraded mobility systems are first demonstrated in lunar and Martian analog environments. We show a full-scale rover using its wheels to dig trenches up to 10.6 cm deep, dig holes to estimate soil characteristics, and modify terrain to make it accessible to a smaller robot. We also measure the impact of actuator failure on a rover in lunar simulant. Here, we show that slip doubled on moderate slopes for a damaged drive motor, which would exceed the rover's operational limits for slip, motivating the need for driving strategies that mitigate mobility loss. We then develop an optimization framework which uses a recently developed terramechanics model to automatically generate both open- and closed-loop driving strategies for planetary rovers performing terrain manipulation or operating in a degraded state, with no need for hand tuning of behaviors. Finally, we demonstrate the generated driving strategies for soil manipulation and mobility compensation on a rover in a controlled lab setting, where we show that 1) mobility is maintained while manipulating soil; and 2) mobility is regained while experiencing failure of steer and drive actuators.

Abstract:
We propose a novel framework for data-efficient black-box robot learning under constraints. Our approach integrates probabilistic inference with Lagrangian optimization. Guided by a learned Gaussian process model, the Lagrange multiplier is controlled by the probability that the constraints will be satisfied. This reduces the typical oscillations seen in primal-dual updates and therefore improves both data efficiency and safety during learning. Both synthetic results and robot experiments demonstrate that our method is a scalable and effective solution for constrained robot learning problems.
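
One way to picture a probability-controlled multiplier is the sketch below. It is an assumption-laden illustration, not the authors' update rule: it supposes a constraint of the form g(x) <= 0, a Gaussian predictive distribution for g (as a Gaussian process would give at a query point), and a hypothetical update that raises the multiplier when the predicted satisfaction probability falls below a target level. The names `step` and `target` are invented for this example.

```python
import math


def prob_satisfied(mean, std):
    """P(g <= 0) when g is predicted as a Gaussian with the given mean and std."""
    if std <= 0.0:
        return 1.0 if mean <= 0.0 else 0.0
    return 0.5 * (1.0 + math.erf((0.0 - mean) / (std * math.sqrt(2.0))))


def update_multiplier(lam, mean, std, step=1.0, target=0.95):
    """Hypothetical update: grow lam when satisfaction is unlikely, shrink it otherwise.

    The multiplier is clipped at zero, as is standard for inequality constraints.
    """
    p = prob_satisfied(mean, std)
    return max(0.0, lam + step * (target - p))


# Constraint predicted to sit exactly on the boundary: 50% chance of satisfaction.
lam_uncertain = update_multiplier(0.0, mean=0.0, std=1.0)
# Constraint predicted to be comfortably satisfied: the multiplier stays at zero.
lam_safe = update_multiplier(0.0, mean=-5.0, std=1.0)
print(lam_uncertain, lam_safe)
```

Because the update depends on a smooth probability rather than on the sign of a single noisy constraint measurement, the multiplier changes gradually near the constraint boundary, which is consistent with the reduced oscillation the abstract describes.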

Abstract:
Driving style refers to the behavioral preferences that drivers maintain during driving, shaped by their diverse experiences, habits, and needs, and is typically reflected in varying levels of aggressiveness. If humans choose to use autonomous driving systems, they would expect the driving style of the systems to closely resemble their own habits. However, this is challenging for current industrial autonomous driving systems. To address this, we developed a style-controllable action generation method, STAGE, for driving tasks. Its training process is based on imitation learning, incorporating both style value and latent value action modality encoding. Preference learning is then used to identify the user's driving style as a continuous, monotonic style value. To reduce the cost of human involvement in the preference training process, we also developed a set of rules to compare driving style in data pairs. Then, during inference, the user inputs the style value to control the generated action patterns, dynamically meeting the user's expectations. Using the STAGE method, we verified that the style-controlled action generation results in several typical road scenarios align significantly with human expectations. Furthermore, through comparisons between the STAGE method and various other approaches, we reveal the unique functionalities of STAGE, including its style controllability, style continuity, driving style alignment capability, and driving safety. The code for this work is available at: github.com/CarlDegio/STAGE

Abstract:
Freespace detection in autonomous driving is limited by the lack of explicit geometric modeling, hindering generalization across complex terrains. Existing approaches are predominantly data-driven and neglect the physical structure of drivable surfaces. We propose Terrain Flat (TerrFlat), a physics-driven geometric representation that models road surfaces along three interpretable dimensions: lateral smoothness, longitudinal consistency, and vertical deviation. TerrFlat is constructed through geometric reasoning and projected into pixel-aligned maps via a differentiable projection, ensuring geometric-visual consistency. Building on this representation, we introduce a symmetric feature fusion module (SFFM) to integrate TerrFlat with visual features through bidirectional recalibration, improving semantic discrimination and boundary localization. Together, TerrFlat and SFFM form TerrFlat-Seg, a unified framework for physics-aware freespace perception. Experiments on KITTI-Road, Semantic-KITTI, and ORFD datasets demonstrate consistent improvements over existing baselines. Real-world validation on an automated guided vehicle platform further confirms the robustness of our approach.

Abstract:
Due to their stability and robustness, Stable Dynamical Systems (SDS) have received attention as means of representing motions in learning from demonstration tasks. Designing vector fields that fit complex trajectories while ensuring stability still remains a key challenge; although recent deep learning-based methods have shown progress, their tendency to overfit demonstration trajectories often leads to undesirable behaviors, particularly as tasks deviate from demonstrations. Fundamentally, the only reliable way to address this lack of generalization is to provide supervision in out-of-demonstration regions. Focusing on mimicking and contracting behaviors, we propose a Behavior-Controllable Stable Dynamics Model (BCSDM), a one-parameter family of SDS that allows users to adjust the system's overall behavior depending on user intent. We show how to extend BCSDM to accommodate demonstrations of multiple tasks, and propose a Deep Operator Vector Field (DeepOVec) for memory-efficient encoding of multiple dynamical systems. Experiments on tasks that involve mimicking or contracting behaviors demonstrate the advantages of BCSDMs over existing state-of-the-art methods.

Abstract:
The administration and monitoring of shared workspaces are crucial for seamlessly integrating robots to operate in close interactions with humans. Adaptive, versatile, and reliable robot movements are key to achieving effective and successful human-robot synergy. In situations involving unexpected or unintended collisions, robots must react appropriately to minimize risks to humans while still staying focused on their primary tasks or safely resuming them. Although collision detection and identification algorithms are well-established, more advanced robot reactions beyond basic stop-and-wait reactions have not yet been widely adopted and understood. This limitation highlights the need for more sophisticated robot responses to better handle complex collision scenarios, ensuring both safety and task continuity. This letter introduces a novel complete robotic system that leverages the potential of on-board proximity sensor equipment to seamlessly furnish compatible robot reactions while operating in close interactions. With on-board distributed proximity sensors, the robot gains a continuous close workspace awareness, facilitating a transparent negotiation of potential collisions while executing tasks. The proposed system and framework are validated in a collaborative industrial task scenario composed of sub-tasks allocated to the human and the robot and performed within shared regions of the workspace, demonstrating the efficacy of the approach.

Abstract:
Multi-Agent Path Finding (MAPF) plans are increasingly deployed on real multi-robot fleets, where communication dropouts, actuator faults, and sensor noise routinely cause individual robots to deviate from the planned trajectory. We propose Cascading Velocity Modulation (CVM), a continuous execution controller that maps the temporal margin on each dependency edge into a proportional velocity command and propagates an exponentially attenuated damping signal along the dependency chain. CVM runs a three-step control loop: self-recovery, direct cushioning, and cascade propagation. CVM reduces the makespan by about 25 percent on average compared to a binary baseline, over ten randomized scenarios with 5 to 8 agents, each containing a malfunctioning agent that suffers an unexpected delay. An experiment with eight e-puck2 robots reproduces about a 35 percent reduction under two simultaneously malfunctioning agents.
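
The two mechanisms named in this abstract, a proportional velocity command derived from temporal margin and an exponentially attenuated damping signal, can be sketched as follows. The function names, the reference margin, the saturation at maximum speed, and the decay factor are hypothetical choices for illustration; the abstract does not give CVM's actual control law or parameters.

```python
def velocity_from_margin(margin, margin_ref, v_max):
    """Proportional law: zero speed at non-positive margin, full speed at margin_ref or more."""
    if margin_ref <= 0.0:
        raise ValueError("margin_ref must be positive")
    ratio = min(1.0, max(0.0, margin / margin_ref))
    return v_max * ratio


def cascade_damping(initial_damping, chain_length, decay):
    """Damping seen by the k-th agent along a dependency chain: initial_damping * decay**k."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    return [initial_damping * decay ** k for k in range(chain_length)]


# An agent with half of the reference margin drives at half of its maximum speed.
v = velocity_from_margin(margin=1.0, margin_ref=2.0, v_max=0.1)
# The damping signal halves at each hop away from the delayed agent.
damping = cascade_damping(initial_damping=0.8, chain_length=3, decay=0.5)
print(v, damping)
```

Compared with a binary stop-or-go rule, a graded command of this kind lets followers slow down early instead of halting, which is the intuition behind the makespan reduction reported above.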

Abstract:
Ensuring human safety in collaborative robotics can compromise efficiency because traditional safety measures increase robot cycle time when human interaction is frequent. This paper proposes a safety-aware approach to mitigate efficiency losses without assuming prior knowledge of safety logic. Using a deep-learning model, the robot learns the relationship between system state and safety-induced speed reductions based on execution data. Our framework does not explicitly predict human motions but directly models the interaction effects on robot speed, simplifying implementation and enhancing generalizability to different safety logics. At runtime, the learned model optimizes task selection to minimize cycle time while adhering to safety requirements. Experiments on a pick-and-packaging scenario demonstrated significant reductions in cycle times.

Abstract:
Modular robots can be reconfigured into multiple morphologies, offering high adaptability for diverse tasks. However, reinforcement learning (RL)-based motion generation typically requires separate policy training for each morphology, and end-to-end training often fails to exploit module-specific roles. This paper proposes a hierarchical policy framework that explicitly separates control at the module level, learning reusable motion skills for each module and coordinating them with an upper-level policy for whole-body control. A single lower-level reaching policy, shared across all arm modules, is trained once and reused across morphologies, ensuring that module-specific functions are preserved even as complexity increases. The method is evaluated on the modular robot MoonBot in simulation, demonstrating scalable control of diverse morphologies and improved learning efficiency and interpretability over non-hierarchical baselines.

Abstract:
Predictive models can be particularly helpful for robots to effectively manipulate terrains in construction sites and on extraterrestrial surfaces. However, terrain state representations become extremely high-dimensional, especially when capturing fine-resolution details and when depth is unknown or unbounded. This paper introduces L-GBND, a learning-based approach for terrain dynamics modeling and manipulation, leveraging the Graph-based Neural Dynamics (GBND) framework to represent terrain deformation as motion of a graph of particles. Based on the principle that the moving portion of a terrain is usually localized, our approach builds a large terrain graph (potentially millions of particles) but only identifies a very small active subgraph (hundreds of particles) for predicting the outcomes of robot-terrain interaction. To minimize the size of the active subgraph, we introduce a learning-based approach that identifies a small region of interest (RoI) based on the robot's control inputs and the current scene. We also introduce a novel domain boundary feature encoding that allows GBND to perform accurate dynamics prediction in the RoI interior while avoiding particle penetration through RoI boundaries. Our proposed method is orders of magnitude faster than naive GBND and achieves better overall prediction accuracy. We further evaluated our framework on excavation and shaping tasks on terrain with different granularity.

Abstract:
Despite advancements in self-supervised monocular depth estimation, challenges persist in dynamic scenarios due to the dependence on assumptions about a static world. In this paper, we present Manydepth2 to achieve precise depth estimation for both dynamic objects and static backgrounds while maintaining computational efficiency. To tackle the challenges posed by dynamic content, we incorporate optical flow and coarse monocular depth to create a pseudo-static reference frame. This frame is then utilized to build a motion-aware cost volume in collaboration with the vanilla target frame. Furthermore, to improve the accuracy and robustness of the network architecture, we propose an attention-based depth network that effectively integrates information from feature maps at different resolutions by incorporating both channel and non-local attention mechanisms. Compared to methods with similar computational costs, Manydepth2 achieves a significant reduction of approximately five percent in root-mean-square error for self-supervised monocular depth estimation on the KITTI-2015 dataset. The code can be found at https://github.com/kaichen-z/Manydepth2.

Abstract:
Visual Place Recognition (VPR) in mobile robotics enables robots to localize themselves by recognizing previously visited locations using visual data. While the reliability of VPR methods has been extensively studied under conditions such as changes in illumination, season, weather and viewpoint, the impact of motion blur is relatively unexplored despite its relevance not only in rapid motion scenarios but also in low-light conditions where longer exposure times are necessary. Similarly, the role of image deblurring in enhancing VPR performance under motion blur has received limited attention so far. This paper bridges these gaps by introducing a new benchmark designed to evaluate VPR performance under the influence of motion blur and image deblurring. The benchmark includes three datasets that encompass a wide range of motion blur intensities, providing a comprehensive platform for analysis. Experimental results with several well-established VPR and image deblurring methods provide new insights into the effects of motion blur and the potential improvements achieved through deblurring. Building on these findings, the paper proposes adaptive deblurring strategies for VPR, designed to effectively manage motion blur in dynamic, real-world scenarios.

Abstract:
Accurate state estimation is essential for an autonomous agricultural robot's reliable operation. The effectiveness of state estimation is influenced by a number of factors, such as sensor-fusion algorithms, the environment, and sensor quality. When the robot traverses large-scale scenarios, the distance travelled and high-speed mobility produce drift in the estimation process and should be carefully considered. Moreover, time-varying sensor noise further degrades odometry accuracy; this is especially noticeable over long distances. This research work addresses multi-constraint-based state estimation in large unstructured environments with uneven terrain, with a focus on agricultural applications. Using LiDAR-IMU fusion, our goal is to provide a reliable and accurate localization solution in complex environments like agricultural fields. Furthermore, agricultural environments become more challenging due to the uneven terrain and lack of features. Our research proposes a hybrid framework which combines factor graph-based optimization and adaptive Kalman filtering to address these challenges in complex environments. Performance evaluation is conducted on self-collected datasets from agricultural environments as well as on open-access datasets such as GRACO and KITTI.

Abstract:
In this paper, we address the problem of cooperative manipulation of a cable-suspended load by a team of aerial robots. Unlike classical approaches that rely on centralized controllers, we propose a Distributed Nonlinear Model Predictive Control (DNMPC) framework in which the UAVs exchange a reduced number of variables over a peer-to-peer network. In the proposed method, each robot handles only a small subset of the global optimization problem. The optimal motion computed by the DNMPC loop is then used as a reference for local nonlinear controllers that track the trajectory and compute the robot's actuation inputs. We validate the proposed scheme through both numerical simulations and real-world experiments on the Fly-Crane system: a rigid platform connected to three robots by pairs of cables.

Abstract:
Direct laser deposition, a specialized form of additive manufacturing, shows good potential in numerous high-value applications such as the repair of aeroengine blades. However, the traditional setup for this technique is bulky and not suited for in-situ repair, requiring the costly disassembly of the aeroengine. This letter presents a miniaturized high-repeatability tendon-driven robot that shows good potential for delivering additive manufacturing equipment for in-situ techniques like direct laser deposition. The integrated actuation and ruggedized control unit make the robot portable and suitable for a variety of aeroengines. The design of the robot actuation prevents excessive bending and damage to the fiber optic. Continuum robots have the advantage of flexible and redundant structures but present limited accuracy and repeatability. The optimized kinematics and actuation of the presented robot made it possible to achieve excellent repeatability, with a standard deviation of 0.02 mm on a linear path and below 0.1 mm on a path that simulates the reconstruction of a blade. The robot showed excellent linearity on each segment of the path, with a coefficient of determination to the 3D best-fit line of 0.999, while maintaining the commanded velocity magnitude of the end effector with a standard deviation along the whole path of 0.05 mm/s.

Abstract:
Decentralized Collaborative Simultaneous Localization And Mapping (C-SLAM) techniques often struggle to identify map overlaps due to significant viewpoint variations among robots. Motivated by recent advancements in 3D foundation models, which can register images despite large viewpoint differences, we propose a robust loop closing approach that leverages these models to establish inter-robot measurements. In contrast to resource-intensive methods requiring full 3D reconstruction within a centralized map, our approach integrates foundation models into existing SLAM pipelines, yielding scalable and robust multi-robot mapping. Our contributions include: (1) integrating 3D foundation models to reliably estimate relative poses from monocular image pairs within decentralized C-SLAM; (2) introducing robust outlier mitigation techniques critical to the use of these relative poses; and (3) developing specialized pose graph optimization formulations that efficiently resolve scale ambiguities. We evaluate our method against state-of-the-art approaches, demonstrating improvements in localization and mapping accuracy, alongside significant gains in computational and memory efficiency. These results highlight the potential of our approach for deployment in large-scale multi-robot scenarios.

Abstract:
The rapid advancement of 3D scene understanding techniques presents a significant opportunity for enhancing autonomous driving simulation systems. As these systems are increasingly required to operate in complex, large-scale, and unbounded real-world environments, efficient and high-fidelity 3D reconstruction of common outdoor scenes has become a critical prerequisite for realistic and extensible autonomous driving simulation. 3D Gaussian Splatting has achieved state-of-the-art performance in novel view synthesis, coupled with real-time rendering efficiency. However, large-scale reconstruction for autonomous driving scenarios faces several challenges as scenes grow in complexity: (1) limited views with insufficient pose diversity, (2) inadequate representation of geometric structural details, and (3) complex lighting conditions involving saturation and shadow variations. To cope with these challenges, we propose LSADS-Gaussian, a novel model for large-scale autonomous driving scene reconstruction. The model consists of a Multimodal Gaussian Network (MGN) module, composed of two Gaussian sub-networks designed to perform Gaussian aggregation and optimization from multi-sensor data; a Geometric Representation Guidance (GRG) module that refines and enhances geometric consistency; and a Lighting Enhancement (LE) module that introduces learnable illumination coefficients to maintain illumination consistency. Extensive experiments show that LSADS-Gaussian outperforms the state-of-the-art methods.

Abstract:
Vision-and-Language Navigation in Continuous Environments (VLN-CE), which links language instructions to perception and control in the real world, is a core capability of embodied robots. Recently, large-scale pretrained foundation models have been leveraged as shared priors for perception, reasoning, and action, enabling zero-shot VLN without task-specific training. However, existing zero-shot VLN methods depend on costly perception and passive scene understanding, collapsing control to point-level choices. As a result, they are expensive to deploy, misaligned in action semantics, and short-sighted in planning. To address these issues, we present DreamNav, which focuses on three aspects: (1) to reduce sensory cost, our EgoView Corrector aligns viewpoints and stabilizes egocentric perception; (2) instead of point-level actions, our Trajectory Predictor favors global trajectory-level planning to better align with instruction semantics; and (3) to enable anticipatory and long-horizon planning, we propose an Imagination Predictor to endow the agent with proactive thinking capability. On VLN-CE and real-world tests, DreamNav sets a new zero-shot state-of-the-art (SOTA), outperforming the strongest egocentric baseline with extra information by up to 7.49% and 18.15% in terms of SR and SPL metrics. To our knowledge, this is the first zero-shot VLN method to unify trajectory-level planning and active imagination while using only egocentric inputs.

Abstract:
Autonomous drone racing has attracted increasing interest as a research topic for exploring the limits of agile flight. However, existing studies primarily focus on obstacle-free racetracks, while the perception and dynamic challenges introduced by obstacles remain underexplored, often resulting in low success rates and limited robustness in real-world flight. To this end, we propose a novel vision-based curriculum reinforcement learning framework for training a robust controller capable of addressing unseen obstacles in drone racing. We combine multi-stage curriculum learning, domain randomization, and a multi-scene updating strategy to address the conflicting challenges of obstacle avoidance and gate traversal. Our end-to-end control policy is implemented as a single network, allowing high-speed flight of quadrotors in environments with variable obstacles. Both hardware-in-the-loop and real-world experiments demonstrate that our method achieves faster lap times and higher success rates than existing approaches, effectively advancing drone racing in obstacle-rich environments. The video and code are available at: https://github.com/SJTU-ViSYS-team/CRL-Drone-Racing.

Abstract:
This study presents dynamic scoop-and-flick manipulation, a robotic technique that achieves desired projectile motions of target objects through rapid, non-prehensile physical interactions. The method allows a robot to scoop objects resting on a surface and quickly launch them into projectile trajectories. We formulate a theoretical model of the technique and realize it through a hybrid approach that combines model-based reasoning and data-driven learning. The advantages, namely rapid and accurate pick-and-place with reduced planning complexity, are validated in experiments conducted with a particularly challenging class of objects: low-profile items with small thickness.

Abstract:
To advance 3D reconstruction from static digital replicas towards semantically interactive Living Maps responsive to an agent's queries, we propose ARTEMIS, a system for Active Real-time Textured Environment Meshing with Interactive Semantics. At its core, our Semantic Brush is a methodology comprised of tightly-coupled modules for segmentation, constraint, and refinement that operate in a two-stage, coarse-to-fine pipeline. Initially, its segmentation and constraint modules translate natural language into a semantically-aware mesh, enforcing sharp object boundaries with a unified energy function. Subsequently, its refinement module computes a unified reliability metric from color and depth consistency to guide the joint optimization of the texture map and semantic labels. This holistic process inherently filters unreliable measurements, establishing a complete interactive workflow from language input to real-time highlighting on a high-fidelity textured mesh. We evaluated ARTEMIS on public datasets and in real-world scenarios. The results demonstrate its state-of-the-art accuracy in mesh reconstruction, while simultaneously attaining high fidelity in both texture and semantics. To share our findings and make contributions to the community, our code will be made publicly available.

Abstract:
We provide a novel end-to-end framework for the execution of an assembly operation by two robotic arms, given the digital CAD models of the parts and their desired relative placement in their assembled state. We analyze and demonstrate the advantages of using two robotic arms simultaneously in tight assembly operations, compared to single-arm systems. Our method is implemented both in simulation and on physical robots. It provides theoretical guarantees on execution time and trajectory accuracy, supported by empirical evidence. In particular, we show that coordinated movement of two arms reduces average execution time by more than 50% compared to using a single arm only, produces higher-quality trajectories, and accelerates the search for valid robot placements. Furthermore, we establish bounds on the required dimensions of the robotic cell. Our open-source software and real-life video demonstrations are available on our project page.

Abstract:
We aim to solve the problem of learning user-intended granular skills from multi-granularity demonstrations. Traditional learning-from-demonstration methods typically rely on extensive fine-grained data, interpolation techniques, or dynamics models, which are ineffective at encoding or decoding the diverse granularities inherent in skills. To overcome this limitation, we introduce a novel diffusion-SSM based policy (DiSPo) that leverages a state-space model, Mamba, to learn from diverse coarse demonstrations and generate multi-scale actions. Our proposed step-scaling mechanism in Mamba is a key innovation, enabling memory-efficient learning, flexible granularity adjustment, and robust representation of multi-granularity data. DiSPo outperforms state-of-the-art baselines on coarse-to-fine benchmarks, achieving up to an 81% improvement in success rates while enhancing inference efficiency by generating inexpensive coarse motions where applicable. We validate DiSPo's scalability and effectiveness on real-world manipulation scenarios. Code and videos are available at https://robo-dispo.github.io.

Abstract:
Vision-Language Pre-training models (VLMs) have emerged as a highly promising solution to the generative problem, achieving remarkable success in the field of 2D image generation. However, extending these 2D paradigms to 3D domains is still unexplored due to the scarcity of text-3D pairs and shape ambiguity. To address this challenge, we introduce UM3D, a two-stage pre-training architecture towards unified multimodal 3D shape generation. Our approach first optimizes a Finite Scalar Quantization based Autoencoder (FSQ-AE) to learn a compact yet powerful implicit representation with improved codebook utilization. We then encode sketch features into CLIP's multimodal embedding space to incorporate additional geometric information. This unified space conditions our well-designed Instance-Normalized Glow model (Glow-IN) to model the distribution of 3D shape representations while mitigating distribution shift issues. During inference, UM3D can accept individual text, image, sketch, or combined inputs to generate corresponding 3D shapes. Quantitative and qualitative evaluations confirm our method's effectiveness in synthesizing high-fidelity, input-consistent 3D geometries.
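As a rough illustration of the Finite Scalar Quantization idea named above, the following is a minimal sketch and not the FSQ-AE from the paper. It assumes the common formulation in which each latent channel is bounded with tanh and rounded to a small, odd number of integer levels, so the codebook is implicit. The function names `fsq_quantize` and `fsq_index` and the level counts are hypothetical.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each channel of z to one of levels[i] integer values.

    Assumes every entry of `levels` is odd, so the levels are symmetric
    around zero (e.g. 5 levels -> {-2, -1, 0, 1, 2}).
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2          # half-width of the bounded range
        bounded = half * math.tanh(value)    # squash into (-half, half)
        codes.append(int(round(bounded)))    # snap to the nearest integer level
    return codes


def fsq_index(codes, levels):
    """Map per-channel codes to a single index in the implicit codebook."""
    index = 0
    for code, num_levels in zip(codes, levels):
        shifted = code + (num_levels - 1) // 2   # shift to 0..num_levels-1
        index = index * num_levels + shifted     # mixed-radix encoding
    return index


if __name__ == "__main__":
    levels = [5, 5, 5]                       # hypothetical: 5 levels per channel
    codes = fsq_quantize([0.0, 10.0, -10.0], levels)
    print(codes, fsq_index(codes, levels))
```

Because every combination of per-channel levels is a valid code, the codebook size is the product of the level counts (125 in this toy setting), and no code goes unused by construction. In training, the rounding step would normally be paired with a straight-through gradient estimator, which this sketch omits.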

Abstract:
The YOLO series of models is pivotal for real-time object detection, yet their deployment on resource-constrained edge devices necessitates effective model compression. Post-Training Quantization (PTQ) offers a promising, low-cost solution, but existing methods, primarily designed for classification tasks, often lead to significant performance degradation when applied to YOLO models. In this paper, we systematically analyze the key challenges in quantizing YOLO architectures. We identify three primary obstacles: (1) the high sensitivity of detection tasks to quantization errors, exacerbated by the non-linear IoU metric; (2) the pronounced long-tail distribution of activations, particularly with the SiLU function, which complicates low-bit quantization; and (3) the structural heterogeneity of the multi-scale, multi-task detection head, which renders conventional block-wise quantization strategies ineffective. To address these challenges, we propose a novel framework, Task-Aware and Structure-Knowledge-guided Quantization (TASKQ). Our framework introduces three key components: a sparse quantization strategy to mitigate the impact of long-tailed activations, a Detection-aware Task Regularization (DTR) mechanism that incorporates IoU-based loss to guide parameter fine-tuning, and a Scale-and-Task-Aware Head-wise Quantization (STAHQ) scheme that aligns quantization granularity with the head's functional structure. Extensive experiments on various YOLO models demonstrate that TASKQ significantly outperforms existing PTQ methods, especially in low-bit scenarios, establishing a new state-of-the-art for end-to-end YOLO quantization.

Abstract:
Planar structures, ubiquitous in man-made indoor environments, enable compact and accurate scene abstraction for various downstream tasks. Recent methods distill planar features into learning-based MVS geometries to obtain coherent 3D plane estimation from multi-view inputs. However, the lack of explicit planar instance definitions hinders semantic-geometry alignment, leading to distorted geometry and mismatched semantics. To address this, we propose PIPS, a planar-instance 3D reconstruction method that leverages planar structural priors for both single-view planar segmentation (SGPS module) and multi-view instance association (MVPI module). The planar instance point clouds are regularized by planar distances and then converted into complete planar meshes via an instance-level planar meshing strategy. Extensive experiments on hundreds of indoor scenes demonstrate the superior performance of our method, which is less dependent on annotations and requires no feature optimization. The effectiveness of each component is further verified through comprehensive ablation studies. The project page of PIPS is available at https://pips325.github.io.

Abstract:
LiDAR-Inertial Odometry (LIO) is crucial for robot navigation and autonomous driving. Most existing methods rely on the assumption of a static environment, indiscriminately using all LiDAR measurements for localization. However, LiDAR data acquired in urban scenes often contain dynamic objects such as vehicles and pedestrians, which can adversely affect localization accuracy, particularly when using solid-state LiDAR with a relatively narrow field of view. To address this issue, we propose a novel Real-time and Robust solid-state LiDAR-Inertial Odometry (R2-LIO) framework that removes dynamic objects to improve localization accuracy and robustness. Specifically, we design a dynamic point removal mechanism based on voxel state changes, which removes dynamic points while preserving most static points to effectively reduce interference from dynamic objects. In addition, we introduce a line search mechanism into the Error State Iterated Kalman Filter (ESIKF) to improve the localization accuracy. Experimental results on the challenging YULAN and HeLiPR datasets show that R2-LIO surpasses existing methods, verifying its effectiveness in improving the localization accuracy and robustness.

Abstract:
The emerging field of Vision-Language-Action (VLA) for humanoid robots faces several fundamental challenges, including the high cost of data acquisition, the lack of a standardized benchmark, and the significant gap between simulation and the real world. To overcome these obstacles, we propose RealMirror, a comprehensive, open-source embodied AI VLA platform. RealMirror builds an efficient, low-cost data collection, model training, and inference system that enables end-to-end VLA research without requiring a real robot. To facilitate model evolution and fair comparison, we also introduce a dedicated VLA benchmark for humanoid robots, featuring multiple scenarios, extensive trajectories, and various VLA models. Furthermore, by integrating generative models and 3D Gaussian Splatting to reconstruct realistic environments and robot models, we successfully demonstrate zero-shot Sim2Real transfer, where models trained exclusively on simulation data can perform tasks on a real robot seamlessly, without any fine-tuning. In conclusion, with the unification of these critical components, RealMirror provides a robust framework that significantly accelerates the development of VLA models for humanoid robots. Project page: https://terminators2025.github.io/RealMirror.github.io

Abstract:
We introduce and study the Joint Task Assistance Planning problem which generalizes prior work on optimizing assistance in robotic collaboration. In this setting, two robots operate over predefined roadmaps, each represented as a graph corresponding to its configuration space. One robot, the task robot, must execute a timed mission, while the other, the assistance robot, provides sensor-based support that depends on their spatial relationship. The objective is to compute a path for both robots that maximizes the total duration of assistance given. Solving this problem is challenging due to the combinatorial explosion of possible path combinations together with the temporal nature of the problem (time needs to be accounted for as well). To address this, we propose a nested Branch and Bound framework that efficiently explores the space of robot paths in a hierarchical manner. We empirically evaluate our algorithm and demonstrate a speedup of up to two orders of magnitude when compared to a baseline approach.

Abstract:
Precise control in modern robotic applications is always an open issue due to unknown time-varying disturbances. Existing meta-learning-based approaches require a shared representation of environmental structures, which lack flexibility for realistic non-structural disturbances. Besides, representation error and the loss of model generalizability can lead to heavy degradation in prediction accuracy. This work presents a generalizable disturbance estimation framework that builds on meta-learning and feedback-calibrated online adaptation. By extracting features from a finite time window of past observations, a unified representation that effectively captures general non-structural disturbances can be learned without predefined structural assumptions. The online adaptation process is subsequently calibrated by a state-feedback mechanism to attenuate the learning residual. Theoretical analysis shows that simultaneous convergence of both the online learning error and the disturbance estimation error can be achieved. Through the unified meta-representation, our framework effectively estimates multiple rapidly changing disturbances, as demonstrated by quadrotor flight experiments.

Abstract:
Successfully executing grasping tasks within highly cluttered spaces is still a significant hurdle in robotics, especially in scenarios involving severe target occlusion. To tackle this, we present a novel self-supervised framework driven by deep reinforcement learning that enables robots to acquire push-grasp synergy for reliable manipulation under occlusions. The core contribution of this research is the target switching mechanism that dynamically selects alternative targets when the goal object is severely occluded. Furthermore, we utilize a strategy for selecting actions based on object masks to reduce the action space, thereby improving efficiency and minimizing ineffective operations. Comprehensive evaluations across both simulated and physical environments confirm that our method achieves robust grasping performance under severe or complete occlusions. Notably, the learned policy is readily transferable to physical environments and generalizes effectively to previously unseen objects.

Abstract:
Microassembly is becoming increasingly critical in modern smart manufacturing, placing higher demands on system performance, particularly for in situ and in vivo applications in biomedicine, photonics, sensors, and microrobotics. Non-contact mechanical microassembly has emerged as a promising solution, addressing challenges such as part contamination, limited environmental compatibility, and undesired microscopic forces. This paper presents an ultrasonic-driven non-contact microassembly system capable of performing a representative peg-in-hole assembly. The system primarily consists of an ultrasonic phased transducer array, which serves as a holographic acoustic end-effector, and a microscope that provides visual feedback. Elliptical and semi-elliptical holographic acoustic end-effectors are designed and generated by generative adversarial networks. A homogeneous transformation strategy is employed to generate a pre-planned phase-only hologram (POH) sequence, while a closed-loop control strategy dynamically adjusts the end-effector's pose by integrating real-time visual feedback. Experimental results demonstrate that, with disturbances compensated by the closed-loop strategy, the system can stably adjust the peg's position and orientation to achieve acceptable alignment accuracy. It successfully manipulates high-aspect-ratio objects to complete the peg-in-hole assembly in fluidic and strong magnetic environments. Moreover, the system requires no preset object position or orientation and does not alter the object's form or structure during operation, demonstrating strong potential for broader in situ applications.

Abstract:
Knotting plastic bags is a common task in daily life, yet it is challenging for robots due to the bags' infinite degrees of freedom and complex physical dynamics. Existing methods often struggle to generalize to unseen bag instances or deformations. To address this, we present DexKnot, a framework that combines keypoint affordance with diffusion policy to learn a generalizable bag-knotting policy. Our approach learns a shape-agnostic representation of bags from keypoint correspondence data collected through real-world manual deformation. For an unseen bag configuration, the keypoints can be identified by matching the representation to a reference. These keypoints are then provided to a diffusion transformer, which generates robot actions based on a small number of human demonstrations. DexKnot enables effective policy generalization by reducing the dimensionality of the observation space to a sparse set of keypoints. Experiments show that DexKnot achieves reliable and consistent knotting performance across a variety of previously unseen instances and deformations.

Abstract:
The ability to interpret and reason about spatial relations is fundamental for robotic manipulation tasks. For instance, a robot must understand that "inside" requires different geometric constraints than "touching", and "closer" involves dynamic changes in distance relationships. Despite progress in modeling spatial relations, existing approaches face two critical limitations: they either oversimplify object geometry to points or bounding boxes, or they lack generative capabilities for synthesizing new spatial configurations. This paper introduces a novel generative and probabilistic model that jointly encodes object sizes, distances, and orientations within a unified representation, which captures distance-based, directional, and topological spatial relations while providing explicit uncertainty quantification. The model learns both static and dynamic semantic spatial relations from one or a few visual demonstrations and generalizes to novel contexts and configurations. We evaluate our approach across a set of spatial reasoning and robot manipulation tasks, demonstrating the model's robust performance with varied object shapes, sizes, and spatial arrangements. Videos and source code are available at https://sites.google.com/view/spatial-relations.

Abstract:
Planning in dynamic environments often relies on explicit future observation prediction or value-based estimation, both of which can be brittle or hard to generalize in uncertain settings. We propose a novel model-based reinforcement learning framework that performs trajectory rollout and optimization entirely in a learned latent space. Instead of predicting future observations explicitly, our method evaluates candidate trajectories through multi-step reward prediction and terminal Q-value estimation in the latent domain, enabling robust and generalizable planning in dynamic environments. A policy model generates an initial trajectory in latent space, which is then refined via a smoothness-regularized optimization using Model Predictive Path Integral (MPPI), guided by the predicted cumulative reward and Q-values. This avoids the complexity of future state reconstruction while ensuring dynamically feasible execution. To enhance the model's deployment performance in crowded or interactive scenarios, we further introduce a lightweight social reward that penalizes unsafe overtaking and encourages yielding behavior. Experiments in both simulation and real-world environments show improved success rate, efficiency, and social acceptability compared to strong baselines.
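For readers unfamiliar with Model Predictive Path Integral (MPPI) refinement, the following is a minimal sketch of one generic MPPI update and not the method of the paper above. It assumes a user-supplied cost function that stands in for the negated predicted cumulative reward and terminal Q-value plus a smoothness penalty. The names `mppi_update` and `toy_cost` and all parameter values are hypothetical.

```python
import math
import random


def mppi_update(nominal, cost_fn, num_samples=256, sigma=0.5,
                temperature=1.0, rng=None):
    """One MPPI step: perturb the nominal plan and average by exp(-cost)."""
    rng = rng or random.Random(0)
    noises, costs = [], []
    for _ in range(num_samples):
        noise = [rng.gauss(0.0, sigma) for _ in nominal]
        candidate = [u + e for u, e in zip(nominal, noise)]
        noises.append(noise)
        costs.append(cost_fn(candidate))
    best = min(costs)  # subtracted for numerical stability of the exponent
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    updated = list(nominal)
    for noise, weight in zip(noises, weights):
        for t, e in enumerate(noise):
            updated[t] += (weight / total) * e   # cost-weighted noise average
    return updated


def toy_cost(plan):
    """Hypothetical stand-in for -(predicted reward + Q) plus smoothness."""
    tracking = sum((u - 1.0) ** 2 for u in plan)
    smoothness = sum((b - a) ** 2 for a, b in zip(plan, plan[1:]))
    return tracking + 0.1 * smoothness


if __name__ == "__main__":
    plan = [0.0] * 5                 # initial plan, e.g. from a policy model
    rng = random.Random(0)
    for _ in range(5):
        plan = mppi_update(plan, toy_cost, rng=rng)
    print(plan, toy_cost(plan))
```

Low-cost samples dominate the weighted average, so the plan drifts toward better trajectories without requiring gradients of the cost. The temperature controls how sharply the update concentrates on the best samples.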

Abstract:
Autonomous Driving Systems (ADS) require rigorous and complex testing under diverse conditions to fulfill various demands and purposes of testing tasks, such as occlusion-triggered events, necessitating semantic-level control in scenario generation. Existing methods, reliant on low-level state controls, struggle to represent high-level semantic intents for task-oriented testing. We propose SPATSG, a novel framework for event-driven, semantically aligned traffic scenario generation, leveraging Spatiotemporal Polygon Anchors (SPA) to bridge high-level test requirements and low-level diffusion guidance. SPAs encapsulate critical geometric and temporal patterns of traffic agents, derived from a set of targeted scenarios. During diffusion denoising, SPATSG integrates SPAs via an auxiliary loss to steer sampling toward desired semantics. A dynamic resampling strategy further intensifies guidance and prioritizes promising trajectory candidates progressively to balance exploration and refinement. We evaluate SPATSG on SinD, a Chinese intersection benchmark featuring complex interactions and diverse conflicts. Experiments on occlusion-triggered scenario generation show that SPATSG demonstrates superior semantic controllability, effectively reveals risk events across ADS, and maintains diversity and realism compared to baselines. This work offers a principled, interpretable approach for semantically controllable ADS testing and evaluation.

Abstract:
Medical robotic laser systems require precise positioning and movements to control laser beam paths associated with sensors and other optical systems in many applications (e.g., laser surgery, laser-based tissue diagnosis). While existing robotic laser beam control systems were developed for microscale control to achieve highly precise steering and focusing, they assume a single robot, which is limiting in applications where the beam path must cover large areas and angles (e.g., 360-degree full-view object scanning). To expand imaging flexibility, we propose a novel robot-mirror framework that uses robot-attached mirrors to control a 3D free-space laser beam, referred to as an N-mirror-N-robot system, where N is the number of mirrors and robots. This framework allows for general laser beam planning to trace targets based on geometric constraints of 3D obstacles and fixed orientations and positions with an unlimited number of robot-and-mirror combinations. We develop a prototype for the special case with a single mirror attached to the robot (N = 1). This prototype integrates an RGB-D depth camera for object tracking, a 6-DOF robot-attached mirror, and a laser diode source. We propose a computational framework for system kinematics and calibration. Simulation and real experiments are conducted to track specified paths, markers, phantoms, and real tissue to verify the system feasibility. The results show an average object tracking error of approximately 2.0 mm, which is close to the depth accuracy of the camera. This N = 1 prototype shows promise for the N > 1 case and the potential for general 3D laser planning under arbitrary geometric constraints.
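The geometric core of steering a beam with a robot-held mirror is the law of specular reflection, r = d - 2(d·n)n for unit beam direction d and unit mirror normal n. The sketch below illustrates that relation only and is not the kinematics or calibration framework of the paper above. The helper names `reflect` and `mirror_normal_for` are hypothetical.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def reflect(direction, normal):
    """Reflect a beam direction off a mirror with the given normal."""
    d = normalize(direction)
    n = normalize(normal)
    dot = sum(a * b for a, b in zip(d, n))
    return tuple(a - 2.0 * dot * b for a, b in zip(d, n))


def mirror_normal_for(incoming, outgoing):
    """Mirror normal that sends the incoming direction to the outgoing one.

    Undefined when both directions coincide, since no reflection is needed.
    """
    d_in = normalize(incoming)
    d_out = normalize(outgoing)
    return normalize(tuple(o - i for i, o in zip(d_in, d_out)))


if __name__ == "__main__":
    n = mirror_normal_for((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
    print(n, reflect((1.0, 0.0, 0.0), n))
```

Given a fixed laser source and a target point, the required outgoing direction fixes the mirror normal through `mirror_normal_for`. A real system would additionally have to solve for the mirror position and a reachable 6-DOF robot pose.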

Abstract:
Neural implicit representations have demonstrated excellent performance in Simultaneous Localization and Mapping (SLAM) by virtue of their ability to jointly model geometry, color and camera poses. Recent studies have attempted to integrate scene semantic information into implicit representation frameworks, significantly improving environmental understanding. Nevertheless, most existing methods rely on direct semantic coloring or coarse fusion of other modalities, resulting in underutilized semantic cues. This further causes problems such as blurred small objects, loss of fine structures and unclear regional boundaries. Additionally, redundant features introduced in the process reduce system efficiency. To address these challenges, we propose MTE-SLAM, an accurate and efficient end-to-end neural RGB-D semantic SLAM framework that synergizes Multi-Tier Feature Fusion (MTFF) and Feature Redundancy Suppressor (FRS). MTFF progressively fuses semantic features at global and local scales. The global context enhancement module captures scene-level semantic correlations, while the local continuity enhancement module refines neighborhood consistency, generating detailed and coherent semantic maps. FRS adaptively filters redundant features based on their importance and temporal variation, reducing parameters and computation while preserving representational power to accelerate training and inference. Comprehensive evaluations on Replica and ScanNet demonstrate that MTE-SLAM achieves centimeter-level reconstruction, state-of-the-art tracking and semantic accuracy, and runs up to four times faster than existing semantic SLAM systems.

Abstract:
Bimanual robot manipulators can achieve impressive dexterity, but typically rely on two full six- or seven-degree-of-freedom arms so that paired grippers can coordinate effectively. This traditional framework increases system complexity and footprint while only exploiting a fraction of the overall workspace for dexterous interaction. We introduce the OURS (OURSLONG), a compact system in which two reduced-mobility arms (3+ DOF each) are coupled into a kinematic chain that preserves full relative positioning between grippers and enables the entirety of the system's workspace to be used for dexterity. To guide our design, we formulate a kinematic dexterity metric that enlarges the dexterous workspace while keeping the mechanism lightweight and wearable. The resulting system supports two complementary modes: (i) wearable kinesthetic data collection with self-tracked gripper poses, and (ii) deployment on a standard robot arm, extending dexterity across its entire workspace. We present kinematic analysis and design optimization methods for maximizing dexterous range, and demonstrate an end-to-end pipeline in which wearable demonstrations train imitation learning policies that perform robust, real-world bimanual manipulation.

Abstract:
Viewpoint shifts significantly change how gestures and facial expressions appear and frequently cause occlusions, posing a critical challenge for robust Sign Language Recognition (SLR). To address this challenge, we exploit the spatial flexibility and computational efficiency of skeleton data and propose ViSL, a dual-stream contrastive learning framework to learn View-invariant representations for Sign Language understanding. Specifically, the primary and lifting streams share a common visual feature extractor with different types of input: the primary stream (P-Stream) directly processes frontal-view skeleton data, and the lifting stream (L-Stream) synthesizes skeleton data from arbitrary viewpoints based on 3D estimations. We further propose a view-invariant contrastive loss to align representations across both viewpoints and streams. Experimental results on the challenging cross-view setting of MM-WLAuslan demonstrate that ViSL achieves substantial performance improvements, highlighting its potential for robust real-world SLR applications.

Abstract:
As teamed robots increasingly share public spaces with humans, the ability to co-adapt, that is, to mutually adjust behavior in response to one another, becomes essential for safe, efficient, and socially acceptable operation. This paper introduces a socially co-adaptive framework for heterogeneous multi-robot systems (HMRS) that enables real-time adaptation to human behavior while preserving cooperative task execution. Our approach fuses large language models for natural language understanding with model-agnostic meta-learning to allow robots to rapidly generalize across diverse social contexts. We implement and validate the system using a real-world HMRS composed of robots with different roles (workers, a station, and a social robot) interacting with 44 human participants under induced behavioral states (relaxed vs. nervous). Results reveal significant behavioral adaptation: the system dynamically shifts between egoistic and altruistic strategies, improving crowd guidance success by 21%. It also reduces human cognitive load (specifically, physical demands by 39% and temporal demands by 39%) while increasing trust by 16% and perceived anthropomorphism by 21%. This work demonstrates the feasibility of human-robot co-adaptation at scale, laying the groundwork for socially intelligent robotic systems capable of thriving in complex, human-centered environments.

Abstract:
Automated manipulation of nanoliter-scale implantable microdevices (IMDs) typically relies on complex, custom-built robotic setups that are difficult to reproduce and require extensive manual calibration. To address this challenge, this paper proposes an easily deployable and highly reproducible vision-servoed manipulation system for IMDs. Based on standard commercial off-the-shelf devices, the proposed platform is hardware-agnostic and eliminates the need for tedious manual calibration. The automated workflow seamlessly integrates coarse positioning, auto-focus, and marker-aided centering to achieve robust precision. The system is validated using a sub-nanoliter IMD, the microscale optoelectronic tetherless electrode (MOTE). Experimental results demonstrate that the proposed framework requires minimal manual intervention and significantly reduces operating time by 47.2% compared to manual injection performed by an experienced user. These results pave the way for economical, high-throughput, and automated IMD-based in vitro and in vivo experiments, and beyond.

Abstract:
LiDAR-based semantic segmentation is a key component for autonomous mobile robots, yet large-scale annotation of LiDAR point clouds is prohibitively expensive and time-consuming. Although simulators can provide labeled synthetic data, models trained on synthetic data often underperform on real-world data due to a data-level domain gap. To address this issue, we propose DRUM, a novel Sim2Real translation framework. We leverage a diffusion model pre-trained on unlabeled real-world data as a generative prior and translate synthetic data by reproducing two key measurement characteristics: reflectance intensity and raydrop noise. To improve sample fidelity, we introduce a raydrop-aware masked guidance mechanism that selectively enforces consistency with the input synthetic data while preserving realistic raydrop noise induced by the diffusion prior. Experimental results demonstrate that DRUM consistently improves Sim2Real performance across multiple representations of LiDAR data. The project page is available at https://miya-tomoya.github.io/drum.

Abstract:
Reasoning is essential for purposeful action, yet most robotic foundation models map perception and instructions directly to control, limiting adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), which integrate perception, planning, and control through a structured three-stage pipeline. Our model, Molmoact, encodes observations and instructions into depth perception tokens, generates 2D spatial plans, and predicts fine-grained actions, enabling explainable and steerable behavior. Molmoact-7B-D achieves 70.5% zero-shot accuracy on SimplerEnv Visual Matching (surpassing Pi-0 and GR00T N1.5), 86.6% average success on LIBERO, and real-world fine-tuning gains of +10% (single-arm) and +22.7% (bimanual) over Pi-0-FAST. It further improves out-of-distribution generalization by +23.3% and ranks highest in human-preference evaluations for open-instruction following and trajectory steering. We also release the Molmoact Dataset, a dataset of 10k diverse robot trajectories that yields an average +5.5% performance boost when used for training. Together with open model weights and code, this establishes Molmoact as a state-of-the-art robotic foundation model and an open blueprint for building ARMs that transform perception into grounded, purposeful action. Further experimental details and results with the Molmoact Dataset and human-preference evaluations are included in the supplementary video.

Abstract:
Humanoid robots are designed to perform diverse loco-manipulation tasks. However, they face challenges due to their high-dimensional and unstable dynamics, as well as the complex contact-rich nature of the tasks. Model-based optimal control methods offer flexibility to define precise motion but are limited by high computational complexity and the need for accurate contact sensing. On the other hand, reinforcement learning (RL) handles high-dimensional spaces with strong robustness but suffers from inefficient learning, unnatural motion, and sim-to-real gaps. To address these challenges, we introduce Opt2Skill, an end-to-end pipeline that combines model-based trajectory optimization with RL to achieve robust whole-body loco-manipulation. Opt2Skill generates dynamically feasible and contact-consistent reference motions for the Digit humanoid robot using differential dynamic programming (DDP) and trains RL policies to track these optimal trajectories. Our results demonstrate that Opt2Skill outperforms baselines that rely on human demonstrations and inverse kinematics-based references, both in motion tracking and task success rates. Furthermore, we show that incorporating trajectories with torque information improves contact force tracking in contact-involved tasks, such as wiping a table. We have successfully transferred our approach to real-world applications.

Abstract:
We introduce Select2Plan (S2P), a novel training-free framework for high-level robot planning that leverages off-the-shelf Vision-Language Models (VLMs) for autonomous navigation. Unlike most learning-based approaches that require extensive task-specific training and large-scale data collection, S2P overcomes the need for fine-tuning by adapting inputs to align with the VLM's pretraining data. Our method achieves this through a combination of structured Visual Question Answering (VQA) to ground action selection on the image, and In-Context Learning (ICL) to exploit knowledge drawn from relevant examples from a memory bank of (visually) annotated data, which can include diverse, in-the-wild sources. We demonstrate S2P's flexibility by evaluating it in both First-Person View (FPV) and Third-Person View (TPV) navigation. S2P improves the performance of a baseline VLM by 40% in TPV and surpasses end-to-end trained models by approximately 24% in FPV when tasked with navigating towards unseen objects in novel scenes. These results highlight the adaptability, simplicity, and effectiveness of our training-free approach, demonstrating that the use of pre-trained VLMs with structured memory retrieval enables robust high-level robot planning without costly task-specific training. Our experiments also show that retrieving samples from heterogeneous data sources, including online videos of different robots or humans walking, is highly beneficial for navigation. Notably, our method effectively generalizes to novel scenarios, requiring only a handful of demonstrations. Project Page: lambdavi.github.io/select2plan

Abstract:
Autonomous Surface Vehicles (ASVs) play a crucial role in maritime operations, yet their navigation in shallow-water environments remains challenging due to dynamic disturbances and depth constraints. Traditional navigation strategies struggle with limited sensor information, making safe and efficient navigation difficult. In this paper, we propose a reinforcement learning (RL) framework for ASV navigation under depth constraints, where the vehicle must reach a target while avoiding unsafe areas with only a single depth measurement per timestep from a downward-facing Single Beam Echosounder (SBES). To enhance environmental awareness, we integrate Gaussian Process (GP) regression into the RL framework, enabling the agent to progressively estimate a bathymetric depth map from sparse sonar readings. This approach improves decision-making by providing a richer representation of the environment. Furthermore, we demonstrate effective sim-to-real transfer, ensuring that policies generalize well to real-world aquatic conditions. Experimental results validate our method's capability to improve ASV navigation performance while maintaining safety in challenging shallow-water environments. The code is available at https://github.com/Isla-lab/depth-constrained-aquatic-navigation
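
The idea of estimating a depth map from sparse single-beam readings can be illustrated with a minimal GP regression sketch. This is not the authors' implementation: the squared-exponential kernel, the hyperparameters, and the function names `rbf_kernel` and `gp_posterior` are assumptions chosen for illustration, and the GP is applied to depth readings after subtracting their mean.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=2.0, variance=1.0):
    """Squared-exponential kernel between two sets of 2D positions."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise_std=0.05,
                 length_scale=2.0, variance=1.0):
    """Posterior mean and variance of a zero-mean GP at the query positions.

    x_train: (n, 2) positions where a depth was measured (hypothetical data)
    y_train: (n,) depth readings with their mean already subtracted
    x_query: (m, 2) positions where the depth map is estimated
    """
    k_tt = rbf_kernel(x_train, x_train, length_scale, variance)
    k_tt += noise_std**2 * np.eye(len(x_train))      # sensor noise on the diagonal
    k_qt = rbf_kernel(x_query, x_train, length_scale, variance)
    chol = np.linalg.cholesky(k_tt)                  # k_tt = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    var = variance - np.sum(v**2, axis=0)            # prior variance minus explained part
    return mean, var
```

In such a scheme the posterior variance is large wherever the vehicle has not yet measured, so an agent could treat high-variance cells as unknown rather than safe; how the paper feeds the map to the RL policy is not stated in the abstract.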

Abstract:
Taking into account future risk is essential for an autonomously operating robot to find online not only the best but also a safe action to execute. In this paper, we build upon the recently introduced formulation of probabilistic belief-dependent constraints. In our methodology, safety can be materialized with any general belief-dependent operator, which we call payoff. We present an anytime approach employing the Monte Carlo Tree Search (MCTS) method in continuous domains in terms of states, actions, and observations, with general belief-dependent reward and payoff operators. Unlike previous approaches, our method ensures safety anytime with respect to the currently expanded search tree without relying on the convergence of the search. We prove convergence in probability with an exponential rate of a version of our algorithms and study the proposed techniques via extensive simulations. Even with a tiny number of tree queries, the best action found by our approach is much safer than the baseline. Moreover, our approach consistently yields a better action than the baseline in terms of the objective function. This is because we revise the values and statistics maintained in the search tree and r

Abstract:
This paper presents Enhanced Autonomous Navigation, or ENav, the autonomous driving algorithm of NASA's Mars rover Perseverance. A unique challenge for the autonomous driving of Perseverance is to meet strict safety and performance requirements in a highly uncertain environment with only a single-core CPU with extremely limited computing resources. ENav overcame this challenge with a novel two-stage path selection approach that balances path optimality and computational efficiency, combined with a unique collision checking algorithm that conservatively approximates computationally expensive kinematic settling. In addition, ENav provides robustness against slip by expanding the bounding boxes for wheels used by the collision check. These new features, together with FPGA-accelerated vision processing, enabled Perseverance to autonomously drive on substantially more rock-dense terrains and increased the average daily driving distance by an order of magnitude compared to its predecessors, the Curiosity, Spirit, and Opportunity rovers. Perseverance has set several new records for autonomous driving on Mars, breaking those previously held by the Opportunity rover. As of the 1312th Martian day since landing, or 28 October 2024 on the Earth calendar, ~90% of the 32.1 km of driving has used ENav to evaluate the terrain. This paper provides detailed documentation of the ENav algorithm, as well as its implementation, testing, deployment, and driving results on Mars.

Abstract:
To achieve general-purpose dexterous manipulation, robots must rapidly devise and execute contact-rich behaviors. Existing model-based controllers are incapable of globally optimizing in real-time over the exponential number of possible contact sequences. Instead, recent progress in contact-implicit control has leveraged simpler models that, while still hybrid, make local approximations. However, the use of local models inherently limits the controller to only exploit nearby interactions, potentially requiring intervention to richly explore the space of possible contacts. We present a novel approach which leverages the strengths of local complementarity-based control in combination with low-dimensional, but global, sampling of possible end-effector locations. Our key insight is to consider a contact-free stage preceding a contact-rich stage at every control loop. Our algorithm, in parallel, samples end effector locations to which the contact-free stage can move the robot, then considers the cost predicted by contact-rich MPC local to each sampled location. The result is a globally-informed, contact-implicit controller capable of real-time dexterous manipulation. We demonstrate our controller on precise, non-prehensile manipulation of non-convex objects using a Franka Panda arm.

Abstract:
Skin deformation haptic devices worn on the finger pad provide realistic touch feedback during interactions with virtual objects. Two primary challenges in creating such devices are: (1) making a multi-degree-of-freedom device that is small and lightweight so it does not encumber the wearer and (2) providing accurate control of forces displayed to the finger pad. This work presents a 4-degree-of-freedom (DoF) finger pad haptic device, called Fourigami, that addresses these challenges. We address the first challenge using origami manufacturing methods and pneumatic actuation to fabricate a 25 g prototype that displays normal, shear, and twist and can be easily worn on the finger pad. We address the second challenge using a low-profile, 6-DoF force/torque sensor to control forces displayed to the finger. Fourigami has a bandwidth ranging from 2-4 Hz depending on direction, and when acting on a human finger, it exerts forces ranging from ±1.0 N in shear, 4.2 N in normal, and ±4.2 N·mm of twist. Finally, we demonstrate the device's efficacy when rendering haptic feedback to a user tracking a sinusoidal trajectory and a trajectory representing interactions with a virtual environment.

Abstract:
Mountain search and rescue is a form of emergency response to assist people in austere environments (e.g., extreme terrain, poor weather). Volunteer mountain search and rescue teams in the United States have begun adopting consumer-grade unmanned aerial vehicles to assist with a variety of tasks (e.g., search, resource delivery); however, these tools lack the autonomy necessary for the mountain search and rescue teams to fully realize their potential for wide area, aerial search. The unique and tight constraints of mountain search and rescue (e.g., in situ computation, sensor limitations) greatly limit the applicability of recent robotics research. A two-step coverage path planning algorithm that leverages existing viewpoint and path planning approaches was developed to meet the unique needs of mountain search and rescue. Viewpoints were sampled to meet a minimum coverage ratio and assigned priority from a search probability map. The path planning problem was formulated as a clustered traveling salesperson problem, which is solved with a metaheuristic iterative solver. Simulation results inform parameter selection for a series of field experiments. The field experiments demonstrate how the new algorithm can provide resilience against the many compounding factors that make UAV-based mountain search and rescue challenging.

Abstract:
Dynamic obstacle avoidance (DOA) is critical for quadrupedal robots operating in environments with moving obstacles or humans. Existing approaches typically rely on navigation-based trajectory replanning, which assumes sufficient reaction time and leads to failures when obstacles approach rapidly. In such scenarios, quadrupedal robots require reflexive evasion capabilities to perform instantaneous, low-latency maneuvers. This paper introduces Reflexive Evasion Robot (REBot), a control framework that enables quadrupedal robots to achieve real-time reflexive obstacle avoidance. REBot integrates an avoidance policy and a recovery policy within a finite-state machine. With carefully designed learning curricula and by incorporating regularization and adaptive rewards, REBot achieves robust evasion and rapid stabilization in instantaneous DOA tasks. We validate REBot through extensive simulations and real-world experiments, demonstrating notable improvements in avoidance success rates, energy efficiency, and robustness to fast-moving obstacles. Paper homepage: https://rebot-2025.github.io/.

Abstract:
Recent advancements in autonomous swarm systems mark a pivotal point in robotic science. Using a large-scale swarm of simple robots for complex tasks offers efficient, robust, and reliable solutions, inspired by natural phenomena. While bio-inspired methods are effective, approaches inspired by physical interactions in viscoelastic materials offer more structured ways to prove stability and robust performance mathematically. This paper proposes a new viscoelastic swarm algorithm applicable to heterogeneous swarm systems. The algorithm's development utilises the Lyapunov method to determine stability criteria and conditions, thereby avoiding reliance on complex optimisation to ensure stable performance parameters. A series of Monte Carlo simulations assessed the algorithm's performance and sensitivity to key variables. Furthermore, experiments with real robots evaluated the effects of variables like neighbourhood conditions and the stiffness coefficient on the algorithm's output. The results from simulations and experiments demonstrate the algorithm's stable, bounded performance and show how key variables, such as the stiffness coefficient and the number of neighbours, influence swarm performance. In real-world experiments, the proposed framework significantly reduces the robots' control effort while improving swarm behaviour, compared with a state-of-the-art algorithm.
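
As a rough illustration of what a viscoelastic interaction between neighbouring robots can look like, the sketch below couples robots with Kelvin-Voigt links (a spring in parallel with a damper). The abstract does not give the actual interaction law, so the force model, the unit-mass dynamics, and the names `viscoelastic_forces`, `simulate`, `stiffness`, `damping`, and `rest_length` are all assumptions made for this example.

```python
import numpy as np

def viscoelastic_forces(pos, vel, neighbours, stiffness=1.0, damping=1.0,
                        rest_length=1.0):
    """Sum spring-plus-damper forces exerted on each robot by its neighbours.

    pos, vel: (n, 2) arrays of positions and velocities
    neighbours: dict mapping robot index -> list of neighbour indices
    """
    forces = np.zeros_like(pos)
    for i, nbrs in neighbours.items():
        for j in nbrs:
            offset = pos[j] - pos[i]
            dist = np.linalg.norm(offset)
            if dist < 1e-9:
                continue  # coincident robots: direction undefined, skip the link
            direction = offset / dist
            # Elastic term pulls towards the rest length; viscous term damps
            # the relative velocity between the two robots.
            forces[i] += stiffness * (dist - rest_length) * direction
            forces[i] += damping * (vel[j] - vel[i])
    return forces

def simulate(pos, vel, neighbours, dt=0.01, steps=2000, **params):
    """Semi-implicit Euler integration for unit-mass robots."""
    pos = np.array(pos, dtype=float)
    vel = np.array(vel, dtype=float)
    for _ in range(steps):
        vel += dt * viscoelastic_forces(pos, vel, neighbours, **params)
        pos += dt * vel
    return pos, vel
```

With two robots starting 3 m apart and a rest length of 1 m, the pair settles at the rest length; the stiffness coefficient and the neighbour set are the kind of variables whose influence the abstract reports studying.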

Abstract:
Trans-catheter cardiac intervention has become an increasingly available option for high-risk patients without the complications of open heart surgery. However, current catheter-based platforms lack the dexterity, force application, and compliance required to perform complex intracardiac procedures. An exemplary task that would significantly ease minimally invasive intracardiac procedures is the implantation of anchor coils, which can be used to fix and implant various devices. We introduce a robotic platform capable of delivering anchor coils. We develop a kineto-statics model of the robotic platform and demonstrate low positional error. We leverage the passive compliance and high force output of the actuator in a multi-anchor delivery procedure against a motile in vitro simulator with millimeter-level accuracy.

Abstract:
Hybrid representation Visual Simultaneous Localization and Mapping (VSLAM) systems combine the inherent strengths of both discrete and field representations. They promise high-precision tracking and photo-realistic dense mapping. However, current keyframe selection methods in hybrid representation VSLAM struggle to satisfy both the high-precision tracking requirements of discrete representations and the high-quality rendering requirements of field representations. In this paper, we propose a game-theory-inspired keyframe selection approach that addresses the requirements of both representation types. We introduce two objective functions to comprehensively assess discrete point tracking and radiance field model rendering. By employing a game-theory-inspired framework, our method effectively balances these objectives to achieve improved keyframe selection. Experimental results demonstrate that integrating our approach into a hybrid representation VSLAM system significantly enhances tracking accuracy and rendering quality, outperforming existing keyframe selection methods.

Abstract:
In the near future, most deployed spacecraft will be autonomous. Their tasks will involve autonomous rendezvous and proximity operations (RPOs) with large structures, such as inspection, assembly, and maintenance of orbiting space stations, as well as human-assistance tasks over shared workspaces. Yet, testing these capabilities remains challenging since microgravity conditions are difficult to simulate on Earth. Free-flying platforms, which replicate microgravity environments through nearly frictionless planar motion, have been used to provide a way to easily test and experiment on these systems without being in orbit. To promote replicable and reliable scientific results for autonomous control of spacecraft, we present the design of a space robotics platform based on open-source and modular software and hardware: the autonomy testbed for multipurpose orbiting systems (ATMOS). ATMOS uses thrusters and air bearings to achieve near-frictionless motion, thereby simulating spacecraft dynamics in two dimensions and enabling realistic testing of navigation and control algorithms on Earth. The simulation software provides a software-in-the-loop architecture that seamlessly transfers simulated results to the hardware. Our results provide an insight into the performance of such a system, including comparisons of hardware and software results, as well as control and planning methodologies for controlling free-flying platforms.

Abstract:
Reliable grasp point selection on deformable linear objects, such as cables, requires not only accurate depth estimation but also awareness of prediction reliability. We present a five-stage stereo network for joint disparity, semantic, and uncertainty estimation, and use the predicted uncertainty to filter grasp candidates before geometric ranking. Disparity uncertainty is modeled via a Laplace negative log-likelihood, semantic uncertainty via the entropy of semantic predictions, with an alignment term enforcing consistency between them. Experiments on a synthetic stereo dataset show that uncertainty-aware selection reduces the mean grasp-point depth error from 4.19 mm to 1.55 mm, increases the success rate within a 3 mm tolerance from 74.2% to 88.6%, and lowers the 90th percentile of the failure exceedance above 3 mm from 29.47 mm to 6.77 mm. These results show that uncertainty is an effective cue for safer grasp selection on deformable linear objects.
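
The two uncertainty measures named above have standard closed forms, sketched here under stated assumptions. The Laplace negative log-likelihood and the entropy are textbook definitions; the thresholding rule in `filter_candidates`, its parameter names, and the omission of the alignment term are choices made for this illustration and are not taken from the paper.

```python
import numpy as np

def laplace_nll(disparity_pred, disparity_true, log_scale):
    """Per-pixel negative log-likelihood of a Laplace distribution.

    The density is exp(-|x - mu| / b) / (2b), so the NLL is
    |x - mu| / b + log(b) + log(2), with b = exp(log_scale).
    """
    scale = np.exp(log_scale)
    return np.abs(disparity_pred - disparity_true) / scale + log_scale + np.log(2.0)

def semantic_entropy(class_probs, eps=1e-12):
    """Shannon entropy (in nats) of per-pixel class probabilities."""
    return -np.sum(class_probs * np.log(class_probs + eps), axis=-1)

def filter_candidates(candidates, scale, entropy, max_scale, max_entropy):
    """Keep grasp candidates whose two uncertainties are both below a threshold.

    candidates: (n, d) array of candidate grasp points (hypothetical layout)
    scale: (n,) predicted Laplace scale at each candidate
    entropy: (n,) semantic entropy at each candidate
    """
    keep = (scale <= max_scale) & (entropy <= max_entropy)
    return candidates[keep]
```

Predicting `log_scale` rather than the scale itself keeps the scale positive without a constraint, which is a common design choice for heteroscedastic losses.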

Abstract:
Learning from demonstrations (LfD) is usually performed over Euclidean spaces, while the robot state, e.g. orientation, naturally evolves over curved spaces. Therefore, to ensure natural, complex motion generation, we investigate learning from demonstrations over Riemannian manifolds that are capable of encoding both position and orientation data. Here, geodesic paths provide for natural motion between two arbitrary points within the manifold. We propose to numerically estimate geodesics via neural ordinary differential equations, mitigating the large computational overhead of existing approaches. Finally, these geodesics can be decoded back into the original task space before deploying on the robot. In this extended abstract, we discuss the architecture of our framework, provide some initial insights from our simulation experiments, including comparison to other geodesic computation mechanisms, and discuss the challenges and prospects for future work.
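
To make the notion of computing a geodesic by solving an ODE concrete, the sketch below integrates the geodesic equation on the unit sphere, where unit quaternions (orientations) live, and compares the result with the known closed form. It swaps the paper's learned manifold and neural ODE for a fixed manifold and a plain RK4 integrator, so it only illustrates the underlying idea; the function names are hypothetical.

```python
import numpy as np

def geodesic_rhs(state):
    """Geodesic equation on the unit sphere: x'' = -|x'|^2 x.

    state has shape (2, n): row 0 is the position x, row 1 the velocity v.
    """
    x, v = state
    return np.array([v, -np.dot(v, v) * x])

def integrate_geodesic(x0, v0, steps=200):
    """Integrate the geodesic ODE over t in [0, 1] with classical RK4."""
    state = np.array([x0, v0], dtype=float)
    h = 1.0 / steps
    for _ in range(steps):
        k1 = geodesic_rhs(state)
        k2 = geodesic_rhs(state + 0.5 * h * k1)
        k3 = geodesic_rhs(state + 0.5 * h * k2)
        k4 = geodesic_rhs(state + h * k3)
        state = state + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return state[0]

def exp_map(x0, v0):
    """Closed-form geodesic endpoint on the unit sphere (exponential map)."""
    speed = np.linalg.norm(v0)
    if speed < 1e-12:
        return np.array(x0, dtype=float)
    return np.cos(speed) * np.asarray(x0) + np.sin(speed) * np.asarray(v0) / speed
```

On a sphere the closed form makes numerical integration unnecessary; the ODE route matters for manifolds whose metric is learned from data, where no closed form is available.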

Abstract:
This paper introduces Robotic Augmented Reality for Machine Programming by Demonstration (RAMPA), the first ML-integrated, XR-driven end-to-end robotic system, allowing training and deployment of ML models such as ProMPs on the fly, and utilizing the capabilities of state-of-the-art and commercially available AR headsets, e.g., Meta Quest 3, to facilitate the application of Programming by Demonstration (PbD) approaches on industrial robotic arms, e.g., Universal Robots UR10. Our approach enables in-situ data recording, visualization, and fine-tuning of skill demonstrations directly within the user's physical environment. RAMPA addresses critical challenges of PbD, such as safety concerns, programming barriers, and the inefficiency of collecting demonstrations on the actual hardware. The performance of our system is evaluated against the traditional method of kinesthetic control in teaching three different robotic manipulation tasks and analyzed with quantitative metrics, measuring task performance and completion time, trajectory smoothness, system usability, user experience, and task load using standardized surveys. Our findings indicate a substantial advancement in how robotic tasks are taught and refined, promising improvements in operational safety, efficiency, and user engagement in robotic programming.

Abstract:
In this letter, we propose a technique for calibrating Lighthouse localization systems using a single view of two or more coplanar circles traced by a moving robot. The calibration method leverages conic algebra to compute the homography between the Lighthouse view and the world plane, up to similarity. This approach requires minimal user intervention and is particularly suited for automatically calibrating large-scale deployments involving hundreds of mobile robots. We validate our method using a centimeter-scale differential-drive robot, utilizing 5 cm circles to calibrate a 2x2 m^2 area. The proposed technique achieved a mean positional accuracy of 7.77 mm, compared to the 5.37 mm accuracy of a previous calibration method based on manual measurements and known correspondences. We demonstrate that the conics traced by the robot are accurate enough for reliable homography estimation, even under varying conditions of tire material and surface type. A camera-based motion capture system served as the ground truth for all experiments. This work represents a step toward scalable and decentralized lighthouse calibration, enabling efficient 2D localization in large-scale robotic systems.

Abstract:
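Navigating in dense crowds is a key research problem in real-world scenarios. It requires an agent to avoid collisions in dynamic environments and reach its destination while ensuring high accuracy and efficiency in its decision-making. Existing methods typically treat pedestrians as rigid bodies, detect object bounding boxes, and use rigid-body dynamics to guide the agent's behavior. However, in densely crowded scenes, this approach can lead to suboptimal path planning solutions, imposing stricter constraints on the agent's action space. In some real-world navigation scenarios, pedestrians can avoid collisions through slight posture adjustments without changing direction. To this end, we propose fluid dynamics regularization to address the challenges posed by pedestrian modeling in dense crowd environments. This approach treats pedestrians ...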

Abstract:
Soft robots exhibit natural compliance, which is desirable in many applications, but often require stiffness modulation techniques when more rigidity is needed. However, many existing stiffening techniques lack portability or fast response times, hindering the ubiquitous adoption of soft robots. Here we introduce a new instantaneous stiffness modulation method based on magnetism that exhibits portability due to electronic control. This technique jams together thin layers of inherently magnetic metal sheets with a magnetic field generated by electropermanent magnets (EPMs), producing rapid stiffness changes. Quasi-static and dynamic mechanical characterizations for samples with varied layer numbers are presented, highlighting how the magnetic attraction generated by EPMs can be exploited to create a jamming effect. Stiffness increases of up to 68% and energy absorptions of up to 113 mJ were found during quasi-static and dynamic characterizations, respectively. Finally, we demonstrate how this jamming technique can be used in a haptic feedback application and to play a miniaturized version of the game of Skee-Ball.

Abstract:
Inertial odometry (IO) using only Inertial Measurement Units (IMUs) offers a lightweight and cost-effective solution for Unmanned Aerial Vehicle (UAV) applications, yet existing learning-based IO models often fail to generalize to UAVs due to their highly dynamic and nonlinear flight patterns, which differ from pedestrian motion. In this work, we identify that the conventional practice of transforming raw IMU data to global coordinates undermines the observability of critical kinematic information in UAVs. By preserving the body-frame representation, our method achieves substantial performance improvements, with a 66.7% average increase in accuracy across three datasets. Furthermore, explicitly encoding attitude information into the motion network results in an additional 23.8% improvement over prior results. Combined with a data-driven IMU correction model (AirIMU) and an uncertainty-aware Extended Kalman Filter (EKF), our approach ensures robust state estimation under aggressive UAV maneuvers without relying on external sensors or control inputs. Notably, our method also demonstrates strong generalizability to unseen data not included in the training set, underscoring its potential for real-world UAV applications.

Abstract:
Person re-identification (re-ID) is crucial for security applications, including autonomous robots that monitor individuals via continuous image acquisition. Such data are transmitted to a database; however, if stored without adequate protection, they can be intercepted, posing privacy risks. In response, the existing methods balance privacy and accuracy, but protected images still reveal structural cues, such as silhouettes or edges. These methods rely on randomness to defend against recovery attacks, limiting the guarantee of complete protection. Thus, this work proposes latent retrieval-augmented generation (RAG), an identity retrieval-guided latent augmentation framework for privacy-preserving person re-ID that balances the re-ID performance with privacy protection. The proposed method generates augmented codes that distort appearance and disrupt mapping to the original input by retrieving identity-similar latent codes and applying inverse self-attention, enhancing its robustness to recovery attacks. Next, this approach employs gradient-based latent code manipulation to preserve identity vectors to maintain re-ID accuracy. The hierarchical latent codes are concurrently adjusted to eliminate structural cues that could threaten privacy. The experimental results demonstrate that Latent-RAG induces strong visual distortion, reliable re-ID accuracy and a robust defense against recovery attacks, even without additional training with a few frozen parameters in a pretrained generator. Our code is available at https://github.com/BACKAI/Latent-RAG.

Abstract:
Diffusion policies are powerful visuomotor models for robotic manipulation, yet they often fail to generalize to manipulators or end-effectors unseen during training and struggle to accommodate new task requirements at inference time. Addressing this typically requires costly data recollection and policy retraining for each new hardware or task configuration. To overcome this, we introduce an adaptation-projection strategy that enables a diffusion policy to perform cost-effective adaptation to novel manipulators and dynamic task settings, entirely at inference time and without retraining or fine-tuning the policy. Our method first trains a diffusion policy in SE(3) space using demonstrations from a base manipulator. During online deployment, it projects the policy's generated trajectories to satisfy the kinematic and task-specific constraints imposed by the new hardware and objectives. Moreover, this projection dynamically adapts to physical differences (e.g., tool-center-point offsets, jaw widths) and task requirements (e.g., obstacle heights), ensuring robust and successful execution. We validate our approach on real-world pick-and-place, pushing, and pouring tasks across multiple manipulators, including the Franka Panda and Kuka iiwa 14, equipped with a diverse array of end-effectors like flexible grippers, Robotiq 2F/3F grippers, and various 3D-printed designs. Our results demonstrate consistently high success rates in these cross-manipulator scenarios, proving the effectiveness and practicality of our adaptation-projection strategy.

Abstract:
Imitation Learning (IL) enables robots to learn complex skills from demonstrations without explicit task modeling, but it typically requires large amounts of demonstrations, creating significant collection costs. Prior work has investigated using flow as an intermediate representation to enable the use of human videos as a substitute, thereby reducing the amount of required robot demonstrations. However, most prior work has focused on the flow, either on the object or on specific points of the robot/hand, which cannot describe the motion of interaction. Meanwhile, relying on flow to achieve generalization to scenarios observed only in human videos remains limited, as flow alone cannot capture precise motion details. Furthermore, conditioning on scene observation to produce precise actions may cause the flow-conditioned policy to overfit to training tasks and weaken the generalization indicated by the flow. To address these gaps, we propose SFCrP, which includes a Scene Flow prediction model for Cross-embodiment learning (SFCr) and a Flow and Cropped point cloud conditioned Policy (FCrP). SFCr learns from both robot and human videos and predicts any point trajectories. FCrP follows the general flow motion and adjusts the action based on observations for precision tasks. Our method outperforms SOTA baselines across various real-world task settings, while also exhibiting strong spatial and instance generalization to scenarios seen only in human videos.

Abstract:
Lung cancer is the leading cause of cancer death globally, and early diagnosis via transbronchial biopsy (TBB) improves outcomes. However, conventional bronchoscopes are inefficient for multiple pulmonary nodules and depend heavily on operator skill. This paper proposes a path planning strategy for robotic bronchoscopic multi-sample TBB. It abstracts the bronchial tree as a circuit, with lesions as constant-resistance bulbs and bronchial branches as resistors whose equivalent resistance is based on their morphology. Multi-target path planning is transformed into minimizing total circuit resistance, optimizing trajectories to reduce redundant movements of robotic manipulators. Compared with traditional methods, evaluations show a reduction of over 60% in movement distance and 76% in operation time, and experiments achieve an efficiency improvement of over 40%, enhancing the efficiency and safety of multi-sample TBB.

Abstract:
Nonprehensile manipulation, such as pushing and pulling, enables robots to move, align, or reposition objects that may be difficult to grasp due to their geometry, size, or relationship to the robot or the environment. Much of the existing work in nonprehensile manipulation relies on parallel-jaw grippers or tools such as rods and spatulas. In contrast, multi-fingered dexterous hands offer richer contact modes and versatility for handling diverse objects to provide stable support over the objects, which compensates for the difficulty of modeling the dynamics of nonprehensile manipulation. Therefore, we propose Geometry-aware Dexterous Pushing and Pulling (GD2P) for nonprehensile manipulation with dexterous robotic hands. We study pushing and pulling by framing the problem as synthesizing and learning pre-contact dexterous hand poses that lead to effective manipulation. We generate diverse hand poses via contact-guided sampling, filter them using physics simulation, and train a diffusion model conditioned on object geometry to predict viable poses. At test time, we sample hand poses and use standard motion planners to select and execute pushing and pulling actions. We perform extensive real-world experiments with an Allegro Hand and a LEAP Hand, demonstrating that GD2P offers a scalable route for generating dexterous nonprehensile manipulation motions with its applicability to different hand morphologies. Our project website is available at: https://geodex2p.github.io/.

Abstract:
Trajectory prediction systems are critical for autonomous vehicle safety, yet remain vulnerable to adversarial attacks that can cause catastrophic traffic behavior misinterpretations. Existing attack methods require white-box access with gradient information and rely on rigid physical constraints, limiting real-world applicability. We propose DTP-Attack, a decision-based black-box adversarial attack framework tailored for trajectory prediction systems. Our method operates exclusively on binary decision outputs without requiring model internals or gradients, making it practical for real-world scenarios. DTP-Attack employs a novel boundary walking algorithm that navigates adversarial regions without fixed constraints, naturally maintaining trajectory realism through proximity preservation. Unlike existing approaches, our method supports both intention misclassification attacks and prediction accuracy degradation. Extensive evaluation on nuScenes and Apolloscape datasets across state-of-the-art models including Trajectron++ and Grip++ demonstrates superior performance. DTP-Attack achieves 41�?1% attack success rates for intention misclassification attacks that manipulate perceived driving maneuvers with perturbations below 0.45m, and increases prediction errors by 1.9 �?4.2× for accuracy degradation. Our method consistently outperforms existing black-box approaches while maintaining high controllability and reliability across diverse scenarios. These results reveal fundamental vulnerabilities in current trajectory prediction systems, highlighting urgent needs for robust defenses in safety-critical autonomous driving applications.

Abstract:
Lifelong Multi-Agent Pathfinding (Lifelong MAPF) is an extension of the Multi-Agent Pathfinding (MAPF) problem. It has significant applications in scenarios such as warehouse logistics and delivery services. Narrow passages that restrict side-by-side traversal are common in such scenarios, posing a major challenge to the lifelong MAPF problem. To address this issue, this paper proposes dual-layer PIBT, a lifelong MAPF method specifically designed for biconnected environments containing narrow passages. The method leverages loop decomposition of the biconnected graph to establish coordinated unidirectional constraints: all narrow passages belonging to the same loop are assigned consistent traversal directions, enabling rapid conflict-free navigation decisions. The experimental results demonstrate significant reductions in both makespan and task service time compared to the baseline method.

Abstract:
In this work, we address the challenge of Scene Change Detection (SCD), where the goal is to identify variations between two images of the same location captured at different times. Existing SCD models often overlook the varying importance of features across layers, employ single-step decoders that confine refinement, and provide limited insight into encoder pretraining strategies. We propose TERDNet, a Transformer Encoder-Recurrent Decoder Network designed to overcome these limitations. TERDNet consists of a transformer-based encoder that extracts multi-level representations, a feature fusion module that integrates correlation volumes with these features, a recurrent 3-gate GRU decoder that performs iterative refinement, and a combined convolution-interpolation upsampler that restores fine-grained resolution. Extensive experiments on four public benchmarks show that TERDNet consistently outperforms prior approaches and produces more accurate and detailed change masks. Ablation studies confirm the benefit of segmentation-based pretraining and the effectiveness of our fusion design. In addition, robustness tests under viewpoint misalignment confirm TERDNet's potential for deployment in real-world robotic systems, where reliable perception is critical. Our code is at https://github.com/AutoCompSysLab/TERDNet.

Abstract:
This study introduces a novel assembly task, layered peg-in-hole, and proposes a search strategy for it. Distinct from the traditional peg-in-hole assembly, where two workpieces (i.e., peg and hole) are present, the layered peg-in-hole assembly contains three workpieces (i.e., peg, hole, and through-hole). To handle the additional moving part, the through-hole, without relying on task-specific devices, a dual-manipulator system is preferable to a single-manipulator system. However, existing research has primarily concentrated on the traditional peg-in-hole assembly regardless of the number of manipulators. In this respect, as the main contribution, a search strategy is proposed for the layered peg-in-hole assembly, which consists of two phases. In the first phase, both manipulators actively engage in the search task. In the second phase, the compliant behavior of the manipulator grasping the through-hole is advocated to assist the counterpart. Finally, the proposed search strategy is verified through a real-robot experiment that replicates an industrial environment with two 7-DOF torque-controlled manipulators.

Abstract:
Simultaneous locomotion and manipulation enables robots to interact with their environment beyond the constraints of a fixed base. However, coordinating legged locomotion with arm manipulation while considering safety and compliance during contact interaction remains challenging. To this end, we propose a whole-body controller that combines a model-based admittance control for the manipulator arm with a Reinforcement Learning (RL) policy for legged locomotion. The admittance controller maps external wrenches, such as those applied by a human during physical interaction, into desired end-effector velocities, allowing for compliant behavior. The velocities are tracked jointly by the arm and leg controllers, enabling a unified 6-DoF force response. The model-based design permits accurate force control and safety guarantees via a Reference Governor (RG), while robustness is further improved by a Kalman filter enhanced with neural networks for reliable base velocity estimation. We validate our approach in both simulation and hardware using the Unitree Go2 quadruped robot with a 6-DoF arm and wrist-mounted 6-DoF Force/Torque sensor. Results demonstrate accurate tracking of interaction-driven velocities, compliant behavior, and safe, reliable performance in dynamic settings.
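
The wrench-to-velocity mapping described in this abstract can be pictured with a minimal sketch. It assumes a per-axis virtual mass-damper model, M * dv/dt + D * v = wrench, integrated with explicit Euler. The model form, the gains, the time step, and the function name are illustrative assumptions; the abstract does not specify the authors' admittance law.

```python
def admittance_step(v, wrench, mass, damping, dt):
    """One explicit-Euler step of M * dv/dt + D * v = wrench, per axis."""
    return [vi + dt * (fi - di * vi) / mi
            for vi, fi, mi, di in zip(v, wrench, mass, damping)]

# Constant 10 N push along x; all other wrench components are zero.
v = [0.0] * 6
wrench = [10.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mass = [2.0] * 6       # hypothetical virtual inertia
damping = [20.0] * 6   # hypothetical virtual damping
for _ in range(2000):  # 2 s of simulated interaction at 1 kHz
    v = admittance_step(v, wrench, mass, damping, dt=0.001)
```

Under these assumed gains the commanded velocity along x settles near wrench / damping, so a lower damping value yields a more compliant response to the same push.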

Abstract:
Accurate state estimation for flexible robotic systems poses significant challenges, particularly for platforms with dynamically deforming structures that invalidate rigid-body assumptions. This paper addresses this problem and enables the extension of existing rigid-body pose estimation methods to non-rigid systems. Our approach integrates two core components: first, we capture elastic properties using a deformation-force model, efficiently learned via a Multi-Layer Perceptron; second, we resolve the platform's inherently smooth motion using continuous-time B-spline kinematic models. By continuously applying Newton's Second Law, our method formulates the relationship between visually-derived trajectory acceleration and predicted deformation-induced acceleration. We demonstrate that our approach not only enables robust and accurate pose estimation on non-rigid platforms, but also shows that the properly modeled platform physics allow for the recovery of inertial sensing properties. We validate this feasibility on a simple spring-camera system, showing how it robustly resolves the typically ill-posed problem of metric scale and gravity recovery in monocular visual odometry.
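
As a rough illustration of relating spline-derived acceleration to force-induced acceleration, the sketch below evaluates the second derivative of a uniform cubic B-spline segment and compares it with force / mass. It is a 1D toy under stated assumptions: the force is given directly (in the paper it would come from the learned deformation-force model), and all names and values are hypothetical.

```python
def bspline_acceleration(p0, p1, p2, p3, u, dt):
    """Second time derivative of a uniform cubic B-spline segment, u in [0, 1).

    For a uniform cubic B-spline the second derivative is a linear blend of
    the second differences of neighbouring control points.
    """
    d2_a = p0 - 2.0 * p1 + p2
    d2_b = p1 - 2.0 * p2 + p3
    return ((1.0 - u) * d2_a + u * d2_b) / (dt * dt)

def newton_residual(spline_acc, force, mass):
    """Mismatch between trajectory acceleration and force / mass."""
    return spline_acc - force / mass

dt = 0.1                      # assumed knot spacing in seconds
true_acc = 3.0                # assumed constant acceleration in m/s^2
ctrl = [0.5 * true_acc * (i * dt) ** 2 for i in range(4)]
acc = bspline_acceleration(*ctrl, u=0.4, dt=dt)
residual = newton_residual(acc, force=6.0, mass=2.0)
```

Because the spline acceleration carries metric units only once the control points are correctly scaled, a residual of this kind is the sort of term that can constrain scale in a monocular setting.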

Abstract:
Automating cell manipulation at a solid-liquid interface is a critical challenge for biomedical applications such as embryo cryopreservation. Unlike manipulation in a full liquid medium, the cell-substrate contact creates a significant static friction force that is not readily measurable with current sensing or vision technologies. This unpredictability poses a high risk of cell loss, as the cell can transition abruptly from a static to a high-velocity state when the applied hydrodynamic force breaches the friction threshold. Existing methods fail to estimate this hidden friction parameter and cannot anticipate the sudden dynamic shift. To address this challenge, this paper proposes a worst-case predictive control approach with online adversarial parameter estimation (WPC-OAPE). The core innovation is the inference of the static friction barrier from observations made while the cell is still stationary. This estimate then informs a predictive controller that proactively plans against the worst-case scenario, to select the optimal action. The WPC-OAPE scheme was validated in robotic embryo vitrification experiments, where it achieved a 100% success rate with zero cell loss. This performance significantly surpassed open-loop (66.6% success) and standard predictive control (83.3% success) methods, proving its potential for clinical applications.

Abstract:
Imitation learning (IL) presents a promising paradigm for enabling embodied robots to efficiently acquire human-like manipulation skills. However, prevailing methods face a persistent trade-off between motion precision and computational tractability. To resolve this fundamental challenge, this paper introduces Viper, a framework for Verifiable Imitation learning Policy for Efficient Robotic manipulation. Viper integrates principles of Nonlinear Model Predictive Control (NMPC) within a learning-based model. Grounded in an NMPC-style closed-loop architecture, the proposed method unifies the modeling of nonlinear system dynamics with online, multi-horizon optimization of state-action predictions, while intrinsically embedding physical constraints. This co-design enables both smooth trajectory generation and fast execution. Furthermore, a theoretical stability analysis for the Viper framework is provided. Extensive evaluations, from simulated benchmarks to real-world manipulation tasks, demonstrate that Viper effectively reconciles the competing demands of precision and speed inherent in existing robotic IL paradigms. The source code will be released upon paper acceptance.

Abstract:
Classical algorithms for autonomous navigation, while well-understood and safe, require manual parameter tuning by experts to perform well. APPL and similar methods use machine learning to dynamically adjust planner parameters during deployment. This approach maintains the safety of classical systems but remains constrained by the underlying algorithm. Instead of parameter tuning, we suggest using classical planners to regulate action selection of a reinforcement learning (RL) algorithm. The resulting policy is provably similar to the well-understood classical algorithm, performs better than both a well-tuned classical planner and an unregularized RL-based policy, and can be shown to respect a user-controlled trust region even during training. In experiments, our method reduces traversal time by 8% (vs. DWA) and 43% (vs. TEB), and lowers proximity risk by 24% and 17%, respectively, while matching or surpassing learning-based baselines and aligning more closely with user preferences.
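
One simple way to picture a user-controlled trust region around a classical planner is to clip each RL action component to a fixed radius around the planner's action. This is only an illustrative assumption: the abstract does not state how the regulation is implemented, and the function name, the per-component box, and the numbers below are hypothetical.

```python
def regulated_action(rl_action, planner_action, radius):
    """Clip each RL action component to within `radius` of the planner's."""
    return [min(max(a, p - radius), p + radius)
            for a, p in zip(rl_action, planner_action)]

planner_cmd = [0.5, 0.0]   # hypothetical (linear, angular) velocity from a classical planner
rl_cmd = [0.9, -0.4]       # hypothetical RL proposal
safe_cmd = regulated_action(rl_cmd, planner_cmd, radius=0.2)
```

With a radius of zero the executed command equals the classical planner's output, and a larger radius gives the learned policy more freedom, which is the sense in which the radius acts as a user-set knob.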

Abstract:
Robots operate under significant uncertainty, from quantifiable noise to unquantifiable unknowns, and must account for strict operational constraints, such as limited resources. In this paper, we consider the problem of synthesizing robust strategies to guide a robot's actions in fulfilling a given task, while ensuring the system never exhausts its resources. To solve this problem, we first model the robotic system as a Consumption Markov Decision Process with Set-valued Transitions (CMDPST), a unified framework modelling nondeterministic actions, quantifiable and unquantifiable uncertainty, and resource consumption. Then, we combine the CMDPST with the task specification, expressed as a Linear Temporal Logic over finite traces (LTLf) formula. Lastly, we address the resource-constrained optimal robust strategy synthesis problem, which aims to synthesize a strategy that maximizes the probability of satisfying the LTLf objective without resource exhaustion. Our solution involves two techniques: a direct unrolling-based method and a more efficient, optimized approach that leverages state-space pruning for better performance. Experiments on a warehouse transportation network show the effectiveness of the proposed solutions.

Abstract:
Electroencephalogram (EEG) signals have unique individual characteristics and have broad application prospects in identity authentication. At present, person identification (PI) based on EEG using the temporal-spatial-spectral feature extraction framework has achieved remarkable success. However, the existing methods suffer from coupled cross-domain feature parameters and insufficient feature fusion during feature extraction, which limits the recognition ability. Moreover, fixed-scale feature extractors can hardly exploit the subject-specific multi-scale information. To address these challenges, we propose CMDFM: a complete multi-domain decoupled fusion model for EEG-based PI. Firstly, we design an independent temporal-spatial-spectral attention mechanism to eliminate cross-domain parameter coupling. Secondly, a full-domain fusion mechanism is designed to comprehensively integrate the features of the temporal domain, spatial domain and spectral domain. Finally, an adaptive multi-scale CNN is designed to adjust the contribution of the multi-scale convolution kernel, thereby making full use of individual-specific multi-scale information. We use four datasets to verify our method. The experimental results show that our method is superior to all the state-of-the-art methods. The code of CMDFM is at https://github.com/2538441690/CMDFM.

Abstract:
LiDAR-based odometry is widely used in ground robot localization. However, current methods encounter challenges in accuracy and robustness due to structural degradation, system observational error, and accumulated error. To address the above issues, we propose CPBA-LIWO, a continuous-time LiDAR-Inertial-Wheel (LIW) odometry based on probabilistic bundle adjustment (PBA) within a sliding window. This method constructs a general wheel model, which is used for the complementary fusion of LiDAR, IMU and wheel data through a continuous-time trajectory using a B-spline curve, thereby improving the robustness of the system in structurally degraded environments. Furthermore, to improve the accuracy of long-distance odometry, we propose a probabilistic model for the voxel plane and implement a sliding-window voxel PBA backend based on this model. The experimental results on the M2DGR-plus and KAIST datasets demonstrate that our method outperforms state-of-the-art LiDAR-based odometry in terms of accuracy and robustness.

Abstract:
Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on global and local fields using a permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In various tests, HeRO establishes a new state-of-the-art, improving success on Place Dual Shoes by 12.3% and averaging 6.5% gains across six challenging pose-aware tasks. Code is available at https://github.com/Chongyang-99/HeRO.

Abstract:
High-fidelity sensor simulation of light-based sensors such as cameras and LiDARs is critical for safe and accurate autonomy testing. Neural radiance field (NeRF)-based methods that reconstruct sensor observations via ray-casting of implicit representations have demonstrated accurate simulation of driving scenes, but are slow to train and render, hampering scalability. 3D Gaussian Splatting (3DGS) has demonstrated faster training and rendering times through rasterization, but is primarily restricted to pinhole camera sensors, preventing usage for realistic multi-sensor autonomy evaluation. Moreover, both NeRF and 3DGS couple the representation with the rendering procedure (implicit networks for ray-based evaluation, particles for rasterization), preventing interoperability, which is key for general usage. In this work, we present Sparse Local Fields (SaLF), a novel volumetric representation that supports rasterization and raytracing. SaLF represents volumes as a sparse set of 3D voxel primitives, where each voxel is a local implicit field. SaLF has fast training (<30 min) and rendering capabilities (50+ FPS for camera and 600+ FPS for LiDAR), has adaptive pruning and densification to easily handle large scenes, and can support non-pinhole cameras and spinning LiDARs. We demonstrate that SaLF delivers realism comparable to existing self-driving sensor simulation methods while improving efficiency and enhancing capabilities, thereby enabling more scalable simulation.

Abstract:
Enhancing brain activation efficiency is crucial in developing brain-computer interface (BCI) paradigms for cognitive rehabilitation. However, existing BCI paradigms have mostly achieved limited sensory activation without sufficient feedback of mind and body, significantly limiting user engagement and training efficiency. In this study, we propose a novel multisensory neurofeedback framework to develop an immersive BCI paradigm for emotion regulation, supported by a novel panoramic motion-based virtual reality system. The paradigm is designed to promote deeper cognitive and physical involvement in functional brain training. It delivers multisensory neurofeedback (visual, auditory, and motor) through the Gait Real-time Analysis Interactive Lab system and incorporates cognitive reappraisal from the modified Gross procedure for emotion regulation. Its effectiveness is validated through three experimental studies, including event-related potential analysis, power spectral density analysis, and brain network analysis. The results demonstrate that the paradigm enhances motor-cognitive interaction and multisensory coordination, effectively increasing brain activation in visual, auditory, and motor processing regions, and further promoting stronger engagement of emotion regulation-related areas such as the prefrontal cortex. Compared with conventional paradigms, the proposed paradigm increases the number of high-intensity functional connections by 28.6% (from 42 to 54) and the number of effective functional connections by 42.3% (from 71 to 101).

Abstract:
Enabling robots with contact-rich manipulation remains a pivotal challenge in robot learning, which is substantially hindered by the data collection gap, including its inefficiency and limited sensor setup. While prior work has explored handheld paradigms, their rod-based mechanical structures remain rigid and unintuitive, providing limited tactile feedback and posing challenges for operators. Motivated by the dexterity and force feedback of human motion, we propose FreeTacMan, a human-centric and robot-free data collection system for accurate and efficient robot manipulation. Concretely, we design a wearable gripper with visuo-tactile sensors for data collection, which can be worn by human fingers for intuitive control. A high-precision optical tracking system is introduced to capture end-effector poses while synchronizing visual and tactile feedback simultaneously. We leverage FreeTacMan to collect a large-scale multimodal dataset, comprising over 3000k paired visuotactile images with end-effector poses, 10k demonstration trajectories across 50 diverse contact-rich manipulation tasks. FreeTacMan achieves multiple improvements in data collection performance over prior works and enables effective policy learning from self-collected datasets. By open-sourcing the hardware and the dataset, we aim to facilitate reproducibility and support research in visuo-tactile manipulation.

Abstract:
Accurate 6-DoF pose estimation of objects is critical for robots to perform precise manipulation tasks. However, for dynamic object pose estimation, conventional camera-based approaches face several major challenges, such as motion blur, sensor noise, and low-light limitation. To address these issues, we employ event cameras, whose high dynamic range and low latency offer a promising solution. Furthermore, we propose a keypoint-based detection and tracking approach for dynamic object pose estimation. Firstly, a keypoint detection network is constructed to extract keypoints from the time surface generated by the event stream. Subsequently, the polarity and spatial coordinates of the events are leveraged, and the event density in the vicinity of each keypoint is utilized to achieve continuous keypoint tracking. Finally, a hash mapping is established between the 2D keypoints and the 3D model keypoints, and the EPnP algorithm is employed to estimate the 6-DoF pose. Experimental results demonstrate that, whether in simulated or real event environments, the proposed method outperforms the event-based state-of-the-art methods in terms of both accuracy and robustness.
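
The time surface that feeds the keypoint detector can be sketched as follows. This minimal version assumes the common exponential-decay formulation, in which each pixel stores exp(-(t_ref - t_last) / tau) for its most recent event; the abstract does not state which variant the authors use, and the event tuple layout and parameter values are illustrative.

```python
import math

def time_surface(events, width, height, t_ref, tau):
    """Exponentially decayed map of the most recent event time per pixel."""
    last = [[None] * width for _ in range(height)]
    for x, y, t, _polarity in events:
        # Keep only the latest event at or before the reference time.
        if t <= t_ref and (last[y][x] is None or t > last[y][x]):
            last[y][x] = t
    return [[0.0 if t is None else math.exp(-(t_ref - t) / tau) for t in row]
            for row in last]

# Events as (x, y, timestamp in seconds, polarity); a hypothetical layout.
events = [(0, 0, 0.00, 1), (1, 0, 0.05, -1), (0, 0, 0.10, 1)]
surface = time_surface(events, width=2, height=1, t_ref=0.10, tau=0.05)
```

Pixels with recent activity take values near 1 and stale pixels decay toward 0, which gives a dense image-like input from a sparse event stream.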

Abstract:
Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on two-wheeler rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we present the MOtorized TwO-wheeler Rider (MOTOR) dataset, the first large-scale, multi-view, multimodal resource dedicated to two-wheelers in dense, unstructured traffic. MOTOR comprises 2,500 sequences (25+ hours of video data) collected from 16 riders and integrates synchronized front, rear, and helmet videos, rider eye-gaze from wearable trackers, on-road audio, and telemetry (GPS, accelerometer, gyroscope). Rich annotations capture traffic context, rider state, 12 riding maneuvers spanning conventional and unconventional behaviors, and legality labels (Legal, Illegal, Unspecified). We benchmark rider behavior recognition and maneuver legality classification using state-of-the-art video action recognition backbones (CNN and Transformer-based), extended with multimodal fusion, and find that combining RGB, gaze, and telemetry consistently yields the best performance. MOTOR thus provides a unique foundation for advancing safety-critical understanding of two-wheeler riding. It offers the research community a benchmark to develop and evaluate models for behavior analysis, legality-aware prediction, and intelligent transportation systems. Dataset and code are available at https://varuniiith.github.io/MOTOR-Dataset/

Abstract:
Robust 3D scene understanding is crucial for autonomous robots, but degrades sharply in low-light environments where sensor noise and illumination inconsistencies corrupt visual inputs. Even 3D Gaussian Splatting (3DGS), while efficient for real-time reconstruction, produces unstable and artifact-prone results under such conditions, limiting its reliability for navigation and mapping. To address these challenges, we propose a 3DGS-based framework for reconstructing clear scenes under low-light conditions. First, we employ a frequency-aware modulator that operates on spectral components to decouple and suppress sensor noise from structural signals, providing a clean input for reconstruction. Second, to refine the 3D model and ensure its compactness for onboard deployment, we introduce an adaptive denoising mask guided by dynamically updated statistics of rendering contribution and stability, which filters transient artifacts caused by sensor noise. Finally, a multi-view frequency consistency constraint is enforced to ensure the global coherence of the reconstructed model's appearance, which is critical for consistent mapping. Experiments on challenging low-light datasets demonstrate that our method achieves state-of-the-art reconstruction quality while reducing model storage by approximately 46.4% and maintaining real-time rendering speeds.

Abstract:
We present Caspar, a library that makes the power of modern GPUs more accessible in robotics and provides a state-of-the-art nonlinear GPU solver that can be applied to a wide range of different optimization problems. Caspar bridges the gap between expressive symbolic programming in Python and high-performance GPU runtimes in C++ by automatically generating optimized CUDA kernels from symbolic expressions. Building on the SymForce library, users can easily define and combine symbolic expressions, including Lie group operations, to generate custom CUDA kernels. To use Caspar as a solver, users need only define the symbolic residual functions; Caspar then uses symbolic differentiation to generate the necessary GPU kernels and interfaces to perform nonlinear optimization. In this paper, we present the core components of Caspar and showcase its performance by performing bundle adjustment on the Bundle Adjustment in the Large (BAL) dataset. We benchmark Caspar against other state-of-the-art bundle adjusters and show that it is 5 to 20 times faster than the best alternative, requires less memory, and achieves similar accuracy. This illustrates the benefit of our symbolic GPU programming approach. Caspar is released as part of SymForce and is freely available at https://github.com/symforce-org/symforce.

Abstract:
Multi-fingered hands are emerging as powerful platforms for performing fine manipulation tasks, including tool use. However, environmental perturbations or execution errors can impede task performance, motivating the use of recovery behaviors that enable normal task execution to resume. In this work, we take advantage of recent advances in diffusion models to construct a framework that autonomously identifies when recovery is necessary and optimizes contact-rich trajectories to recover. We use a diffusion model trained on the task to estimate when states are not conducive to task execution, framed as an out-of-distribution detection problem. We then use diffusion sampling to project these states in-distribution and use trajectory optimization to plan contact-rich recovery trajectories. We also propose a novel diffusion-based approach that distills this process to efficiently diffuse the full parameterization, including constraints, goal state, and initialization, of the recovery trajectory optimization problem, saving time during online execution. We compare our method to a reinforcement learning baseline and other methods that do not explicitly plan contact interactions, including on a hardware screwdriver-turning task where we show that recovering using our method improves task performance by 96% and that ours is the only method evaluated that can attempt recovery without causing catastrophic task failure. Videos can be found at https://dtourrecovery.github.io/.

Abstract:
Achieving safe and robust interaction in articulated-soft humanoid robots (ASRs) remains a major challenge due to their compliant joints, high degrees of freedom, and highly nonlinear coupled dynamics, which make them especially sensitive to external disturbances. This paper presents a novel contact-force-based iterative learning center-of-mass (CoM) impedance control framework (CF-IL-CIC) specifically designed to enhance disturbance robustness in floating-base ASRs. The key idea is to iteratively derive a time-series gross force compensation term from zero moment point (ZMP) tracking errors of previous trials, using a proportional-derivative (PD)-type update rule in simulation. This compensation is integrated with a contact-force-based CoM impedance controller to improve push recovery without requiring precise dynamic models or heavy online optimization. The approach is accompanied by a mathematical proof of divergent component of motion (DCM) error convergence, providing theoretical stability guarantees. The proposed method is validated through both dynamic simulations and real-robot experiments on the compliant humanoid BRUCE, demonstrating significant improvements in external impact rejection and recovery stability compared to baseline controllers.
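
The PD-type iterative learning update mentioned in this abstract has a standard textbook form, u_next[t] = u[t] + kp * e[t] + kd * de[t]/dt, sketched below on a toy problem. The static "plant" (error equals disturbance minus compensation), the gains, and all names are assumptions made for illustration; they stand in for the humanoid's ZMP dynamics, which the abstract does not detail.

```python
import math

def pd_ilc_update(comp, error, kp, kd, dt):
    """Return next-trial compensation from this trial's error time series."""
    n = len(error)
    new_comp = []
    for t in range(n):
        nxt = error[t + 1] if t + 1 < n else error[t]
        d_err = (nxt - error[t]) / dt          # forward-difference derivative
        new_comp.append(comp[t] + kp * error[t] + kd * d_err)
    return new_comp

dt = 0.01
disturbance = [5.0 * math.sin(0.1 * t) for t in range(50)]  # hypothetical push profile
comp = [0.0] * 50
for _ in range(30):  # 30 repeated trials
    # Toy static plant: tracking error is whatever the compensation misses.
    error = [d - u for d, u in zip(disturbance, comp)]
    comp = pd_ilc_update(comp, error, kp=0.5, kd=0.001, dt=dt)
final_error = max(abs(d - u) for d, u in zip(disturbance, comp))
```

In this toy setting the error shrinks from trial to trial because the chosen gains make the trial-to-trial error map a contraction; on a real robot the admissible gains depend on the system dynamics.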

Abstract:
Cable routing is a common manipulation task in assembly and manufacturing, yet it remains challenging due to the deformable nature of cables and the constraints of cluttered routing environments. In this paper, we present CRAFT: Cable Routing Around Fixtures using Two grippers, a novel hardware plus software architecture that integrates unimanual and bimanual operations for long-horizon cable routing. To address jamming due to friction, we present a novel caging gripper with a roller mechanism. Physical experiments consisting of 160 trials on a modified NIST board with five types of fixtures and turning angles of up to 930 degrees yield an average completion ratio of 84.5% across four routing difficulty tiers, representing a 54.2% improvement over an earlier baseline. The cable routing materials and benchmarks are available at https://manipulation-net.org/tasks/cable_routing.html.

Abstract:
Humanoid soccer dribbling is a highly challenging task that demands dexterous ball manipulation while maintaining dynamic balance. Traditional rule-based methods often struggle to achieve accurate ball control due to their reliance on fixed walking patterns and limited adaptability to real-time ball dynamics. To address these challenges, we propose a two-stage curriculum learning framework that enables a humanoid robot to acquire dribbling skills without explicit dynamics or predefined trajectories. In the first stage, the robot learns basic locomotion skills; in the second stage, we fine-tune the policy for agile dribbling maneuvers. We further introduce a virtual camera model in simulation that simulates the field of view and perception constraints of the real robot, enabling realistic ball perception during training. We also design heuristic rewards to encourage active sensing, promoting a broader visual range for continuous ball perception. The policy is trained in simulation and successfully transferred to a physical humanoid robot. Experiment results demonstrate that our method enables effective ball manipulation, achieving flexible and visually appealing dribbling behaviors across multiple environments. This work highlights the potential of reinforcement learning in developing agile humanoid soccer robots. Additional details and videos are available at https://zhuoheng0910.github.io/dribble-master/.

Abstract:
Navigation of mobile robots in dynamic environments with pedestrian traffic poses a significant challenge in the development of autonomous mobile service robots. Recently, deep reinforcement learning-based methods have been actively studied and have outperformed traditional rule-based approaches, owing to their optimization capabilities. Among these methods, those assuming continuous action spaces typically use Gaussian distributions, limiting the flexibility of action generation. By contrast, the application of diffusion models to reinforcement learning has advanced, allowing more flexible action distributions than Gaussian policy-based approaches. In this study, we applied a diffusion-based reinforcement learning approach to social navigation and validated its effectiveness. Furthermore, using the characteristics of diffusion models, we propose extensions that allow adaptation to previously unseen scenarios without additional training. As concrete examples, we show adaptability to scenarios in which the environment contains static obstacles that were not present during training, as well as scenarios in which the objective differs from training, such as accompanying a target pedestrian while avoiding other pedestrians to reach a destination.

Abstract:
Learning-based methods have enabled robots to acquire bio-inspired movements with increasing levels of naturalness and adaptability. Among these, Imitation Learning (IL) has proven effective in transferring complex motion patterns from animals to robotic systems. However, current state-of-the-art frameworks predominantly rely on Proximal Policy Optimization (PPO), an on-policy algorithm that prioritizes stability over sample efficiency and policy generalization. This paper proposes a novel IL framework that combines Adversarial Motion Priors (AMP) with the off-policy Soft Actor-Critic (SAC) algorithm to overcome these limitations. This integration leverages replay-driven learning and entropy-regularized exploration, enabling naturalistic behavior and task execution while improving data efficiency and robustness. We evaluate the proposed approach (AMP+SAC) on quadruped gaits involving multiple reference motions and diverse terrains. Experimental results demonstrate that the proposed framework not only maintains stable task execution but also achieves higher imitation rewards compared to the widely used AMP+PPO method. These findings highlight the potential of off-policy IL formulations for advancing motion generation in robotics. Code and supplementary material are available at: https://github.com/nayariml/AMP_SAC.git

Abstract:
Robots fail, potentially leading to a loss in the robot's perceived reliability (PR), a measure correlated with trustworthiness. In this study we examine how various kinds of failures affect the PR of the robot differently, and how this measure recovers without explicit social repair actions by the robot. In a preregistered and controlled online video study, participants were asked to predict a robot's success in a pick-and-place task. We examined manipulation failures (slips), freezing (lapses), and three types of incorrect picked objects or place goals (mistakes). Participants were shown one of 11 videos: one of five types of failure, one of five types of failure followed by a successful execution in the same video, or a successful execution video. This was followed by two additional successful execution videos. Participants bet money either on the robot or on a coin toss after each video. People's betting patterns along with a qualitative analysis of their survey responses highlight that mistakes are less damaging to PR than slips or lapses, and some mistakes are even perceived as successes. We also see that successes immediately following a failure have the same effect on PR as successes without a preceding failure. Finally, we show that successful executions recover PR after a failure. Our findings highlight which robot failures are in higher need of repair in a human-robot interaction, and how trust could be recovered by robot successes.

Abstract:
Dense SLAM with a monocular camera remains a highly challenging task. In this paper, we present AMG-SLAM, a novel dense monocular SLAM system that tightly couples sparse tracking with dense Gaussian mapping to achieve fully online and high-quality surface reconstruction. In the frontend, learning-based modules enable efficient pose tracking and Gaussian proposal with sparse depth initialization. Specifically, we propose a fidelity-aware Gaussian proposal strategy that adaptively adds new Gaussians based on reconstruction completeness, effectively avoiding redundancy. In the backend, we propose a focus-and-balance online refinement strategy, which adaptively selects under-optimized Gaussians for focused refinement while ensuring globally balanced optimization by maximizing scene view coverage. We evaluated our method on synthetic and real-world datasets, including Replica, ScanNet, and EuRoC. Thanks to efficient system coupling and adaptive Gaussian proposal and refinement, our system achieves trajectory accuracy, rendering precision, and geometric accuracy comparable to or exceeding current state-of-the-art methods, while also demonstrating high efficiency.

Abstract:
Multi-agent reinforcement learning systems deployed in real-world robotics applications face severe communication constraints that significantly impact coordination effectiveness. We present a framework that combines information bottleneck theory with vector quantization to enable selective, bandwidth-efficient communication in multi-agent environments. Our approach learns to compress and discretize communication messages while preserving task-critical information through principled information-theoretic optimization. We introduce a gated communication mechanism that dynamically determines when communication is necessary based on environmental context and agent states. Experimental evaluation on challenging coordination tasks demonstrates that our method achieves 181.8% performance improvement over no-communication baselines while reducing bandwidth usage by 41.4%. Pareto frontier analysis shows dominance across the entire success-bandwidth spectrum, with an area under the curve of 0.198 vs 0.142 for next-best methods. Our approach significantly outperforms existing communication strategies and establishes a theoretically grounded framework for deploying multi-agent systems in bandwidth-constrained environments such as robotic swarms, autonomous vehicle fleets, and distributed sensor networks.
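
The gating and discretization steps described above can be illustrated with a minimal sketch. It assumes a fixed, hand-written codebook and a scalar gate score supplied by the caller; in the actual method both would be learned, and the information-bottleneck objective is not modeled here. The names `nearest_codeword`, `gated_message`, and `bits_per_message` are hypothetical.

```python
import math


def nearest_codeword(message, codebook):
    """Vector quantization: index of the codeword closest to the message."""
    return min(range(len(codebook)), key=lambda k: math.dist(message, codebook[k]))


def gated_message(message, gate_score, codebook, threshold=0.5):
    """Send a codeword index only if the (assumed) gate score reaches the threshold."""
    if gate_score < threshold:
        return None  # stay silent, no bandwidth used
    return nearest_codeword(message, codebook)


def bits_per_message(codebook):
    """Bits needed to transmit one codeword index."""
    return math.ceil(math.log2(len(codebook)))


# Hypothetical 2-D codebook with four entries.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (0.0, 2.0)]
sent = gated_message((0.9, 1.2), gate_score=0.8, codebook=codebook)
skipped = gated_message((0.9, 1.2), gate_score=0.2, codebook=codebook)
```

Transmitting an index instead of a continuous vector is what bounds the per-message cost, and the gate removes messages entirely when they are judged unnecessary.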

Abstract:
Large vision-language models have driven remarkable progress in open-vocabulary robot policies, e.g., generalist robot manipulation policies, that enable robots to complete complex tasks specified in natural language. Despite these successes, open-vocabulary autonomous drone navigation remains an unsolved challenge due to the scarcity of large-scale demonstrations, real-time control demands of drones for stabilization, and lack of reliable external pose estimation modules. In this work, we present SINGER for language-guided autonomous drone navigation in the open world using only onboard sensing and compute. To train robust, open-vocabulary navigation policies, SINGER leverages three central components: (i) a photorealistic language-embedded flight simulator with minimal sim-to-real gap using Gaussian Splatting for efficient data generation, (ii) an RRT-inspired multi-trajectory generation expert for collision-free navigation demonstrations, and these are used to train (iii) a lightweight end-to-end visuomotor policy for real-time closed-loop control. Through extensive hardware flight experiments, we demonstrate superior zero-shot sim-to-real transfer of our policy to unseen environments and unseen language-conditioned goal objects. When trained on ~700k-1M observation-action pairs of language-conditioned visuomotor data and deployed on hardware, SINGER outperforms a velocity-controlled semantic guidance baseline by reaching the query 23.33% more on average, and maintains the query in the field of view 16.67% more on average, with 10% fewer collisions.

Abstract:
Reliable trajectory prediction from an ego-centric perspective is crucial for robotic navigation in human-centric environments. However, existing methods typically assume noiseless observation histories, failing to account for the perceptual artifacts inherent in first-person vision, such as occlusions, ID switches, and tracking drift. This discrepancy between training assumptions and deployment reality severely limits model robustness. To bridge this gap, we introduce EgoTraj-Bench, built upon the TBD dataset, which is the first real-world benchmark that aligns noisy, first-person visual histories with clean, bird's-eye-view future trajectories, enabling robust learning under realistic perceptual constraints. Building on this benchmark, we propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion. To better model agent intent, BiFlow incorporates our EgoAnchor mechanism, which conditions the prediction decoder on distilled historical features via feature modulation. Extensive experiments show that BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10–15% on average and demonstrating superior robustness. We anticipate that our benchmark and model will provide a critical foundation for robust real-world ego-centric trajectory prediction. The benchmark library is available at: https://github.com/zoeyliu1999/EgoTraj-Bench.

Abstract:
The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain's strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.

Abstract:
Low-light images often suffer from low contrast, noise, and color distortion, degrading visual quality and impairing downstream vision tasks. We propose a novel conditional diffusion framework for low-light image enhancement that incorporates a Structured Control Embedding Module (SCEM). SCEM decomposes a low-light image into four informative components: illumination, illumination-invariant features, shadow priors, and color-invariant cues. These components serve as control signals that condition a U-Net-based diffusion model trained with a simplified noise-prediction loss. Thus, the proposed SCEM-equipped diffusion method enforces structured enhancement guided by physical priors. In experiments, our model is trained only on the LOLv1 dataset and evaluated without fine-tuning on LOLv2-real, LSRW, DICM, MEF, and LIME. The method achieves state-of-the-art performance in quantitative and perceptual metrics, demonstrating strong generalization across benchmarks. https://casted.github.io/scem/.

Abstract:
Recent work introduced human teleoperation (HT), where the remote robot typically used in conventional bilateral teleoperation is replaced by a novice person wearing a mixed reality headset and tracking the motion of a virtual tool controlled by an expert. HT has advantages in cost, complexity, and patient acceptance for telemedicine in remote or low-resource communities. However, the stability, transparency, and performance of bilateral HT are unexplored. In this paper, we therefore develop a mathematical model of the HT system using test data. We then analyze various control architectures with this model and implement them with the HT system, testing volunteer operators and a virtual fixture-based simulated patient to find the achievable performance, investigate stability, and determine the most promising teleoperation scheme in the presence of time delays. We show that instability in HT, while not destructive or dangerous, makes the system unusable. However, stable and transparent teleoperation is possible with small time delays (<200 ms) through 3-channel teleoperation, or with large delays through model-mediated teleoperation with local pose and force feedback for the novice.

Abstract:
Non-rigid structure-from-motion (NRSfM), a promising technique for addressing the mapping challenges in monocular visual deformable simultaneous localization and mapping (SLAM), has attracted growing attention. We introduce a novel method, called Con-NRSfM, for NRSfM under conformal deformations, encompassing isometric deformations as a subset. Our approach performs point-wise reconstruction using 2D selected image warps optimized through a graph-based framework. Unlike existing methods that rely on strict assumptions, such as locally planar surfaces or locally linear deformations, and fail to recover the conformal scale, our method eliminates these constraints and accurately computes the local conformal scale. Additionally, our framework decouples constraints on depth and conformal scale, which are inseparable in other approaches, enabling more precise depth estimation. To address the sensitivity of the formulated problem, we employ a parallel separable iterative optimization strategy. Furthermore, a self-supervised learning framework, utilizing an encoder-decoder network, is incorporated to generate dense 3D point clouds with texture. Simulation and experimental results using both synthetic and real datasets demonstrate that our method surpasses existing approaches in terms of reconstruction accuracy and robustness.

Abstract:
Accurate dynamic modeling of soft-shelled spherical robots is challenging due to coupled rigid-soft body interactions and pressure-dependent contact behavior. This letter presents a modeling strategy for an empirically tuned pendulum-driven inflatable spherical robot. The approach combines a rigid-body dynamics engine in Drake with non-conservative effects. The robot's rigid-body model is generated from a custom URDF and augmented with interchangeable joint friction modules. Three alternative outer shell contact models are also considered: Drake's native hydroelastic contact, a pressure-dependent injected stiffness-damping model derived from isolated shell experiments, and a rigid point-contact baseline. Shell dynamics are characterized in the steering direction using a custom locking fixture, yielding empirical pressure-related frequency and damping relationships to parameterize the models. Ramp descent experiments across multiple inflation pressures validate the framework, showing that an appropriate model reduces drive velocity prediction error compared to a rigid point-contact case. The approach enables modular integration of additional dynamic effects, supports data-driven parameter tuning, and provides a reproducible pathway for accurate simulation of soft spherical robots.

Abstract:
Oil spills continuously affect marine ecosystems and require rapid monitoring for effective emergency response. This letter tackles the problem of persistent monitoring for continuously changing and scattered oil spill regions through Entropy-Based Incremental Coverage Path Planning (EICPP). By using contour comparison between monitoring cycles, an incremental coverage mechanism is first introduced to focus on newly emerged oil spill regions. Then, a balanced region division algorithm is incorporated to handle scattered oil spill areas while ensuring equal workload distribution among UAVs. The entropy-based path planning enhances oil spill monitoring effectiveness by Drift Information Freshness (DIF) through prioritizing high-entropy regions under limited UAV resources. We evaluate the robustness and effectiveness of our method across multiple scenarios. Our method demonstrates clear advantages in DIF, achieving 19–25% improvements over strong baselines across different spill scales and about 19.6–24% on real-world oil spill datasets. It also substantially reduces total flight distance while consistently satisfying the 90% coverage requirement.

Abstract:
Depth information is crucial for underwater robotic detection and navigation tasks. However, the underwater imaging environment is complex and variable. The images captured by robots are typically sequences or videos with uniform scene content, and the ground-truth of depth is difficult to obtain. This challenge hinders the generalization of existing self-supervised monocular depth estimation (SMDE) schemes for practical underwater detection applications. To address this issue, we propose an SMDE method for underwater images informed by the physical process of optical degradation. Specifically, we developed a further degradation process for underwater images, which can constrain the image restoration process to solve the attenuation coefficient and depth map, and then combine it with the ego-motion based framework to form a self-supervised learning closed loop. Guided by inherent optical properties, this closed loop can learn depth cues from the underwater image formation model and the geometric relationships involved in view transformation. Experimental results demonstrate that the proposed method outperforms existing techniques and generalizes well across different underwater scenes. Compared with the SOTA method, it reduces RMSE by about 9.1% and improves threshold accuracy by about 3.5%, and it can adapt to various underwater robot detection scenarios.

Abstract:
Cosserat rod models are widely used to simulate, design, and control soft robots. The Cosserat framework accounts for bending, torsion, transverse shear, and elongation of a long, slender structure and correctly handles large rotations and deflections in 3D, while being far less computationally expensive than full 3D elasticity models using finite elements. However, the Cosserat model is not always appropriate for soft robotic structures since it assumes the cross sections never change size or shape. In this letter, we extend the standard Cosserat rod model to include cross-sectional deformation while retaining much of its simplicity. We add to the Cosserat model additional degrees of freedom that parameterize stretch and shear in the cross-sectional plane and their rates of change along the rod length. We then formulate several possible constitutive laws on the state variables (one linear and one non-linear) and compare them to the standard Cosserat energy expressions to gain insight. We further show how fluidic actuation and tendon actuation can be incorporated into the model, and we compare the extended Cosserat models to 3D nonlinear finite-element simulations with good agreement. Finally, we demonstrate use of this model in a robotics context to control the path-following gait of a peristaltic worm-inspired soft robot.

Abstract:
A differential dynamic programming (DDP)-based framework for inverse reinforcement learning (IRL) is introduced to recover the parameters in the cost function, system dynamics, and constraints from demonstrations. Different from existing work, where DDP was usually used for the inner forward problem, our proposed framework uses it to efficiently compute the gradient required in the outer inverse problem with equality and inequality constraints. The equivalence between the proposed and existing methods based on Pontryagin's Maximum Principle (PMP) is established. More importantly, using this DDP-based IRL with an open-loop loss function, a closed-loop IRL framework is presented. In this framework, a loss function is proposed to capture the closed-loop nature of demonstrations. It is shown to be better than the commonly used open-loop loss function. We show that the closed-loop IRL framework reduces to a constrained inverse optimal control problem under certain assumptions. Under these assumptions and a rank condition, it is proven that the learning parameters can be recovered from the demonstration data. The proposed framework is extensively evaluated through four numerical robot examples and one real-world quadrotor system. The experiments validate the theoretical results and illustrate the practical relevance of the approach.

Abstract:
We propose a novel framework for decision-making in cooperative grasping for two-robot object transport in constrained environments. The core of the framework is a Conditional Embedding (CE) model consisting of two neural networks that map grasp configuration information into an embedding space. The resulting embedding vectors are then used to identify feasible grasp configurations that allow two robots to collaboratively transport an object. To ensure generalizability across diverse environments and object geometries, the neural networks are trained on a dataset comprising a range of environment maps and object shapes. We employ a supervised learning approach with negative sampling to ensure that the learned embeddings effectively distinguish between feasible and infeasible grasp configurations. Evaluation results across a wide range of environments and objects in simulations demonstrate the model's ability to reliably identify feasible grasp configurations. We further validate the framework through experiments on a physical robotic platform, confirming its practical applicability.

Abstract:
Wearable assistive devices that monitor muscle fatigue reduce the risk of work-related musculoskeletal disorders, enhance rehabilitation outcomes, and extend operational time by optimizing the power consumption of the device. This work proposes a muscle fatigue-aware controller (MFAC) for a semi-rigid knee exoskeleton. During an offline calibration phase, we use Gaussian Process Regression (GPR) to model the relationship between muscle activation (measured via EMG) and the corresponding joint moment and angle, enabling fatigue state estimation for the controller. The trained model then approximates muscle activation online using only joint states and moment derived from the user's kinematic data and ground reaction forces provided by the wearable device. The estimated muscle activation is used to assess the muscle fatigue state through a model-based fatigue evaluation module. Notably, EMG measurement is only required during the offline training in our approach, enabling EMG-free online estimation, which significantly enhances the feasibility for long-term mobile applications. Building on muscle fatigue and human-exoskeleton interaction models, we then developed an adaptive controller within a predictive control framework. The resulting optimization problem generates control signals that adjust assistance to reduce the fatigue progression. Two experiments validate the EMG-free fatigue estimation method and the integrated MFAC, demonstrating accurate muscle activation estimation an

Abstract:
High-quality 3D reconstruction of unknown small objects with complex surface details is important in applications such as digital preservation and cultural heritage archiving. In practice, such scanning procedures rely heavily on skilled human experts, but the high cost of expert training and the large number of objects requiring digitization make this process difficult to scale. This motivates the need to construct expert demonstration datasets as a foundation for future automated view planning. However, available scan data often contain only frame-level geometry without per-frame sensor poses. To address this issue, we propose a hierarchical grid-based method for extracting sensor poses from frame-based scan data. The proposed method progressively refines candidate poses through coarse-to-fine grid search and selects poses that effectively observe the target surface. Experimental results show an average coverage of 0.85, demonstrating the practicality of the proposed approach for expert demonstration dataset construction.
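
The coarse-to-fine refinement idea can be sketched generically as follows. This is an illustration only: it searches a 2-D position rather than a full sensor pose, and the scoring function is a hypothetical stand-in for a surface-coverage measure, which the abstract does not specify. The name `coarse_to_fine_search` and all parameters are assumptions.

```python
def coarse_to_fine_search(score, center, half_width, levels=4, steps=5):
    """Maximize score(x, y) by repeatedly sampling a grid and shrinking it.

    At each level a steps-by-steps grid is laid around the current best
    candidate, the best-scoring grid point is kept, and the grid extent
    is halved for the next level.
    """
    best = tuple(center)
    for _ in range(levels):
        offsets = [half_width * (2 * i / (steps - 1) - 1) for i in range(steps)]
        candidates = [(best[0] + dx, best[1] + dy) for dx in offsets for dy in offsets]
        best = max(candidates, key=score)
        half_width /= 2
    return best


def toy_score(pose):
    """Hypothetical score that peaks at (0.3, -0.7)."""
    x, y = pose
    return -((x - 0.3) ** 2 + (y + 0.7) ** 2)


best_pose = coarse_to_fine_search(toy_score, center=(0.0, 0.0), half_width=2.0)
```

Each level evaluates only steps² candidates, so the total cost grows linearly with the number of levels while the grid resolution doubles at every level.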

Abstract:
Human-in-the-loop robot deployment has gained significant attention in both academia and industry as a semi-autonomous paradigm that enables human operators to intervene and adjust robot behaviors at deployment time, improving success rates. However, continuous human monitoring and intervention can be highly labor-intensive and impractical when deploying a large number of robots. To address this limitation, we propose a method that allows diffusion policies to actively seek human assistance only when necessary, reducing reliance on constant human oversight. To achieve this, we leverage the generative process of diffusion policies to compute an uncertainty-based metric based on which the autonomous agent can decide to request operator assistance at deployment time, without requiring any operator interaction during training. Additionally, we show that the same method can be used for efficient data collection for fine-tuning diffusion policies in order to improve their autonomous performance. Experimental results from simulated and real-world environments demonstrate that our approach enhances policy performance during deployment for a variety of scenarios.

Abstract:
A recent set of techniques in the robotics community, known as certifiably correct methods, frames robotics problems as polynomial optimization problems (POPs) and applies convex, semidefinite programming (SDP) relaxations to either find or certify their global optima. In parallel, differentiable optimization allows optimization problems to be embedded into end-to-end learning frameworks and has received considerable attention in the robotics community. In this paper, we consider the ill effect of convergence to spurious local minima in the context of learning frameworks that use differentiable optimization. We present SDPRLayers, an approach that seeks to address this issue by combining convex relaxations with implicit differentiation techniques to provide certifiably correct solutions and gradients throughout the training process. We provide theoretical results that outline conditions for the correctness of these gradients and provide efficient means for their computation. Our approach is first applied to two simple-but-demonstrative simulated examples, which expose the potential pitfalls of reliance on local optimization in existing, state-of-the-art, differenti

Abstract:
This work views the multi-agent system and its surrounding environment as a co-evolving system. The goal is to take agent actions and environment configurations as decision variables, and optimize both in a coordinated manner. Towards this end, we consider the problem of decentralized multi-agent navigation in reconfigurable environments. By introducing two sub-objectives of multi-agent navigation and environment optimization, we propose an agent-environment co-optimization problem and develop a coordinated algorithm that alternates between sub-objectives to search for an optimal synthesis of agent actions and obstacle configurations; ultimately, improving navigation performance. Due to the challenge of modeling the relation between agents, environment and performance, we formulate a model-free learning mechanism within the coordinated framework. A formal convergence analysis shows our coordinated algorithm tracks the local minimum trajectory of an associated time-varying non-convex optimization problem. Experiments evaluate the benefits of co-optimization and interestingly, indicate optimized environments offer structural guidance that is key to de-conflicting agents.

Abstract:
Legged manipulators offer high mobility and versatile manipulation. However, robust interaction with heterogeneous articulated objects, such as doors, drawers, and cabinets, remains challenging because of the diverse articulation types of the objects and the complex dynamics of the legged robot. Existing reinforcement learning (RL)-based approaches often rely on high-dimensional sensory inputs, leading to sample inefficiency. In this paper, we propose a robust and sample-efficient framework for opening heterogeneous articulated objects with a legged manipulator. In particular, we propose Sampling-based Abstracted Feature Extraction (SAFE), which encodes handle and panel geometry into a compact low-dimensional representation, improving cross-domain generalization. Additionally, Articulation Information Estimator (ArtIEst) is introduced to adaptively mix proprioception with exteroception to estimate opening direction and range of motion for each object. The proposed framework was deployed to manipulate various heterogeneous articulated objects in simulation and real-world robot systems. Videos can be found on the project website: https://openheart-icra.github.io/OpenHEART/

Abstract:
This paper presents a closed-loop automation framework for heterogeneous modular robots, encompassing the entire pipeline from morphological construction to adaptive control. Within this framework, a mobile manipulator manipulates heterogeneous functional modules (including structural, joint, and wheeled modules) to dynamically assemble diverse robot configurations and grant them immediate locomotion capabilities. To address the state-space explosion inherent in large-scale heterogeneous reconfiguration, we propose a hierarchical planner: the high-level planner employs a bi-directional heuristic search with type penalty terms to generate module-handling sequences, while the low-level planner utilizes A* search to compute optimal execution trajectories. This approach effectively decouples discrete configuration planning from continuous motion execution. For adaptive motion generation of unknown assembled configurations, we introduce a GPU-accelerated Annealing Variance Model Predictive Path Integral (MPPI) controller. By incorporating a multi-stage variance annealing strategy to balance global exploration and local convergence, the controller achieves configuration-agnostic, real-time motion control. Large-scale simulations demonstrate that the type penalty term is crucial for planning robustness in heterogeneous scenarios. Furthermore, the greedy heuristic generates plans with lower physical execution costs compared to the Hungarian heuristic. The proposed Annealing Variance MPPI significantly outperforms standard MPPI in both velocity tracking accuracy and control frequency, achieving real-time control at 50 Hz. The framework successfully validates the full-cycle process, including module assembly, robot merging and splitting, and dynamic motion generation.
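
To make the multi-stage variance annealing idea concrete, here is a minimal CPU sketch of an MPPI update on a toy 1-D velocity-tracking problem. It is not the GPU controller described above: the dynamics, cost, stage schedule `sigmas`, and temperature `lam` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random


def rollout_cost(v0, controls, v_target, dt):
    """Sum of squared velocity errors for a 1-D integrator driven by accelerations."""
    v, cost = v0, 0.0
    for u in controls:
        v += u * dt
        cost += (v - v_target) ** 2
    return cost


def annealed_mppi(v0, v_target, horizon=10, samples=64,
                  sigmas=(2.0, 1.0, 0.3), lam=1.0, dt=0.1, seed=0):
    """Run one MPPI update per stage, shrinking the sampling std each stage."""
    rng = random.Random(seed)
    nominal = [0.0] * horizon
    for sigma in sigmas:  # wide noise first (exploration), narrow noise last (refinement)
        noises = [[rng.gauss(0.0, sigma) for _ in range(horizon)]
                  for _ in range(samples)]
        costs = [rollout_cost(v0, [u + e for u, e in zip(nominal, noise)], v_target, dt)
                 for noise in noises]
        best = min(costs)
        # Softmin weights; subtracting the best cost keeps exp() well scaled.
        weights = [math.exp(-(c - best) / lam) for c in costs]
        total = sum(weights)
        nominal = [u + sum(w * noise[t] for w, noise in zip(weights, noises)) / total
                   for t, u in enumerate(nominal)]
    return nominal


plan = annealed_mppi(v0=0.0, v_target=1.0)
```

In this sketch the early high-variance stage moves the nominal control sequence toward a good region, and the later low-variance stages refine it; a fixed-variance MPPI would have to trade one of these off against the other.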

Abstract:
Effective communication is critical for coordinating multi-robot teams, yet practical deployments often face severe bandwidth constraints and frequent message loss. This paper presents a communication protocol that leverages Bloom filters to enable efficient, approximate set membership queries in multi-robot systems. Bloom filters offer a tunable trade-off between false positive rate and memory footprint, making them well suited for bandwidth-limited communication. To mitigate the effects of false positives, we introduce a salting strategy that decorrelates Bloom filters and enables stacking - the combination of membership queries across multiple filters. These stacked results are incorporated into each robot's belief map, such that only sufficiently corroborated information influences frontier generation and exploration planning. We evaluate our proposed communication protocol in a multi-robot exploration task, where robots share information about their observed cells to enable efficient coverage. Our results demonstrate that compared to exact methods, our Bloom filter-based protocol reduces communication cost by up to 6 times while maintaining team exploration performance, even under severe communication dropouts.
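
A minimal sketch of salted Bloom filters and a simple form of stacking is given below. It assumes SHA-256 as the hash, string salts, and a vote-count rule for corroboration; the abstract does not specify these details, so the class `SaltedBloomFilter` and the helpers `stacked_votes` and `is_corroborated` are hypothetical illustrations rather than the protocol itself.

```python
import hashlib


class SaltedBloomFilter:
    """Bloom filter whose hash positions depend on a per-filter salt."""

    def __init__(self, num_bits, num_hashes, salt):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.salt = salt
        self.bits = 0  # Python int used as a bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            data = f"{self.salt}:{i}:{item}".encode()
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def stacked_votes(filters, item):
    """Number of filters that report the item as present."""
    return sum(item in f for f in filters)


def is_corroborated(filters, item, min_votes):
    """Accept an item only if enough differently salted filters agree."""
    return stacked_votes(filters, item) >= min_votes


# Three filters with different salts, all encoding the same observed cells.
observed_cells = [(0, 0), (3, 4), (7, 2)]
filters = [SaltedBloomFilter(256, 3, salt=s) for s in ("a", "b", "c")]
for f in filters:
    for cell in observed_cells:
        f.add(cell)
```

Because each salt changes which bits an item maps to, a cell that is a false positive in one filter is unlikely to be a false positive in the others, so requiring several votes suppresses spurious entries while true members always receive every vote.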

Abstract:
In this paper, we present RhoMorph, a novel deformable planar lattice modular self-reconfigurable robot (MSRR) with a rhombus-shaped module. Each module consists of a parallelogram skeleton with a single centrally mounted actuator that enables folding and unfolding along its diagonal. The core design philosophy is to achieve essential MSRR functionalities such as morphing, docking, and locomotion with minimal control complexity. This enables a continuous and stable reconfiguration process that is independent of the surrounding medium, allowing the system to reliably form various configurations in diverse environments. To leverage the unique kinematics of RhoMorph, we introduce morphpivoting, a novel motion primitive for reconfiguration that differs from advanced MSRR systems, and propose a strategy for its continuous execution. Finally, a series of physical experiments validate the module's stable reconfiguration ability, as well as its positional and docking accuracy.

Abstract:
Diffusion policies have demonstrated strong performance in generative modeling, making them promising for robotic manipulation guided by natural language instructions. However, generalizing language-conditioned diffusion policies to open-vocabulary instructions in everyday scenarios remains challenging due to the scarcity and cost of robot demonstration datasets. To address this, we propose DISCO, a framework that leverages off-the-shelf vision-language models (VLMs) to bridge natural language understanding with high-performance diffusion policies. DISCO translates linguistic task descriptions into actionable 3D keyframes using VLMs, which then guide the diffusion process through constrained inpainting. However, enforcing strict adherence to these keyframes can degrade performance when the VLM-generated keyframes are inaccurate. To mitigate this, we introduce an inpainting optimization strategy that balances keyframe adherence with learned motion priors from training data. Experimental results in both simulated and real-world settings demonstrate that DISCO outperforms conventional fine-tuned language-conditioned policies, achieving superior generalization in zero-shot, open-vocabulary manipulation tasks.

Abstract:
B is a novel optimization framework that addresses a critical challenge in fixed-base manipulator robotics: optimal base placement. Current methods rely on pre-computed kinematics databases generated through sampling to search for solutions. However, they face an inherent trade-off between solution optimality and computational efficiency when determining sampling resolution. To address these limitations, B unifies multiple objectives without database dependence. The framework employs a two-layer hierarchical approach. The outer layer systematically manages terminal constraints through progressive tightening, particularly for base mobility, enabling feasible initialization and broad solution exploration. The inner layer addresses non-convexities in each outer-layer subproblem through sequential local linearization, converting the original problem into tractable sequential linear programming (SLP). Testing across multiple robot platforms demonstrates B's effectiveness. The framework achieves solution optimality five orders of magnitude better than sampling-based approaches while maintaining perfect success rates and reduced computational overhead. Operating directly in configuration space, B enables simultaneous path planning with customizable optimization criteria. B serves as a crucial initialization tool that bridges the gap between theoretical motion planning and practical deployment, where feasible trajectory existence is fundamental.

Abstract:
Signed distance-radiance field (SDF-NeRF) is a promising environment representation that offers both photorealistic rendering and geometric reasoning such as proximity queries for collision avoidance. However, the slow training speed and convergence of SDF-NeRF hinder their use in practical robotic systems. We propose SplatSDF, a novel SDF-NeRF architecture that accelerates convergence using 3D Gaussian splats (3DGS), which can be quickly pre-trained. Unlike prior approaches that introduce a consistency loss between separate 3DGS and SDF-NeRF models, SplatSDF directly fuses 3DGS at an architectural level by consuming it as an input to SDF-NeRF during training. This is achieved using a novel sparse 3DGS fusion strategy that injects neural embeddings of 3DGS into SDF-NeRF around the object surface, while also permitting inference without 3DGS for minimal operation. Experimental results show SplatSDF achieves 3 times faster convergence to the same geometric accuracy than the best baseline, and outperforms state-of-the-art SDF-NeRF methods in terms of chamfer distance and peak signal to noise ratio, unlike consistency loss-based approaches that in fact provide limited gains. We also present computational techniques for accelerating gradient and Hessian steps by 3 times. We expect these improvements will contribute to deploying SDF-NeRF on practical systems.

Abstract:
Rigorous crop counting is crucial for effective agricultural management and informed intervention strategies. However, in outdoor field environments, partial occlusions combined with inherent ambiguity in distinguishing clustered crops from individual viewpoints pose an immense challenge for image-based segmentation methods. To address these problems, we introduce a novel crop counting framework designed for exact enumeration via 3D instance segmentation. Our approach utilizes 2D images captured from multiple viewpoints and associates independent instance masks for neural radiance field (NeRF) view synthesis. We introduce crop visibility and mask consistency scores, which are incorporated alongside 3D information from a NeRF model. This results in an effective segmentation of crop instances in 3D and highly accurate crop counts. Furthermore, our method eliminates the dependence on crop-specific parameter tuning. We validate our framework on three agricultural datasets consisting of cotton bolls, apples, and pears, and demonstrate consistent counting performance despite major variations in crop color, shape, and size. A comparative analysis against the state of the art highlights superior performance on crop counting tasks. Lastly, we contribute a cotton plant dataset to advance further research on this topic.

Abstract:
Robotics research has made significant strides in learning, yet mastering basic skills like object placement remains a fundamental challenge. A key bottleneck is the acquisition of large-scale, high-quality data, which is often a manual and laborious process. Inspired by Graspit!, a foundational work that used simulation to automatically generate dexterous grasp poses, we introduce Placeit!, an evolutionary-computation framework for generating valid placement positions for rigid objects. Placeit! is highly versatile, supporting tasks from placing objects on tables to stacking and inserting them. Our experiments show that by leveraging quality-diversity optimization, Placeit! significantly outperforms state-of-the-art methods across all scenarios for generating diverse valid poses. A pick&place pipeline built on our framework achieved a 90% success rate over 120 real-world deployments. This work positions Placeit! as a powerful tool for open-environment pick-and-place tasks and as a valuable engine for generating the data needed to train simulation-based foundation models in robotics.

Abstract:
Safe control has been widely studied in various safety-critical applications, for instance, autonomous driving. In order to ensure the autonomous vehicle does not collide with other vehicles, it is essential to obtain an accurate expectation of surrounding vehicles' behavior and react adaptively. Instead of assuming fully cooperative and homogeneous vehicles using the same safety-critical controllers, recent works have been exploring different data-driven approaches to model the neighboring vehicles' underlying controllers with observed data. However, existing works either suffer from 1) the inter-vehicle influence during the multi-vehicle interaction, which makes it hard to determine the causality of surrounding vehicles' behavior in controller modeling, or 2) being dominated by the worst-case analysis, which may lead to overly conservative behavior. In this paper, we extend the prior work on Parametric-Control Barrier Function (Parametric-CBF) to multi-robot interactions with embedded causality inference to explicitly reason over the inter-vehicle influence. Given the learned Causality-based Parametric-CBF, we present an adaptive safety-critical controller that allows the ego vehicle to safely react to surrounding vehicles with the learned expectation. We demonstrate that by leveraging the motion flexibility among multi-vehicle systems, task efficiency can be greatly improved in various interaction-intensive scenarios.

Abstract:
The focus on human-robot collaboration has emerged as a pivotal area in the advancement of precision agricultural systems. This strategy exploits the distinct strengths of both humans and robots while minimising the exertion of each. A central aim within human-robot collaboration is to create robotic systems that are capable of understanding instructions given in natural language. Agricultural settings, especially those with structured rows of crops, are characteristically uniform, presenting difficulties in accurately grounding instructions and navigating the space. In this paper, we establish a systematic method for robotic platforms operating within agricultural settings to recognize natural language directives and autonomously traverse toward specified targets, gathering data en route. We advance the 3D scene graph model introduced in Osiris [3], adapting it to support autonomy through a Visual Teach and Repeat paradigm, which does not rely on an expansive navigation stack. Additionally, we exploit large language models to correctly ground instructions within the newly constructed 3D scene graph representation, thus enabling natural language directives to be relayed to robotic systems in agricultural contexts. The system's ability to interpret and execute natural language commands is confirmed through validation and evaluation in a practical agricultural scenario via a ground robot.

Abstract:
Vision-Language-Action (VLA) models, such as OpenVLA, hold the promise of generalist robots, yet their performance is often impaired by distracted attention, which we identify as a manifestation of shortcut learning. We posit that the solution lies not in architectural modifications, but in a new training paradigm centered on visual prompts that provide explicit visual guidance to the model. We introduce Dual Stochastic Visual Prompting (SVP) as a concrete realization of this paradigm. SVP functions as a training-only "visual scaffold", a non-invasive mechanism that requires no architectural modifications. Our work demonstrates that this data-centric training paradigm is a highly effective strategy for mitigating distracted attention, enabling the learning of more robust and capable policies without architectural overhead. SVP yields substantial gains on the challenging LIBERO benchmark and in real-robot experiments. It improves the absolute success rate of the standard OpenVLA by 8.2% on long-horizon tasks and enhances the performance of the highly optimized OpenVLA-OFT. These improvements are validated on a real robot, where our model consistently outperforms baselines across a variety of manipulation tasks.

Abstract:
Reinforcement Learning (RL) robot controllers usually aggregate many task objectives into one scalar reward. While large-scale proximal policy optimisation (PPO) has enabled impressive results such as robust real-world robot locomotion, many tasks still require careful reward tuning and remain brittle to local optima. Tuning cost and sub-optimality grow with the number of objectives, limiting scalability. Modelling reward vectors and their trade-offs can address these issues; however, multi-objective methods remain underused in RL for robotics because of computational cost and optimisation difficulty. In this work, we study gradient conflicts that arise when multiple task objectives are combined into a scalar reward. In particular, we explicitly address the conflict between task-based rewards and terms that regularise the policy towards realistic behaviour. We propose GCR-PPO, a lightweight modification to PPO that decomposes actor updates into objective-wise gradients using a multi-headed critic and resolves conflicts according to objective priority. We evaluate GCR-PPO on IsaacLab manipulation and locomotion benchmarks and two additional tasks modified to include many objectives. GCR-PPO demonstrates superior scalability compared to massively-parallel PPO (p = 0.04) without significant computational overhead. Across tasks, GCR-PPO improves performance over large-scale PPO by an average of 9.5% (Symmetric Percentage Change), with larger gains on tasks exhibiting higher gradient conflict. Code is available at: https://github.com/humphreymunn/GCR-PPO.
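
To make the notion of a gradient conflict concrete, the sketch below shows one common way to resolve it: projecting the lower-priority gradient so that it no longer opposes the higher-priority one, in the style of gradient-surgery methods such as PCGrad. This is an assumption for illustration only; the abstract does not state that GCR-PPO uses this exact rule, and the function names `dot` and `resolve_by_priority` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def resolve_by_priority(high, low):
    """Return the low-priority gradient with any component opposing `high` removed.

    Two gradients conflict when their inner product is negative. In that case
    the low-priority gradient is projected onto the plane orthogonal to the
    high-priority gradient; otherwise it is returned unchanged.
    """
    d = dot(high, low)
    if d >= 0:
        return list(low)  # no conflict, nothing to resolve
    scale = d / dot(high, high)
    return [l - scale * h for l, h in zip(low, high)]


# A task gradient (high priority) and a conflicting regularisation gradient.
task_grad = [1.0, 0.0]
reg_grad = [-1.0, 1.0]
resolved = resolve_by_priority(task_grad, reg_grad)
print(resolved)  # [0.0, 1.0]
```

After projection the resolved gradient is orthogonal to the high-priority gradient, so applying both updates no longer reduces progress on the prioritised objective to first order.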

Abstract:
The rapid advancement of humanoid robotics has intensified the need for robust and adaptable controllers to enable stable and efficient locomotion across diverse platforms. However, developing such controllers remains a significant challenge because existing solutions are tailored to specific robot designs, requiring extensive tuning of reward functions, physical parameters, and training hyperparameters for each embodiment. To address this challenge, we introduce H-Zero, a cross-humanoid locomotion pretraining pipeline that learns a generalizable humanoid base policy. We show that pretraining on a limited set of embodiments enables zero-shot and few-shot transfer to novel humanoid robots with minimal fine-tuning. Evaluations show that the pretrained policy maintains up to 81% of the full episode duration on unseen robots in simulation while enabling few-shot transfer to unseen humanoids and upright quadrupeds within 30 minutes of fine-tuning.

Abstract:
Robotic manipulation tasks such as inserting a key into a lock or plugging a USB device into a port can fail when visual perception is insufficient to detect misalignment. In these situations, touch sensing is crucial for the robot to monitor the task's states and make precise, timely adjustments. Current touch sensing solutions are either too insensitive to detect subtle changes or demand excessive sensor data. Here, we introduce TranTac, a data-efficient and low-cost tactile sensing and control framework that integrates a single contact-sensitive 6-axis inertial measurement unit within the elastomeric tips of a robotic gripper for completing fine insertion tasks. Our customized sensing system can detect dynamic translational and torsional deformations at the micrometer scale, enabling the tracking of visually imperceptible pose changes of the grasped object. By leveraging transformer-based encoders and diffusion policy, TranTac can imitate human insertion behaviors using transient tactile cues detected at the gripper's tip during insertion processes. These cues enable the robot to dynamically control and correct the 6-DoF pose of the grasped object. When combined with vision, TranTac achieves an average success rate of 79% on object grasping and insertion tasks, outperforming both vision-only policy and the one augmented with end-effector 6D force/torque sensing. Additionally, TranTac's contact localization performance is validated through tactile-only insertion tasks, where the inserted object and slot are initially misaligned by 1 to 3 mm, achieving an average success rate of 88%. We assess the generalizability by training TranTac on a single prism-slot pair and testing it on unseen data, including a USB plug and a metal key, and find that the insertion tasks can still be completed with an average success rate of nearly 70%. The proposed framework may inspire new robotic tactile sensing systems for delicate manipulation tasks.

Abstract:
Object navigation for mobile robots typically assumes that targets are visible and paths are unobstructed. However, real-world scenarios often involve occluded targets like objects hidden behind doors or inside containers. Such scenarios require interactive navigation and manipulation by mobile manipulators. To address this challenge, we propose VLION, a vision-language model-guided framework for interactive object navigation (ION) that enables robots to locate and access such targets efficiently. VLION constructs a probabilistic occupancy map and dynamically identifies frontiers for efficient exploration. It leverages vision-language models (VLMs) to perform joint semantic reasoning at both the scene and object levels, generating Scene-Target and Object-Target Value Maps from egocentric observations. These maps are adaptively fused based on spatial entropy to guide target selection and dynamically balance navigation and manipulation priorities for multi-step decision-making. A hybrid A* planner ensures safe and feasible navigation, while star-convex manipulation regions enable interaction with objects. Extensive experiments in iGibson simulations and real-world environments demonstrate the effectiveness of VLION in zero-shot transfer and on-board deployment, advancing the state of the art in ION.

Abstract:
In this work, we introduce a risk-aware reinforcement learning framework for robust quadrupedal locomotion. Our approach first trains a family of risk-conditioned policies using a Conditional Value-at-Risk (CVaR) constrained optimization technique, which improves both training stability and sample efficiency. During deployment, we frame online policy selection as a multi-armed bandit problem. Relying solely on observed episodic returns rather than privileged environment information, this method dynamically adjusts the robot's robustness level to handle unknown conditions on the fly. We evaluate our approach in simulation across eight diverse settings (varying dynamics, contacts, sensing noise, and terrain) as well as in real-world trials on a Unitree Go2 robot. Compared to existing baselines, our risk-aware policy achieves nearly twice the mean and tail performance in novel environments, with the bandit algorithm successfully identifying the optimal policy within just two minutes of operation.
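
The two ingredients named above, CVaR and bandit-based policy selection, can be illustrated with a minimal sketch. It assumes an empirical lower-tail CVaR over episodic returns and uses UCB1 as the bandit rule; the abstract does not specify which bandit algorithm is used, and the function names `cvar` and `ucb1_select` are hypothetical. UCB1's exploration bonus is calibrated for returns scaled to [0, 1].

```python
import math


def cvar(returns, alpha):
    """Empirical CVaR: mean of the worst alpha-fraction of returns (lower tail)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)
    k = max(1, math.ceil(alpha * len(ordered)))  # number of tail samples
    tail = ordered[:k]
    return sum(tail) / len(tail)


def ucb1_select(counts, mean_returns, c=2.0):
    """Pick the index of the policy (arm) to run next using the UCB1 rule.

    counts[i] is how many episodes policy i has been run, and
    mean_returns[i] is its average observed episodic return.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # try every policy at least once
    total = sum(counts)
    scores = [
        m + math.sqrt(c * math.log(total) / n)
        for m, n in zip(mean_returns, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


episode_returns = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(cvar(episode_returns, 0.2))  # 1.5, the mean of the two worst returns
print(ucb1_select([5, 5], [0.4, 0.7]))  # 1, equal counts so the higher mean wins
```

A small `alpha` focuses on the worst episodes, which is what makes a CVaR-based objective risk-averse, while the bandit rule needs only episodic returns and no privileged information about the environment.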

Abstract:
We realize 3D robotic swarmalators that reconfigure, navigate, and avoid obstacles with formal safety on Crazyflie drones. We incorporate ellipsoidal Control Barrier Functions to avoid downwash turbulence between drones, and a combination of Control Lyapunov Function and Control Barrier Function methods to enable the collective to move toward desired locations while avoiding collisions between drones or with nearby obstacles. We implement a global control scheme that moves the collective as a single entity, and a local control scheme that enables fluid-like flow around nearby obstacles while maintaining the same general collective formation. Finally, we demonstrate how the swarmalator model combined with these control schemes can be used to reconfigure and rotate a drone collective so it moves through a narrow passage without colliding with the surrounding environment. Our simulations and physical experiments quantify scalability limits and validate the feasibility of implementing 3D swarmalator-based control on real drone collectives.

Abstract:
Wall-climbing robots capable of scaling vertical surfaces could help automate hazardous or labor intensive tasks such as window washing, inspection, maintenance, and construction. Active adhesion methods achieve higher payload capacities, but require power to maintain their grip. Passive adhesion devices such as suction cups are an attractive option for such robots because they do not require power to maintain their grip, but they are limited by their payload capacity. This work presents a novel high-payload wall-climbing robot that utilizes passive bistable suction cups to generate adhesion without needing to be pushed into the wall. The robot features a track-based system that automatically engages and disengages bistable suction cups to achieve locomotion on smooth surfaces. The robot is able to achieve vertical wall climbing on glass, wood, metal, and painted surfaces, sideways and upside-down climbing, and is able to tow a payload of 7.940 kg (with a payload-to-weight ratio of 2.25).

Abstract:
In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real-time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework's capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.

Abstract:
Object pose estimation is a fundamental problem in robotics and computer vision, yet it remains challenging due to partial observability, occlusions, and object symmetries, which inevitably lead to pose ambiguity and multiple hypotheses consistent with the same observation. While deterministic deep networks achieve impressive performance under well-constrained conditions, they are often overconfident and fail to capture the multi-modality of the underlying pose distribution. To address these challenges, we propose a probabilistic framework that leverages flow matching on the SE(3) manifold for estimating 6D object pose distributions. Unlike existing methods that regress a single deterministic output, our approach models the full pose distribution with a sample-based estimate and enables reasoning about uncertainty in ambiguous cases such as symmetric objects or severe occlusions. We achieve state-of-the-art results on Real275, YCB-V and LM-O, and demonstrate how our sample-based pose estimates can be leveraged in downstream robotic manipulation tasks such as active perception for disambiguating uncertain viewpoints, or guiding grasp synthesis in an uncertainty-aware manner.

Abstract:
Long-term state estimation over graphs remains challenging as current graph estimation methods scale poorly on large, long-term graphs. To address this, our work advances a current state-of-the-art graph sparsification algorithm, maximizing algebraic connectivity (MAC). MAC is a sparsification method that preserves estimation performance by maximizing the algebraic connectivity, a spectral graph property that is directly connected to the estimation error. Unfortunately, MAC remains computationally prohibitive for online use and requires users to manually pre-specify a connectivity-preserving edge set. Our contributions close these gaps along three complementary fronts: we develop a specialized solver for algebraic connectivity that yields an average 2x runtime speedup; we investigate advanced step size strategies for MAC's optimization procedure to enhance both convergence speed and solution quality; and we propose automatic schemes that guarantee graph connectivity without requiring manual specification of edges. Together, these contributions make MAC more scalable, reliable, and suitable for real-time estimation applications.
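
For readers unfamiliar with the quantity being maximized, the following restates the standard definition of algebraic connectivity and one plausible form of the sparsification problem. The notation ($w_{ij}$, $\mathcal{E}_{\text{fixed}}$, $\mathcal{E}_{\text{cand}}$, the budget $K$) is assumed here for illustration and is not taken from the abstract.

```latex
% Weighted graph Laplacian over n nodes, with e_i the i-th standard basis vector
L(\mathcal{E}) = \sum_{(i,j) \in \mathcal{E}} w_{ij}\,(e_i - e_j)(e_i - e_j)^{\top},
\qquad 0 = \lambda_1(L) \le \lambda_2(L) \le \dots \le \lambda_n(L).

% Algebraic connectivity: the second-smallest Laplacian eigenvalue
\lambda_2(L) = \min_{\substack{x \perp \mathbf{1} \\ \|x\|_2 = 1}} x^{\top} L\, x .

% Sparsification as edge selection under a budget K (assumed formulation)
\max_{\substack{\mathcal{S} \subseteq \mathcal{E}_{\text{cand}} \\ |\mathcal{S}| \le K}}
\; \lambda_2\!\left( L(\mathcal{E}_{\text{fixed}} \cup \mathcal{S}) \right).
```

As a small check of the definition, a path graph on three nodes with unit weights has Laplacian eigenvalues 0, 1, 3, so $\lambda_2 = 1$, while the complete graph on three nodes has eigenvalues 0, 3, 3, so $\lambda_2 = 3$. A graph is connected exactly when $\lambda_2 > 0$, which is why a connectivity-preserving edge set matters for the problem to be meaningful.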

Abstract:
Semi-supervised 3D object detection (SS3DOD) aims to reduce costly 3D annotations by utilizing unlabeled data. Recent studies adopt pseudo-label-based teacher-student frameworks and demonstrate impressive performance. The main challenge of these frameworks is in selecting high-quality pseudo-labels from the teacher's predictions. Most previous methods, however, select pseudo-labels by comparing confidence scores against manually set thresholds. The latest works tackle the challenge either by dynamic thresholding or by refining the quality of pseudo-labels. Such methods still overlook contextual information, e.g., object distances, classes, and learning states, and inadequately assess the pseudo-label quality using partial information available from the networks. In this work, we propose a novel SS3DOD framework featuring a learnable pseudo-labeling module designed to automatically and adaptively select high-quality pseudo-labels. Our approach introduces two networks at the teacher output level. These networks reliably assess the quality of pseudo-labels by score fusion and determine context-adaptive thresholds, which are supervised by the alignment of pseudo-labels over GT bounding boxes. Additionally, we introduce a soft supervision strategy that can learn robustly under pseudo-label noise. This helps the student network prioritize cleaner labels over noisy ones in semi-supervised learning. Extensive experiments on the KITTI and Waymo datasets demonstrate the effectiveness of our method. The proposed method selects high-precision pseudo-labels while maintaining a wider coverage of contexts and a higher recall rate, significantly improving relevant SS3DOD methods.

Abstract:
Speech-driven 3D facial animation has achieved significant progress in both research and applications. While recent baselines struggle to generate natural and continuous facial movements due to their frame-by-frame vertex generation approach, we propose 3DFacePolicy, a pioneering work that introduces a novel definition of vertex trajectory changes across consecutive frames through the concept of "action". By predicting action sequences for each vertex that encode frame-to-frame movements, we reformulate the vertex generation approach into an action-based control paradigm. Specifically, we leverage a robotic control mechanism, diffusion policy, to predict action sequences conditioned on both audio and vertex states. Extensive experiments on the VOCASET and BIWI datasets demonstrate that our approach significantly outperforms state-of-the-art methods and particularly excels at generating dynamic, expressive, and naturally smooth facial animations.

Abstract:
Soft robots offer safe and adaptive interaction with humans and unstructured environments through their inherent ability to deform and comply. Pneumatic actuators are one way to build soft robots. They are typically made from soft silicone materials and are especially effective for driving such systems, enabling smooth and adaptable motion. However, their compliant nature also makes them vulnerable to mechanical failures like punctures and tears, limiting practical deployment. To address this, we propose a puncture detection system for soft actuators using motion data from a single inertial measurement unit. Extracted features are used to train anomaly detectors for puncture detection and non-linear models to estimate severity. We also introduce a multi-chamber pneumatic soft bending actuator capable of diverse configurations via selective chamber inflation. Our algorithm identifies the punctured chamber and provides a severity score using a chamber perturbation scheme. Anomaly detectors are trained on normal operation data and detect damage through reconstruction errors, while severity is estimated by a separate model trained under slightly modified conditions. Finally, we demonstrate a failure recovery strategy to maintain actuation force post-failure. This approach enhances the reliability and safety of soft robotic systems through real-time, data-driven damage detection.

Abstract:
Localization systems often rely heavily on visual information, which can degrade under challenging conditions such as variable lighting, dynamic objects, or repetitive textures. To enhance robustness beyond single-image methods, we model localization as a structural point cloud registration problem, leveraging motion continuity and geometric consistency over time. This formulation reduces sensitivity to transient occlusions and appearance changes, enabling the system to resolve ambiguities that single-image techniques often cannot. In this work, we introduce Struct-Loc, a localization framework that advances structural point cloud registration through confidence-aware hierarchical localization. By estimating the reliability of structural regions and incorporating it into the matching process, Struct-Loc generates robust descriptors tailored for pose estimation. To achieve near real-time performance, Struct-Loc combines efficient point convolutional encoders, a caching mechanism, and a hierarchical coarse-to-fine matching strategy that progressively narrows the search space. It consistently outperforms strong baselines in both accuracy and runtime, while achieving a 100× compression of the global map compared to COLMAP, significantly improving storage efficiency. We validate Struct-Loc on the LaMAR benchmark, demonstrating its effectiveness and robustness under real-world conditions.

Abstract:
High-altitude balloons (HABs) are common in scientific research due to their wide range of applications and low cost. Because of their nonlinear, underactuated dynamics and the partial observability of wind fields, prior work has largely relied on model-free reinforcement learning (RL) methods to design near-optimal control schemes for station-keeping. These methods often compare only against hand-crafted heuristics, dismissing model-based approaches as impractical given the system complexity and uncertain wind forecasts. We revisit this assumption about the efficacy of model-based control for station-keeping by developing First-Order Model Predictive Control (FOMPC). By implementing the wind and balloon dynamics as differentiable functions in JAX, we enable gradient-based trajectory optimization for online planning. FOMPC outperforms a state-of-the-art RL policy, achieving a 24% improvement in time-within-radius (TWR) without requiring offline training, though at the cost of greater online computation per control step. Through systematic ablations of modeling assumptions and control factors, we show that online planning is effective across many configurations, including under simplified wind and dynamics models.

Abstract:
In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point cloud feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3 (64.0%) and far higher than DP (34.8%), while in real-world tasks, it reaches 87.9%, outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source DRRM (D-Robotics Robotic Manipulation), a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training (e.g., bf16, fp16). It is compatible with visuomotor policies such as DP and DP3, and also supports the RoboTwin simulator. VO-DP is integrated into DRRM. We refer to the project page for the code and videos.

Abstract:
Bimanual coordination is essential for many real-world manipulation tasks, yet learning bimanual robot policies is limited by the scarcity of bimanual robots and datasets. Single-arm robots, however, are widely available in research labs. Can we leverage them to train bimanual robot policies? We present MonoDuo, a framework for learning bimanual manipulation policies using single-arm robot demonstrations paired with human collaboration. MonoDuo collects data by teleoperating a single-arm robot to perform one side of a bimanual task while a human performs the other, then swapping roles to cover both sides. RGB-D observations from a wrist-mounted and fixed camera are augmented into synthetic demonstrations for target bimanual robots using state-of-the-art hand pose estimation, image and point cloud segmentation, and inpainting. These synthetic demonstrations, grounded in real robot kinematics, are used to train bimanual policies. We evaluate MonoDuo on five tasks: box lifting, backpack packing, cloth folding, jacket zipping, and plate handover. Compared to approaches relying solely on human bimanual videos, MonoDuo enables zero-shot deployment on unseen bimanual robot configurations, achieving success rates up to 70%. With only 25 target robot demonstrations, few-shot finetuning further boosts success rates by 65-70% over training from scratch, demonstrating MonoDuo's effectiveness in efficiently transferring knowledge from single-arm robot data to bimanual robot policies. Project page: https://bimanual-monoduo.github.io

Abstract:
Language-guided grasping has emerged as a promising paradigm for enabling robots to identify and manipulate target objects through natural language instructions, yet it remains highly challenging in cluttered or occluded scenes. Existing methods often rely on multi-stage pipelines that separate object perception and grasping, which leads to limited cross-modal fusion, redundant computation, and poor generalization in cluttered, occluded, or low-texture scenes. To address these limitations, we propose GeoLanG, an end-to-end multi-task framework built upon the CLIP architecture that unifies visual and linguistic inputs into a shared representation space for robust semantic alignment and improved generalization. To enhance target discrimination under occlusion and low-texture conditions, we explore a more effective use of depth information through the Depth-guided Geometric Module (DGGM), which converts depth into explicit geometric priors and injects them into the attention mechanism without additional computational overhead. In addition, we propose Adaptive Dense Channel Integration, which adaptively balances the contributions of multi-layer features to produce more discriminative and generalizable visual representations. Extensive experiments on the OCID-VLG dataset, as well as in both simulation and real-world hardware, demonstrate that GeoLanG enables precise and robust language-guided grasping in complex, cluttered environments, paving the way toward more reliable multimodal robotic manipulation in real-world human-centric settings.

Abstract:
Reinforcement learning (RL) approaches based on Markov Decision Processes (MDPs) are predominantly applied in the robot joint space, often relying on limited task-specific information and partial awareness of the 3D environment. In contrast, episodic RL has demonstrated advantages over traditional MDP-based methods in terms of trajectory consistency, task awareness, and overall performance in complex robotic tasks. Moreover, traditional step-wise and episodic RL methods often neglect the contact-rich information inherent in task-space manipulation, especially considering the contact-safety and robustness. In this work, contact-rich manipulation tasks are tackled using a task-space, energy-safe framework, where reliable and safe task-space trajectories are generated through the combination of Proximal Policy Optimization (PPO) and movement primitives. Furthermore, an energy-aware Cartesian Impedance Controller objective is incorporated within the proposed framework to ensure safe interactions between the robot and the environment. Our experimental results demonstrate that the proposed framework outperforms existing methods in handling tasks on various types of surfaces in 3D environments, achieving high success rates as well as smooth trajectories and energy-safe interactions.

Abstract:
While generalist robot policies hold significant promise for learning diverse manipulation skills through imitation, their performance is often hindered by the long-tail distribution of training demonstrations. Policies learned on such data, which is heavily skewed towards a few data-rich head tasks, frequently exhibit poor generalization when confronted with the vast number of data-scarce tail tasks. In this work, we conduct a comprehensive analysis of the pervasive long-tail challenge inherent in policy learning. Our analysis begins by demonstrating the inefficacy of conventional long-tail learning strategies (e.g., re-sampling) for improving the policy's performance on tail tasks. We then uncover the underlying mechanism for this failure, revealing that data scarcity on tail tasks directly impairs the policy's spatial reasoning capability. To overcome this, we introduce Approaching-Phase Augmentation (APA), a simple yet effective scheme that transfers knowledge from data-rich head tasks to data-scarce tail tasks without requiring external demonstrations. Extensive experiments in both simulation and real-world manipulation tasks demonstrate the effectiveness of APA. Our code and demos are publicly available at: https://mldxy.github.io/Project-VLA-long-tail/.

Abstract:
Robotic ultrasound (US) has recently attracted considerable attention as a means to overcome the limitations of conventional US examinations, such as the strong operator dependence. However, the decision-making process of existing methods is often either rule-based or relies on end-to-end learning models that operate as black boxes. This is seen as a major barrier to clinical acceptance and raises safety concerns for widespread adoption in routine practice. To tackle this problem, we introduce RAG-RUSS, an interpretable framework capable of performing a full carotid examination in accordance with the clinical workflow while explicitly explaining both the current stage and the next planned action. Furthermore, given the scarcity of medical data, we incorporate retrieval-augmented generation to enhance generalization and reduce dependence on large-scale training datasets. The method was trained on data acquired from 28 volunteers, while an additional four volumetric scans recorded from previously unseen volunteers were reserved for testing. The results demonstrate that the method can identify the current scanning stage and autonomously plan probe motions to complete the carotid examination, encompassing both transverse and longitudinal planes.

Abstract:
Realistic and controllable panoramic LiDAR data generation is critical for scalable 3D perception in autonomous driving and robotics. Existing methods either perform unconditional generation with poor controllability or adopt text-guided synthesis, which lacks fine-grained spatial control. Leveraging a monocular RGB image as a spatial control signal offers a scalable and low-cost alternative, but it remains an open problem that faces three core challenges: (i) semantic and depth cues from RGB vary spatially, complicating reliable conditional generation; (ii) modality gaps between RGB appearance and LiDAR geometry amplify alignment errors under noisy diffusion; and (iii) maintaining structural coherence between monocular RGB and panoramic LiDAR is challenging, particularly in regions where the image and LiDAR do not overlap. To address these challenges, we propose Veila, a novel conditional diffusion framework that integrates: (i) a Confidence-Aware Conditioning Mechanism (CACM) that strengthens RGB conditioning by adaptively balancing semantic and depth cues according to their local reliability; (ii) Geometric Cross-Modal Alignment (GCMA) for robust RGB-LiDAR alignment under noisy diffusion; and (iii) Panoramic Feature Coherence (PFC) for enforcing global structural consistency across monocular RGB and panoramic LiDAR. Additionally, we introduce two metrics, Cross-Modal Semantic Consistency and Cross-Modal Depth Consistency, to evaluate alignment quality across modalities. Experiments on nuScenes, SemanticKITTI, and our proposed KITTI-Weather benchmark demonstrate that Veila achieves state-of-the-art generation fidelity and cross-modal consistency, while enabling generative data augmentation that improves downstream LiDAR semantic segmentation.

Abstract:
Accurate visual localization often relies on dense, high-fidelity 3D models, which provide rich geometric and photometric detail but are expensive to acquire, heavy to store, and limited in scalability. As an alternative, lightweight city models represent only coarse building volumes, offering compactness, accessibility, and privacy but posing challenges for reliable alignment due to the lack of textures and fine structure. This work addresses these challenges by introducing a semantic equirectangular Gaussian Mixture-based virtual visual servoing approach that aligns real panoramic images with synthetic views rendered from lightweight building models. The method combines semantic building masks with Gaussian Mixtures, a seamless 360° formulation, and frequency-domain computation to overcome the poor gradients of direct photometric binary-mask alignment while maintaining computational efficiency. Experiments on outdoor trajectories demonstrate accurate and stable tracking, robustness under frame skipping, and resilience to dynamic occlusions through semantic masking. These results indicate that reliable localization is feasible with coarse city models, providing a scalable alternative to high-fidelity reconstructions and opening perspectives for deeper integration of semantic rules into the localization process.

Abstract:
Inspired by how humans reason over discrete objects and their relationships, we explore whether compact object-centric and object-relation representations can form a foundation for multitask robotic manipulation. Most existing robotic multitask models rely on dense embeddings that entangle both object and background cues, raising concerns about both efficiency and interpretability. In contrast, we study object-relation-centric representations as a pathway to more structured, efficient, and explainable visuomotor control. Our contributions are two-fold. First, we introduce LIBERO+, a fine-grained benchmark dataset designed to enable and evaluate object-relation reasoning in robotic manipulation. Unlike prior datasets, LIBERO+ provides object-centric annotations that enrich demonstrations with box- and mask-level labels as well as instance-level temporal tracking, supporting compact and interpretable visuomotor representations. Second, we propose SlotVLA, a slot-attention-based framework that captures both objects and their relations for action decoding. It uses a slot-based visual tokenizer to maintain consistent temporal object representations, a relation-centric decoder to produce task-relevant embeddings, and an LLM-driven module that translates these embeddings into executable actions. Experiments on LIBERO+ demonstrate that object-centric slot and object-relation slot representations drastically reduce the number of required visual tokens, while providing competitive generalization. Together, LIBERO+ and SlotVLA provide a compact, interpretable, and effective foundation for advancing object-relation-centric robotic manipulation.

Abstract:
The enhanced mobility brought by legged locomotion empowers quadrupedal robots to navigate through complex and unstructured environments. However, optimizing agile locomotion while accounting for the varying energy costs of traversing different terrains remains an open challenge. Most previous work focuses on planning trajectories with traversability estimation based on human-labeled environmental features. This human-centric approach is insufficient because it does not account for the varying capabilities of the robot locomotion controllers. We introduce a novel real-world learning pipeline that unifies offline demonstrations, online reinforcement learning, and multi-modal perception to achieve robust legged navigation. The framework employs multiple training stages to develop a planner that guides the robot in avoiding obstacles and hard-to-traverse terrains while reaching its goals. With the proposed method, a quadrupedal robot learns to perform traversability-aware navigation through real-world interactions in diverse offroad and unstructured environments. Moreover, the robot demonstrates the ability to generalize the learned navigation skills to unseen scenarios.

Abstract:
Differentiable Gaussian Splatting (GS) has emerged as a powerful paradigm for scene representation, enabling efficient rendering and real-time editing. However, existing GS-based methods, which rely mainly on clear visual images, perform poorly in underwater environments due to camera distortions such as light absorption and backscattering. In contrast, acoustic sensors like Forward Looking Sonar (FLS) offer superior penetration and robustness in such conditions. To leverage the complementary merits of visual and FLS images, we propose a novel Gaussian splatting framework customized for underwater scenarios, termed Aqua-Splat, for robust and accurate underwater perception. It ensures physically consistent reconstruction by incorporating sonar wave propagation modeling in the image formation process. Moreover, we propose a volume rendering technique for sonar image synthesis, achieving similar speed to visual rendering. Additionally, we introduce a sonar-guided densification strategy to optimize the scene representation. Through extensive experiments on both simulated and laboratory datasets, we demonstrate that Aqua-Splat significantly improves image synthesis and 3D scene reconstruction in challenging underwater environments, outperforming existing methods in terms of both geometric accuracy and photometric fidelity. The code of Aqua-Splat will be open-sourced later for the community.

Abstract:
In environments where robots operate with limited global navigation satellite system accessibility, ultra-wideband (UWB) localization technology is a popular auxiliary solution to assist visual-inertial odometry systems. However, current UWB approaches lack 3-D pairwise localization capability and suffer from rapidly declining localization update rates as the network scales, limiting their effectiveness for swarm robotic applications. This article presents a novel UWB sensor that enables 3-D pairwise localization and a localization scheme that can deliver robust, scalable, and accurate position awareness for multi-robot systems. Our approach begins with calibrating intrinsic UWB errors from hardware deviations and propagation effects, yielding high-accuracy distance and direction measurements. Using these measurements, we perform distributed relative localization through inter- and intra-node cooperation by integrating UWB and inertial measurement unit data. To enable swarm-scale operation, our platform implements the signal-multiplexing network ranging protocol to maximize update rates and network capacity. Experimental results show that our approach achieves centimeter-level localization accuracy at high update rates (100 Hz with UWB only), validating its robustness, scalability, and accuracy for robotic applications.

Abstract:
The challenges of traversing dynamic clutter lie mainly in efficiently perceiving the environmental dynamics and generating evasive behaviors that account for obstacle movement. Previous solutions have made progress in explicitly modeling dynamic obstacle motion for avoidance, but this key dependency of decision-making is time-consuming and unreliable in highly dynamic scenarios with occlusions. In contrast, without introducing object detection, tracking, or prediction, we empower sim-to-real reinforcement learning (RL) with single-LiDAR sensing to realize an autonomous flight system that maps directly from point to motion. For exteroception, a depth-sensing distance map that is fixed-shape, low-resolution, and detail-safe is encoded from raw point clouds, and an environment-change-sensing point flow is adopted as the motion feature extracted from multi-frame observations. These two are integrated into a lightweight and easy-to-learn representation of complex dynamic environments. For action generation, the behavior of avoiding dynamic threats in advance is implicitly driven by the proposed change-aware sensing representation, where the policy optimization is guided by a distance field modulated by relative motion. With deployment-friendly sensing simulation and dynamics-model-free acceleration control, the proposed system shows a superior success rate and adaptability compared to alternatives, and the policy derived from the simulator can drive a real-world quadrotor with safe maneuvers.

Abstract:
Temporal alignment of multiple signals through time warping is crucial in many fields, such as classification within speech recognition or robot motion learning. Almost all related works are limited to data in Euclidean space. Although an attempt was made in 2011 to adapt this concept to unit quaternions, a general extension to Riemannian manifolds remains absent. Given its importance for numerous applications in robotics and beyond, we introduce Riemannian Time Warping (RTW). This novel approach efficiently aligns multiple signals by considering the geometric structure of the Riemannian manifold in which the data is embedded. Extensive experiments on synthetic and real-world data, including tests with an LBR iiwa robot, demonstrate that RTW consistently outperforms state-of-the-art baselines in both averaging and classification tasks.
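
To make the idea of time warping on a manifold concrete, the sketch below runs classic dynamic time warping (DTW) with the Euclidean distance replaced by the geodesic (great-circle) distance on the unit sphere. This is only a minimal stand-in under stated assumptions, not the RTW algorithm itself: the choice of manifold, the pairwise (rather than multi-signal) alignment, and all function names are illustrative.

```python
import math


def sphere_distance(p, q):
    """Geodesic distance between two unit vectors on the sphere S^2."""
    dot = sum(a * b for a, b in zip(p, q))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.acos(dot)


def dtw(seq_a, seq_b, dist):
    """Classic DTW accumulated cost between two sequences under a given distance."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def equator_point(theta):
    """Hypothetical helper: a unit vector on the equator at angle theta (radians)."""
    return (math.cos(theta), math.sin(theta), 0.0)


if __name__ == "__main__":
    a = [equator_point(t) for t in (0.0, 0.5, 1.0)]
    b = [equator_point(t) for t in (0.0, 0.5, 0.5, 1.0)]  # same path, one repeated sample
    print(dtw(a, b, sphere_distance))
```

Because the two example signals trace the same path at different speeds, the warped cost is close to zero, whereas a sample-by-sample comparison would not even be defined for sequences of different length.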

Abstract:
This paper presents a novel decentralized approach for achieving emergent behavior in multi-agent systems with minimal information sharing. Based on prior work in simple orbits, our method produces a broad class of stable, periodic trajectories by stabilizing the system around a Lie group-based geometric embedding. By employing the Lie group SO(3), we generate a wider range of periodic curves than existing quaternion-based methods. Furthermore, we exploit SO(3) properties to eliminate the need for velocity inputs, allowing agents to receive only position inputs. We also propose a novel phase controller that ensures uniform agent separation, along with a formal stability proof. Validation through simulations and experiments showcases the method's adaptability to complex low-level dynamics and disturbances.

Abstract:
Transparent objects are common in daily life and industry, necessitating that robots be able to perceive and manipulate them. The physical properties of reflection and refraction pose challenges for accurately reconstructing the 3D geometry of transparent objects. Conventional methods, which rely on simultaneous estimation of background ambient light and complex refraction fields, lack robustness in real-world scenes, thereby impeding robotic grasping performance. To address this issue, this letter proposes TORM, a novel framework for robust reconstruction and manipulation of multiple transparent objects. TORM focuses on semantic information from transparent objects and employs multi-view segmentation masks to constrain a self-supervised multi-object deep marching tetrahedra (DMTet-Multi) 3D fitting process. To mitigate the risk of the geometry representation getting stuck in suboptimal solutions during multi-transparent-object reconstruction, we design a novel loss function that prevents marching tetrahedra from crossing boundaries. By applying a connectivity determination strategy to the fitted mesh, transparent objects can be processed in parallel by a grasp perception network, predicting the end-effector configuration for grasp tasks. Real-world experiments demonstrate that TORM achieves an 88.8% grasping success rate in multi-transparent-object grasping tasks.

Abstract:
Navigation signs and maps, such as floor plans and street maps, are widely available and serve as ubiquitous aids for way-finding in human environments. Yet, they are rarely used by robot systems. This paper presents SignLoc, a global localization method that leverages navigation signs to localize the robot on publicly available maps, specifically floor plans and OpenStreetMap (OSM) graphs, without prior sensor-based mapping. SignLoc first extracts a navigation graph from the input map. It then employs a probabilistic observation model to match directional and locational cues from the detected signs to the graph, enabling robust topo-semantic localization within a Monte Carlo framework. We evaluated SignLoc in diverse large-scale environments: part of a university campus, a shopping mall, and a hospital complex. Experimental results show that SignLoc reliably localizes the robot after observing only one to two signs.

Abstract:
Micro-positioning and pick-and-place applications at the millimeter scale are driving the development of smaller robots necessitating the use of alternative methods for design and manufacture. Additive manufacturing can enable significant cost and time savings in the fabrication of robots while having a low barrier to entry. Specifically, multimaterial 3D printing naturally lends itself to the creation of monolithic mechanisms by removing the requirement for manual assembly, in particular, when compliant joints can replace the rigid joints that are traditionally used. The lack of an assembly requirement naturally opens up the possibility of reducing the size scale of these mechanisms. In this work, the design, fabrication, and characterization of an additively manufactured mesoscale compliant parallel robot actuated by piezoelectric bimorphs through a compliant transmission mechanism is presented. The transmission mechanism is required to convert and amplify the small but rapid linear displacements of piezoelectric actuators into the large rotational motion that is required to create a large workspace for the compliant parallel robot. The developed planar parallel robot has a workspace with maximum planar extents of 14.36 mm by 8.66 mm, with a total area of 65.6 millimeters squared. Three different trajectories are tracked at frequencies of up to 10 Hz, demonstrating the robot's capability to rapidly follow trajectories in its workspace.

Abstract:
Monocular dense SLAM faces significant challenges in low-texture environments and under rapid camera motions. The recent development of 3D Gaussian Splatting (3DGS) offers a promising approach for real-time dense 3D reconstruction. However, existing 3DGS-based SLAM systems employ end-to-end optimization frameworks, which often struggle to achieve both efficient camera tracking and high-quality scene reconstruction simultaneously. To address this challenge, we propose a dense decoupled SLAM system that seamlessly integrates traditional visual odometry with 3DGS within a unified framework. Our system utilizes dense direct image alignment using pseudo-depth maps rendered from a global model, which is represented by an octree-managed structured Gaussian representation. This structured Gaussian supports fast rendering and efficient mesh extraction. Furthermore, we adopt a stereo 3D reconstruction model to generate dense depth maps from visual odometry for optimizing the 3D Gaussians. Experimental results demonstrate that our framework achieves state-of-the-art performance in both tracking robustness and reconstruction quality, outperforming existing monocular Gaussian-based SLAM systems while maintaining real-time efficiency.

Abstract:
Multi-robot coordination in shared workspaces is prone to deadlocks, which can compromise operational capabilities and task efficiency. Accurately determining the timing and spatial locations of deadlocks is essential for effective resolution, yet remains challenging due to dynamic robot interactions and growing system complexity. To this end, a distributed deadlock-aware control framework is proposed for robots to detect and avoid deadlocks while maintaining safe task execution. First, deadlocks are characterized by analyzing undesired equilibria in robot dynamics under safety constraints imposed by multiple stacked control barrier functions (CBFs). Our analysis reveals two critical properties: 1) deadlocks occur at intersections of all active CBF boundaries, and 2) deadlocks arise when the robot's stabilizing force is confined within the conical hull formed by active safety forces. These theoretical insights underpin a new detection method that identifies potential deadlocks from conflicts between safety requirements and task objectives. Furthermore, a reactive deadlock avoidance method is designed to help robots escape and prevent entry into potential deadlock regions by adaptively modulating the stabilizing force. A generalized workflow is established to systematically address deadlocks across various multi-robot tasks. Simulation and hardware experiments are conducted on robots collaborating in dense environments to validate the framework's effectiveness in preventing task failures caused by deadlocks.

Abstract:
Cooperative tasks are common in multi-agent systems. Closely cooperative tasks are a special case in which changing the state of the environment requires multiple agents to perform a specific operation at the same time. Take a box-pushing task as an example: the box is heavy and requires multiple agents to push it simultaneously. Optimal actions in a closely cooperative task are correlated with the actions of other agents, so an individually optimal action may be inconsistent with the group-optimal action, which introduces more Nash equilibrium policies that are not globally optimal. This makes it easier for a policy learned by reinforcement learning to fall into these locally optimal policies. In this paper, we propose a self-organised sequential multi-agent reinforcement learning algorithm (SOS-MARL). We propose sequential decision-making to change the optimization objective of each agent's policy so that the learned policy tends toward group-optimal policies. We also propose an automatic grouping mechanism to make the policy smoother for training and inference in large-scale agent environments. We factorize the joint action value outside the group into a combination of the group action values, thus guiding the agents to improve their group policies in a fine-grained manner. We deployed scenarios in both simulated and real environments and compared SOS-MARL with various classical MARL algorithms on box-pushing tasks, demonstrating the state-of-the-art performance of our method.

Abstract:
Coverage control is the problem of navigating a robot swarm to collaboratively monitor features or a phenomenon of interest not known a priori. The problem is challenging in decentralized settings with robots that have limited communication and sensing capabilities. We propose a learnable Perception-Action-Communication (LPAC) architecture for the problem, wherein a convolutional neural network (CNN) processes localized perception, a graph neural network (GNN) facilitates robot communications, and a shallow multi-layer perceptron (MLP) computes robot actions. The GNN enables collaboration in the robot swarm by computing what information to communicate with nearby robots and how to incorporate received information. Evaluations show that the LPAC models, trained using imitation learning, outperform standard decentralized and centralized coverage control algorithms. The learned policy generalizes to environments different from the training dataset, transfers to larger environments with more robots, and is robust to noisy position estimates. The results indicate the suitability of LPAC architectures for decentralized navigation in robot swarms to achieve collaborative behavior.

Abstract:
DexFruit is a robotic manipulation framework that enables gentle, autonomous handling of fragile fruit and precise evaluation of damage. Soft fruits have long suffered produce loss in both the harvesting and post-harvesting processes due to their extreme fragility and susceptibility to bruising, making them one of the hardest produce types to manipulate with automation. In this work, we demonstrate that, by using optical tactile sensing, autonomous manipulation of fruit with minimal damage can be achieved. We show that our tactile-informed diffusion policies outperform baselines in both reduced bruising and pick-and-place success rate across three fruits: strawberries, tomatoes, and blackberries. In addition, we introduce FruitSplat, a novel technique to represent and quantify visual damage in a high-resolution 3D representation via 3D Gaussian Splatting (3DGS). Existing metrics for measuring damage lack quantitative rigor or require expensive equipment. With FruitSplat, we distill a 2D fruit mask as well as a 2D bruise segmentation mask into the 3DGS representation from just a webcam video. Furthermore, this representation is modular and general, compatible with any relevant 2D model. Overall, we demonstrate a 92% grasping policy success rate, up to a 15% reduction in visual bruising, and up to a 31% improvement in grasp success rate on challenging fruit compared to our baselines across our three tested fruits. We rigorously evaluate this result with over 630 trials.

Abstract:
Construction sites are highly non-deterministic environments that change constantly. Complex tasks are usually required in these scenarios, and multi-agent systems have been explored as a flexible solution for them. Nevertheless, the uncertainty in these environments often makes the available information inaccurate, incomplete, or difficult to integrate into multi-agent systems. To successfully automate complex processes in construction environments, it is necessary to overcome the barrier imposed by the lack of accurate information. The research challenge presented here is the coordination of multi-agent systems in non-deterministic environments. In this proposal, a Multi-Agent Proximal Policy Optimization (MAPPO) system is proposed to create the necessary flexible framework. Various policy networks associated with different types of agents are trained over different scenarios. Different teams of agents are also proposed during the training process. This approach is intended to create a framework able to command different teams of agents independently of the constraints imposed by the information available about the environment.

Abstract:
This paper proposes a hybrid Agentic AI-FSM framework for robust natural-language-driven automation in safety-critical industrial robotics applications. Although natural-language procedures are commonplace in manufacturing, translating them into reliable robot programs remains labor-intensive. While Large Language Models (LLMs) offer strong parsing and planning capabilities, their inherent non-determinism and susceptibility to hallucinations preclude their direct use for robot control. To bridge this gap, our architecture employs an LLM-based planning agent to translate instructions offline into a structured task plan. Execution is then delegated to a deterministic Finite State Machine (FSM)-style execution engine to ensure reliability. Safety is further guaranteed by a multi-stage validation-simulation pipeline that verifies schema compliance and operational constraints through dry runs prior to deployment. For runtime anomalies, a RAG-enhanced Exception Handling Agent proposes recovery options, which are strictly mediated through a human-in-the-loop (HIL) interface for operator approval. Finally, a rule-based Safety Agent enforces physical constraints and provides an independent protection layer.

Abstract:
We present a neural radiance field (NeRF) based large-scale reconstruction system that fuses lidar and vision data to generate high-quality reconstructions that are geometrically accurate and capture photorealistic texture. Our system adopts the state-of-the-art NeRF representation to additionally incorporate lidar. Lidar data adds strong geometric constraints on the depth and surface normals, which is particularly useful when modelling uniformly textured surfaces that provide ambiguous visual reconstruction cues. A key contribution of this work is a novel method to quantify the epistemic uncertainty of the lidar-visual NeRF reconstruction by estimating the spatial variance of each point location in the radiance field given the sensor observations from the cameras and lidar. This provides a principled approach to evaluate the contribution of each sensor modality to the final reconstruction. In this way, reconstructions that are uncertain (due to, e.g., uniform visual texture, limited observation viewpoints, or little lidar coverage) can be identified and removed. Our system is integrated with a real-time pose-graph lidar SLAM system which is used to bootstrap a Structure-from-Motion (SfM) reconstruction procedure. It also helps to properly constrain the overall metric scale, which is essential for the lidar depth loss. The refined SLAM trajectory can then be divided into submaps using Spectral Clustering to group sets of co-visible images together. This submapping approach is more suitable for visual reconstruction than distance-based partitioning. Our uncertainty estimation is particularly effective when merging submaps, as their boundaries often contain artefacts due to limited observations. We demonstrate the reconstruction system using a multi-camera, lidar sensor suite in experiments involving both robot-mounted and handheld scanning with a total area of more than 20,000 m^2. Code and dataset are available at https://dynamic.robots.ox.ac.uk/projects/silvr/

Abstract:
As autonomous robotic systems become increasingly mature, users will want to specify missions at the level of intent rather than in low-level detail. Language is an expressive and intuitive medium for such mission specification. However, realizing language-guided robotic teams requires overcoming significant technical hurdles. Interpreting and realizing language-specified missions require advanced semantic reasoning. Successful heterogeneous robots must effectively coordinate actions and share information across varying viewpoints. Additionally, communication between robots is typically intermittent, necessitating robust strategies that leverage communication opportunities to maintain coordination and achieve mission objectives. In this work, we present a first-of-its-kind system where an unmanned aerial vehicle (UAV) and an unmanned ground vehicle (UGV) can collaboratively accomplish missions specified in natural language while reacting to changes in specification on the fly. We leverage a large language model-enabled planner to reason over semantic-metric maps that are built online and opportunistically shared between an aerial and a ground robot. We consider task-driven navigation in urban and rural areas. Our system must infer mission-relevant semantics and actively acquire information via semantic mapping. In both ground and air-ground teaming experiments, we demonstrate our system on seven different natural-language specifications at up to kilometer-scale navigation.

Authors: Trey Smith, Oleg Alexandrov, Jonathan Barlow, Jose Benavides, Maria Bualat, Roberto Carlino, Brian Coltin, Jose Cortez, Earl Daley, Jeffrey Feller, Lorenzo Flückiger, Terrence Fong, Jesse Fusco, Ruben Garcia Ruiz, Katie Browne, Simeon Kanis, Aric Katterhagen, Yunkyung Kim, John Love, Michael McIntyre, Blair McLachlan, Andres Mora, Zachary Moratto, Marina Moreira, Henry Orosco, In-Won Park, Christopher Provencher, Hugo Sanchez, Khaled Sharif, Ernest Smith, Ryan Soussan, Andrew Symington, Rafael Omar Talavera, Vinh To, Dawn Wheeler, Jongwoon Yoo

Abstract:
Enabling home-assistant robots to perceive and manipulate a diverse range of 3D objects based on human language instructions is a pivotal challenge. Prior research has predominantly focused on simplistic and task-oriented instructions, e.g., "Slide the top drawer open". However, many real-world tasks demand intricate multi-step reasoning, and without human instructions, these will become extremely difficult for robot manipulation. To address these challenges, we introduce a comprehensive benchmark, NrVLM, comprising 15 distinct manipulation tasks, containing over 4500 episodes meticulously annotated with fine-grained language instructions. We split the long-term task process into several steps, with each step having a natural language instruction. Moreover, we propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions. Specifically, we first identify the instruction to execute, taking into account visual observations and the end-effector's current state. Subsequently, our approach facilitates explicit learning through action-prompts and perception-prompts to promote manipulation-aware cross-modality alignment. Leveraging both visual observations and linguistic guidance, our model outputs a sequence of actionable predictions for manipulation, including contact points and end-effector poses. We evaluate our method and baselines using the proposed benchmark NrVLM. The experimental results demonstrate the effectiveness of our approach. For additional details, please refer to https://sites.google.com/view/naturalvlm.

Abstract:
Discovering the symbols and rules that can be used in long-horizon planning from a robot's unsupervised exploration of its environment and continuous sensorimotor experience is a challenging task. Previous studies proposed learning symbols from single or paired object interactions and planning with these symbols. In this work, we propose a system that learns rules with discovered object and relational symbols that encode an arbitrary number of objects and the relations between them, converts those rules to Planning Domain Description Language (PDDL), and generates plans that involve affordances of an arbitrary number of objects to achieve tasks. We validated our system with box-shaped objects of different sizes and showed that the system can develop symbolic knowledge of pick-up, carry, and place operations, taking into account object compounds in different configurations, for example, that boxes are carried together with the larger box they are placed on. We also compared our method with other symbol learning methods and showed that planning with the operators defined over relational symbols gives better planning performance compared to the baselines.

Abstract:
The interaction of robots with bendable objects in midair presents significant challenges in control, often resulting in performance degradation and potential crashes, especially for aerial robots due to their limited actuation capabilities and constant need to remain airborne. This paper presents an adaptive controller that enables two aerial vehicles to collaboratively follow a trajectory while transporting a bendable object without relying on explicit elasticity models. Our method allows on-the-fly adaptation to the object's unknown deformable properties, ensuring stability and performance in trajectory-tracking tasks. We use Lyapunov analysis to demonstrate that our adaptive controller is asymptotically stable. Our method is evaluated through hardware experiments in various scenarios, demonstrating the capabilities of using multirotor aerial vehicles to handle bendable objects.

Abstract:
Predicting the kinematics of bending pneumatic muscles (BPMs) remains challenging due to the necessity for models that effectively address the pronounced hysteresis and creep inherent in soft materials. While prior research has predominantly focused on phenomenological and data-driven modeling approaches, this study introduces a viscoelasticity-based mechanistic model (VBMM) and a feedforward-feedback hybrid control system tailored for BPMs. First, the VBMM is developed by leveraging the principles of viscoelasticity, a common property of soft materials and a mechanistic driver of hysteresis and creep. Second, we address the computational challenge arising from the history-dependent viscoelastic response of BPMs, where the current state depends on the cumulative stress-strain history. Conventional methods incur escalating computational costs over time, rendering real-time control impractical. To resolve this, we propose a sliding window-based long-term prediction mechanism (long-term VBMM) that maintains model accuracy while significantly reducing computational overhead. Finally, a hybrid control system integrating the long-term VBMM as a feedforward compensator with feedback correction is designed to achieve precise BPM motion tracking. Experimental validation confirms the VBMM's superior predictive accuracy (error < 3.69%) and demonstrates the control system's effectiveness.

Abstract:
Effective impact mitigation strategies are crucial for preventing potential damage to both robotic systems and their operational environments during high-velocity and dynamic maneuvers, as well as during the execution of high-precision tasks. The successful implementation of impact mitigation strategies in real-world applications fundamentally requires appropriate parameter tuning. However, owing to the destructive nature of collisions, heuristic parameter tuning is impractical, as it risks damage to both the robotic system and its operational environment during experimental trials. This study eliminates the need for preliminary collision experiments in parameter optimization by introducing a novel methodology that leverages recent proximity sensor-based preemptive impact mitigation strategies that reframe impact mitigation as a geometric rather than physical problem. The key innovation of this work lies in the reformulation of the proximity sensor output to enable both the analytical derivation of preemptive motion trajectories and the direct application of standard optimization solvers. The effectiveness of the proposed methodology is validated through numerical simulations and two different experimental configurations. By eliminating the need for collision trials, robotic systems can safely execute potentially destructive tasks that would otherwise result in system damage without proper impact mitigation.

Abstract:
The Multi-Agent Path Finding (MAPF) problem seeks to find conflict-free paths for multiple agents. However, most existing MAPF methods simplify agents to points or uniform circles, a model that fails when agents have diverse geometries or carry oversized loads. This oversimplification can lead to undetected collisions or the failure to find feasible paths. To address this, we propose AnyGeometry-CBS (AG-CBS), a novel extension of Conflict-Based Search (CBS) that accommodates agents of arbitrary, non-convex shapes. AG-CBS represents each agent's geometry via a set of grid cells and introduces enriched conflict definitions to handle complex interactions. To improve search efficiency, we develop a Multi-Constraint (MC) technique and a Shape Heuristic (SH) for suboptimal variants. Experimental results demonstrate that our method reduces runtime by up to 84.43% against optimal baselines and 88.24% against bounded-suboptimal ones, providing a general and effective solution to complex MAPF problems.
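
The idea of representing an agent's geometry as a set of grid cells can be illustrated with a minimal sketch. This is not the AG-CBS conflict definition, which the abstract does not spell out; it only shows a same-timestep overlap check between two footprints. The names `occupied_cells`, `footprints_collide`, `L_SHAPE`, and `POINT` are hypothetical and chosen for illustration.

```python
from typing import FrozenSet, Tuple

Cell = Tuple[int, int]

def occupied_cells(shape: FrozenSet[Cell], position: Cell) -> FrozenSet[Cell]:
    """Translate a shape, given as cell offsets from a reference cell, to a grid position."""
    px, py = position
    return frozenset((px + dx, py + dy) for dx, dy in shape)

def footprints_collide(shape_a: FrozenSet[Cell], pos_a: Cell,
                       shape_b: FrozenSet[Cell], pos_b: Cell) -> bool:
    """Return True if the two agents occupy at least one common cell at the same timestep."""
    return not occupied_cells(shape_a, pos_a).isdisjoint(occupied_cells(shape_b, pos_b))

# Hypothetical shapes: a non-convex L-shaped agent and a single-cell agent.
L_SHAPE = frozenset({(0, 0), (1, 0), (0, 1)})
POINT = frozenset({(0, 0)})

# The single-cell agent fits into the notch of the L, which a bounding
# box or circle approximation of the L would wrongly report as a collision.
print(footprints_collide(L_SHAPE, (0, 0), POINT, (1, 1)))
print(footprints_collide(L_SHAPE, (0, 0), POINT, (1, 0)))
```

A full solver would also need conflicts that arise during moves between cells, which this sketch deliberately leaves out.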

Abstract:
Learning-based methods commonly treat state estimation in robotics as a sequence modeling problem. While this paradigm can be effective at maximizing end-to-end performance, models are often difficult to interpret and expensive to train, since training requires unrolling sequences of predictions in time. As an alternative to end-to-end trained state estimation, we propose a novel particle filtering algorithm in which models are trained from individual state transitions, fully exploiting the Markov property in robotic systems. In this framework, measurement models are learned implicitly by minimizing a denoising score matching objective. At inference, the learned denoiser is used alongside a (learned) dynamics model to approximately solve the Bayesian filtering equation at each time step, effectively guiding predicted states toward the data manifold informed by measurements. We evaluate the proposed method on challenging robotic state estimation tasks in simulation, demonstrating competitive performance compared to tuned end-to-end trained baselines. Importantly, our method offers the desirable composability of classical filtering algorithms, allowing prior information and external sensor models to be incorporated without retraining.

Abstract:
Spherical robots typically require at least two actuators to achieve controlled 2D planar motion. Here we present Rollbot, the first spherical robot capable of controllably maneuvering on a 2D plane with a single actuator, challenging this assumption. Rollbot rolls on the ground in a circular pattern and controls its motion by changing the trajectory's curvature by accelerating and decelerating its single motor and the attached mass according to our derived quasi-stable state dynamics and control laws. We present the theoretical analysis, design, and control of Rollbot, and demonstrate its ability to move in a controllable circular pattern and follow waypoints, validating the efficacy of the proposed theoretical framework.

Abstract:
Multi-agent reinforcement learning (MARL) provides a flexible solution for tackling task and motion planning challenges, particularly in swarm confrontation scenarios. By customizing termination conditions for diverse tasks, event-driven MARL reduces decision jitter caused by frequent task switching. However, it hinders robots from updating strategies on a consistent timescale, leading to misaligned information sharing that disrupts agent coordination. To address this, we propose a novel event-driven MARL approach that facilitates collaborative strategy learning under asynchronous conditions. The approach introduces an experience selection scheme tailored to diverse timescales, ensuring efficient training through synchronized information sharing among robots. By incorporating Transformers, our method enables robots to infer others' behaviors from historical data, optimizing collaborative strategies. Extensive experiments validate the effectiveness of our proposed approach.

Abstract:
Encoder-decoder networks are commonly used model architectures for dense prediction tasks, where the encoder typically employs a model pre-trained on upstream tasks, while the decoder is often either randomly initialized or pre-trained on other tasks. In this paper, we introduce ×Net, a novel framework that leverages a model pre-trained on upstream tasks as the decoder, fostering a "pre-trained encoder × pre-trained decoder" collaboration within the encoder-decoder network. ×Net effectively addresses the challenges associated with using pre-trained models in the decoding, applying the learned representations to enhance the decoding process. This enables the model to achieve more precise and high-quality dense predictions. By simply coupling the pre-trained encoder and pre-trained decoder, ×Net distinguishes itself as a highly promising approach. Remarkably, it achieves this without relying on decoding-specific structures or task-specific algorithms. Despite its streamlined design, ×Net outperforms advanced methods in tasks such as monocular depth estimation and semantic segmentation. The code is available at https://2j472no.github.io/xNet/.

Abstract:
In this work, we propose a structured methodology for the system identification of underwater vehicles through the design of optimal excitation trajectories. To this end, the trajectories are parameterized using Bézier curves, which ensure smooth and differentiable motion profiles while facilitating the enforcement of constraints through appropriate manipulation of the control points. An optimization problem is formulated to determine a dynamically feasible excitation trajectory that respects safety limits and maximizes the quality of the collected data, thereby enabling reliable estimation of the vehicle's dynamic parameters using least squares. The proposed methodology is experimentally validated in a laboratory water tank, where the dynamic parameters, identified from the optimized trajectory, are evaluated by predicting the vehicle's velocity through forward simulation on previously unseen trajectories.
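
As a rough illustration of why Bézier parameterization makes constraints easy to enforce through control points, the sketch below evaluates a curve in Bernstein form and computes the control points of its derivative. It is a generic textbook construction, not the authors' trajectory formulation. The function names and the example control points are hypothetical.

```python
from math import comb

import numpy as np

def bezier(control_points, t):
    """Evaluate a Bezier curve at parameters t in [0, 1] using the Bernstein basis."""
    P = np.asarray(control_points, dtype=float)       # shape (n + 1, d)
    t = np.atleast_1d(np.asarray(t, dtype=float))     # shape (T,)
    n = len(P) - 1
    basis = np.stack(
        [comb(n, i) * t**i * (1.0 - t) ** (n - i) for i in range(n + 1)], axis=1
    )                                                 # shape (T, n + 1)
    return basis @ P                                  # shape (T, d)

def derivative_control_points(control_points):
    """Control points of the derivative curve, itself a Bezier curve of degree n - 1."""
    P = np.asarray(control_points, dtype=float)
    n = len(P) - 1
    return n * (P[1:] - P[:-1])

# Hypothetical planar control points for a cubic segment.
cp = [[0.0, 0.0], [1.0, 2.0], [3.0, 2.0], [4.0, 0.0]]
ts = np.linspace(0.0, 1.0, 50)
positions = bezier(cp, ts)
velocities = bezier(derivative_control_points(cp), ts)  # derivative with respect to t
```

Because the Bernstein basis is non-negative and sums to one, the curve stays inside the convex hull of its control points. Box limits imposed on the position or derivative control points therefore bound the whole trajectory, which is one common reading of "enforcing constraints through the control points".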

Abstract:
Exploration of steep and irregular terrains, such as lunar caves and vertical rock faces, requires free-climbing robots capable of identifying and securely grasping natural handholds. This study introduces SureGrip, a novel framework for detecting handholds and evaluating grasp quality in free-climbing robots. By integrating depth-based contour extraction with gripper-specific contact analysis, SureGrip accurately identifies candidate handholds and quantifies their suitability using the proposed grasp metrics. Experimental results confirm that the framework can reliably detect handhold locations, estimate surface slopes, and distinguish between secure and unsuitable grasps across a range of artificial and natural surfaces. The findings emphasize the importance of both the number and placement of spine fingers for stable attachment. SureGrip thus enables informed handhold selection, improving climbing safety and efficiency.

Abstract:
Grasping of diverse objects in unstructured environments remains a significant challenge. Open-loop grasping methods, effective in controlled settings, struggle in cluttered environments. Grasp prediction errors and object pose changes during grasping are the main causes of failure. In contrast, closed-loop methods address these challenges in simplified settings (e.g., single object on a table) on a limited set of objects, with no path to generalization. We propose Grasp-MPC, a closed-loop 6-DoF vision-based grasping policy designed for robust and reactive grasping of novel objects in cluttered environments. Grasp-MPC incorporates a value function, trained on visual observations from a large-scale synthetic dataset of 2 million grasp trajectories that include successful and failed attempts. We deploy this learned value function in an MPC framework in combination with other cost terms that encourage collision avoidance and smooth execution. We evaluate Grasp-MPC on FetchBench and real-world settings across diverse environments. Grasp-MPC improves grasp success rates by up to 32.6% in simulation and 33.3% in real-world noisy conditions, outperforming open-loop, diffusion policy, transformer policy, and IQL approaches. Videos and more at http://grasp-mpc.github.io.

Abstract:
Autonomous exploration aims to efficiently map unknown environments, yet utilizing limited environmental information to achieve efficient path planning remains challenging. In this work, we focus on leveraging latent information in partial observations to predict the complete environmental structure, thereby furnishing a proposed path planner with the necessary context to devise a long-term optimal exploration strategy. Most existing prediction approaches extract environment features through convolutional neural networks (CNN) and infer the characteristics of neighboring regions. This information then feeds into a value function that evaluates candidate frontiers and guides the robot's planning. Notwithstanding its advantages over traditional heuristic methods, this paradigm remains inherently constrained by its lack of long-term foresight. To this end, we propose dPWM, a diffusion model-based framework for global map prediction, consisting of two key components. The first employs a DDPM with a variable mask to estimate the probability distribution of unknown regions and thereby predict structural features of the global map. We incorporate Gaussian heatmap positional fields into the denoising process via a cross-attention mechanism to enhance regional awareness. This guides the model to focus on nearby areas that are most valuable for exploration. Once the global predictive map is obtained, the second component, a purpose-designed Watchman Route Problem (WRP) solver, generates an optimal path from the current exploration state. Extensive evaluations show that dPWM reduces exploration path length by 18.53% on HouseExpo and achieves a 16.37% improvement in cross-domain generalization on Dungeon over SOTA baselines. Real-world experiments further validate its effectiveness in physical environments.

Abstract:
Smartphone-based teleoperation is gaining traction as a versatile remote control solution, using widely available hardware to provide a portable and scalable interface for telerobotics. However, a crucial limitation of such an approach is the lack of effective haptic feedback, which restricts accuracy and increases operator workload. While smartphones offer a low-entry barrier as well as both portability and scalability, current interfaces rely almost exclusively on visual cues. To address this gap, we investigate the use of symbolic haptic feedback delivered through an unmodified mobile device to support remote manipulation tasks. We designed a combined teleoperation task that integrates object sorting and peg-in-hole insertion, embedding five candidate haptic cues (i.e., contact, gripper state, alignment, error boundary, and motion initiation). A within-subjects study with 16 participants compared visual-only and visual-plus-haptic conditions. Results show that haptic augmentation reduced total errors by 42% and significantly lowered perceived workload. Continuous cues for alignment and error boundaries achieved the highest recognition rates of 94% and 81%, respectively, while brief state cues were less reliably interpreted. Post-task interviews highlighted user preference for simple, continuous, and intense signals in visually ambiguous scenarios. Our findings provide new design guidelines for haptic cue prioritisation and encoding strategies.

Abstract:
Shape estimation is fundamental for controlling continuously bending tensegrity manipulators, yet achieving it remains a challenge. Although using exteroceptive sensors makes the implementation straightforward, it is costly and limited to specific environments. Proprioceptive approaches, by contrast, do not suffer from these limitations. So far, several methods have been proposed; however, to our knowledge, there are no proven examples of large-scale tensegrity structures used as manipulators. This paper demonstrates that shape estimation of the entire tensegrity manipulator can be achieved using only the inclination angle information relative to gravity for each strut. Inclination angle information is intrinsic sensory data that can be obtained simply by attaching an inertial measurement unit (IMU) to each strut. Experiments conducted on a five-layer tensegrity manipulator with 20 struts and a total length of 1160 mm demonstrate that the proposed method can estimate the shape with an accuracy of 2.1% of the total manipulator length from arbitrary initial conditions under static conditions, and that it maintains stable shape estimation under external disturbances.
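
To make the notion of a per-strut inclination angle concrete, the following minimal sketch computes the angle between a strut axis and the vertical from a single accelerometer reading. It assumes the strut is static, so that the accelerometer measures only the reaction to gravity, and that the strut axis is known in the sensor frame. The function name and the default axis are hypothetical, and the abstract does not say how the authors extract the angle from the IMU.

```python
import numpy as np

def inclination_angle(accel, strut_axis=(0.0, 0.0, 1.0)):
    """Angle in radians between the strut axis and the measured 'up' direction.

    accel: accelerometer reading in the sensor frame, assumed static.
    strut_axis: direction of the strut in the sensor frame (assumed known).
    """
    a = np.asarray(accel, dtype=float)
    u = np.asarray(strut_axis, dtype=float)
    cos_theta = np.dot(a, u) / (np.linalg.norm(a) * np.linalg.norm(u))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# A strut aligned with the vertical and one lying horizontally.
upright = inclination_angle([0.0, 0.0, 9.81])
horizontal = inclination_angle([9.81, 0.0, 0.0])
```

Note that an accelerometer alone cannot observe rotation about the gravity vector, so this angle constrains each strut's tilt but not its heading.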

Abstract:
Recent advances in Embodied AI have enabled agents to perform increasingly complex tasks and adapt to diverse environments. However, deploying such agents in realistic human-centered scenarios, such as domestic households, remains challenging, particularly due to the difficulty of modeling individual human preferences and behaviors. In this work, we introduce PersONAL (PERSonalized Object Navigation And Localization), a comprehensive benchmark designed to study personalization in Embodied AI. Agents must identify, retrieve, and navigate to objects associated with specific users, responding to natural-language queries such as "find Lily's backpack". PersONAL comprises over 2,000 high-quality episodes across 30+ photorealistic homes from the HM3D dataset. Each episode includes a natural-language scene description with explicit associations between objects and their owners, requiring agents to reason over user-specific semantics. The benchmark supports two evaluation modes: (1) active navigation in unseen environments, and (2) object grounding in previously mapped scenes. Experiments with state-of-the-art baselines reveal a substantial gap to human performance, highlighting the need for embodied agents capable of perceiving, reasoning, and memorizing over personalized information, paving the way towards real-world assistive robots. Code and dataset available at: github.io/PersONAL

Abstract:
Mobile robots joining public spaces like sidewalks must care for pedestrian comfort. Many studies consider pedestrians' objective safety, for example, by developing collision avoidance algorithms, but not enough studies take the pedestrian's subjective safety or comfort into consideration. Quantifying comfort is a major challenge that hinders mobile robots from understanding and responding to human emotions. We empirically look into the relationship between the mobile robot-pedestrian interaction kinematics and subjective comfort. We perform one-on-one experimental trials, each involving a mobile robot and a volunteer. Statistical analysis of pedestrians' reported comfort versus the kinematic variables shows moderate but significant correlations for most variables. We use the findings and empirically design three comfort estimators/predictors based on the minimum distance, the minimum projected time-to-collision, and a composite estimator. The composite estimator employs all studied kinematic variables and reaches the highest prediction rate and classifying performance among the predictors. The composite predictor has an odds ratio of 3.67. In simple terms, when it identifies a pedestrian as comfortable, it is almost 4 times more likely that the pedestrian is comfortable rather than uncomfortable. The study provides a comfort quantifier for incorporating pedestrian feelings into path planners for more socially compliant robots.

Abstract:
Feature matching is a fundamental technique in visual perception, essential for tasks such as 3D reconstruction, SLAM, and visual localization. Existing detector-free methods often struggle with generalization due to their reliance on depth data, which is not available in many datasets. We propose PG-Match, a detector-free feature matching framework that leverages pose supervision instead of depth-based supervision, improving its generalization across diverse environments. Additionally, we introduce a Differentiable Outlier Rejection Module (DORM) to enhance global consistency and increase the inlier ratio. A coarse-to-fine matching strategy is employed for efficiency, where specially designed confidence scores are utilized to guide the sampling process. This ensures efficient convergence and avoids local optima. Experiments on the widely used MegaDepth-1500 dataset demonstrate that PG-Match consistently outperforms state-of-the-art approaches, highlighting the effectiveness of its pose-guided design. Additionally, experiments on the depth-free PhotoTourism dataset further evaluate generalization of PG-Match, and its performance is also assessed in a downstream Structure from Motion (SfM) task.

Abstract:
Embeddings from Visual-Language Models are increasingly utilized to represent semantics in robotic maps, offering an open-vocabulary scene understanding that surpasses traditional, limited labels. Embeddings enable on-demand querying by comparing embedded user text prompts to map embeddings via a similarity metric. The key challenge in performing the task indicated in a query is that the robot must determine the parts of the environment relevant to the query. This paper proposes a solution to this challenge. We leverage natural-language synonyms and antonyms associated with the query within the embedding space, applying heuristics to estimate the language space relevant to the query, and use that to train a classifier to partition the environment into matches and non-matches. We evaluate our method through extensive experiments, querying both maps and standard image benchmarks. The results demonstrate increased queryability of maps and images. Our querying technique is agnostic to the representation and encoder used, and requires limited training.
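
The baseline on-demand querying step described above, comparing an embedded text prompt to map embeddings via a similarity metric, can be sketched as follows. This shows only plain cosine-similarity thresholding, not the proposed synonym- and antonym-based classifier. The function names, the toy vectors, and the threshold value are hypothetical stand-ins for real encoder outputs.

```python
import numpy as np

def cosine_similarity(query, map_embeddings):
    """Cosine similarity between one query vector and each row of map_embeddings."""
    q = np.asarray(query, dtype=float)
    M = np.asarray(map_embeddings, dtype=float)
    q = q / np.linalg.norm(q)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M @ q

def query_map(query, map_embeddings, threshold=0.5):
    """Boolean mask of map elements whose similarity to the query meets the threshold."""
    return cosine_similarity(query, map_embeddings) >= threshold

# Toy 2-D embeddings standing in for encoder outputs.
query = [1.0, 0.0]
map_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = query_map(query, map_embeddings, threshold=0.5)
```

A fixed threshold like the one here has to be chosen by hand and rarely transfers across queries, which is the kind of difficulty a learned match/non-match partition aims to avoid.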

Abstract:
The deployment of robots in unstructured environments demands perception systems that are both accurate and resilient. While RGB-Thermal (RGB-T) fusion is promising, current trackers often fail due to rigid, non-adaptive fusion strategies and underutilized cross-modal cues, compromising reliability for robotics. We introduce DSTrack, a novel tracking framework that embeds two core mechanisms for robotic robustness: a Probability-Gated Dynamic Switch and a Synergistic Multi-Domain Enhancement Network. The switch acts as an online decision-maker, allowing the robot to dynamically select the most reliable fusion path based on real-time confidence estimation, enabling crucial adaptation to scene changes. The enhancement network concurrently strengthens target representations within each modality through tri-domain (channel, spatial, frequency) refinement and establishes compensatory links between modalities via a cross-attention module, ensuring performance even during partial sensor degradation. Extensive evaluations on RGB-T benchmarks demonstrate state-of-the-art accuracy. More critically, DSTrack exhibits key properties for robotic integration: real-time environmental adaptability, inherent sensor fault tolerance, and consistent output for downstream planning.

Abstract:
Pre-trained Vision-Language-Action (VLA) models have achieved remarkable success in improving robustness and generalization for end-to-end robotic manipulation. However, these models struggle with long-horizon tasks due to their lack of memory and reliance solely on immediate sensory inputs. To address this limitation, we propose Memory-Augmented Prompting for Vision-Language-Action model (MAP-VLA), a novel framework that empowers pre-trained VLA models with demonstration-derived memory prompts to augment action generation for long-horizon robotic manipulation tasks. To achieve this, MAP-VLA first constructs a memory library from historical demonstrations, where each memory unit captures information about a specific stage of a task. These memory units are implemented as learnable soft prompts optimized through prompt tuning. Then, during real-time task execution, MAP-VLA retrieves relevant memory through trajectory similarity matching and dynamically integrates it into the VLA model for augmented action generation. Importantly, this prompt tuning and retrieval augmentation approach operates as a plug-and-play module for a frozen VLA model, offering a lightweight and flexible solution to improve task performance. Experimental results show that MAP-VLA delivers up to 7.0% absolute performance gains in the simulation benchmark and 25.0% on real robot evaluations for long-horizon tasks, surpassing the current state-of-the-art methods.

Abstract:
Humanoid robots are envisioned to perform a wide range of tasks in human-centered environments, requiring controllers that combine agility with robust balance. Recent advances in locomotion and whole-body tracking have enabled impressive progress in either agile dynamic skills or stability-critical behaviors, but existing methods remain specialized, focusing on one capability while compromising the other. In this work, we introduce AMS (Agility Meets Stability), the first framework that unifies both dynamic motion tracking and extreme balance maintenance in a single policy. Our key insight is to leverage heterogeneous data sources: human motion capture datasets that provide rich, agile behaviors, and physically constrained synthetic balance motions that capture stability configurations. To reconcile the divergent optimization goals of agility and stability, we design a hybrid reward scheme that applies general tracking objectives across all data while injecting balance-specific priors only into synthetic motions. Further, an adaptive learning strategy with performance-driven sampling and motion-specific reward shaping enables efficient training across diverse motion distributions. We validate AMS extensively in simulation and on a real Unitree G1 humanoid. Experiments demonstrate that a single policy can execute agile skills such as dancing and running, while also performing zero-shot extreme balance motions like Ip Man's Squat, highlighting AMS as a versatile control paradigm for future humanoid applications.

Abstract:
Fine-grained 3D part segmentation is crucial for enabling embodied AI systems to perform complex manipulation tasks, such as interacting with specific functional components of an object. However, existing interactive segmentation methods are largely confined to coarse, instance-level targets, while non-interactive approaches struggle with sparse, real-world scans and suffer from a severe lack of annotated data. To address these limitations, we introduce PinPoint3D, a novel interactive framework for fine-grained, multi-granularity 3D segmentation, capable of generating precise part-level masks from only a few user point clicks. A key component of our work is a new 3D data synthesis pipeline that we developed to create a large-scale, scene-level dataset with dense part annotations, overcoming a critical bottleneck that has hindered progress in this field. Through comprehensive experiments and user studies, we demonstrate that our method significantly outperforms existing approaches, achieving an average IoU of around 55.8% on each object part under first-click settings and surpassing 71.3% IoU with only a few additional clicks. Compared to current state-of-the-art baselines, PinPoint3D yields up to a 16% improvement in IoU and precision, highlighting its effectiveness on challenging, sparse point clouds with high efficiency. Our work represents a significant step towards more nuanced and precise machine perception and interaction in complex 3D environments.

Abstract:
Robotic knot-tying represents a fundamental challenge in robotics due to the complex interactions between deformable objects and strict topological constraints. We present TWISTED-RL, a framework that improves upon the previous state-of-the-art in demonstration-free knot-tying (TWISTED), which smartly decomposed a single knot-tying problem into manageable subproblems, each addressed by a specialized agent. Our approach replaces TWISTED's single-step inverse model that was learned via supervised learning with a multi-step Reinforcement Learning policy conditioned on abstract topological actions rather than goal states. This change allows more delicate topological state transitions while avoiding costly and ineffective data collection protocols, thus enabling better generalization across diverse knot configurations. Experimental results demonstrate that TWISTED-RL manages to solve previously unattainable knots of higher complexity, including commonly used knots such as the Figure-8 and the Overhand. Furthermore, the increase in success rates and drop in planning time establish TWISTED-RL as the new state-of-the-art in robotic knot-tying without human demonstrations.

Abstract:
Task decomposition is critical for understanding and learning complex long-horizon manipulation tasks. Especially for tasks involving rich physical interactions, relying solely on visual observations and robot proprioceptive information often fails to reveal the underlying event transitions. This raises the requirement for efficient collection of high-quality multi-modal data as well as a robust segmentation method to decompose demonstrations into meaningful modules. Building on the idea of the handheld demonstration device Universal Manipulation Interface (UMI), we introduce TacUMI, a multi-modal data collection system that additionally integrates ViTac sensors, a force-torque sensor, and a pose tracker into a compact, robot-compatible gripper design, which enables synchronized acquisition of all these modalities during human demonstrations. We then propose a multi-modal segmentation framework that leverages temporal models to detect semantically meaningful event boundaries in sequential manipulations. Evaluation on a challenging cable mounting task shows more than 90% segmentation accuracy and highlights a remarkable improvement with more modalities, which validates that TacUMI establishes a practical foundation for both scalable collection and segmentation of multi-modal demonstrations in contact-rich tasks.

Abstract:
Safe multi-agent motion planning (MAMP) under task-induced constraints is a critical challenge in robotics. Many real-world scenarios require robots to navigate dynamic environments while adhering to manifold constraints imposed by tasks. For example, service robots must carry cups upright while avoiding collisions with humans or other robots. Despite recent advances in decentralized MAMP for high-dimensional systems, incorporating manifold constraints remains difficult. To address this, we propose a manifold-constrained Hamilton-Jacobi reachability (HJR) learning framework for decentralized MAMP. Our method solves HJR problems under manifold constraints to capture task-aware safety conditions, which are then integrated into a decentralized trajectory optimization planner. This enables robots to generate motion plans that are both safe and task-feasible without requiring assumptions about other agents' policies. Our approach generalizes across diverse manifold-constrained tasks and scales effectively to high-dimensional multi-agent manipulation problems. Experiments show that our method outperforms existing constrained motion planners and operates at speeds suitable for real-world applications.

Abstract:
We propose Synchronous Dual-Arm Rearrangement Planner (SDAR), a task and motion planning (TAMP) framework for tabletop rearrangement, where two robot arms equipped with 2-finger grippers must work together in close proximity to rearrange objects whose start and goal configurations are strongly entangled. To tackle such challenges, SDAR tightly knits together its dependency-driven task planner (SDAR-T) and synchronous dual-arm motion planner (SDAR-M) to intelligently sift through a large number of possible task and motion plans. Specifically, SDAR-T applies a simple yet effective strategy to decompose the global object dependency graph induced by the rearrangement task, producing better dual-arm task plans than solutions derived from optimal task plans for a single arm. Leveraging state-of-the-art GPU SIMD-based motion planning tools, SDAR-M employs a layered motion planning strategy to sift through many task plans for the best synchronous dual-arm motion plan while ensuring a high success rate. Comprehensive evaluation demonstrates that SDAR delivers a 100% success rate in solving complex, non-monotone, long-horizon tabletop rearrangement tasks with solution quality far exceeding the previous state-of-the-art. Experiments on two UR-5e arms further confirm that SDAR directly and reliably transfers to robot hardware. Source code and supplementary materials are available at https://github.com/arc-l/dual-arm.

Abstract:
The performance of legged locomotion is closely tied to the accuracy and comprehensiveness of state observations. Blind policies, which rely solely on proprioception, are considered highly robust due to the reliability of proprioceptive observations. However, these policies significantly limit locomotion speed and often require collisions with the terrain to adapt. In contrast, vision policies allow the robot to plan motions in advance and respond proactively to unstructured terrains with an online perception module. However, perception is often compromised by noisy real-world environments, potential sensor failures, and the limitations of current simulations in presenting dynamic or deformable terrains. Humanoid robots, with high degrees of freedom and inherently unstable morphology, are particularly susceptible to misguidance from deficient perception, which can result in falls or termination on challenging dynamic terrains. To leverage the advantages of both vision and blind policies, we propose VB-Com, a composite framework that enables humanoid robots to determine when to rely on the vision policy and when to switch to the blind policy under perceptual deficiency. We demonstrate that VB-Com effectively enables humanoid robots to traverse challenging terrains and obstacles despite perception deficiencies caused by dynamic terrains or perceptual noise.

Abstract:
Diffusion policies have shown impressive results in robot imitation learning, even for tasks that require satisfaction of kinematic equality constraints. However, task performance alone is not a reliable indicator of the policy's ability to precisely learn constraints in the training data. To investigate, we analyze how well diffusion policies discover these manifolds with a case study on a bimanual pick-and-place task that encourages fulfillment of a kinematic constraint for success. We study how three factors affect trained policies: dataset size, dataset quality, and manifold curvature. Our experiments show that diffusion policies learn a coarse approximation of the constraint manifold, with learning affected negatively by decreases in both dataset size and quality. However, manifold curvature showed inconclusive correlations with constraint satisfaction and task success. A hardware evaluation verifies the applicability of our results in the real world. Project website with additional results and visuals: https://diffusion-learns-kinematic.github.io/.

Abstract:
Modular Aerial Robot Systems (MARS) consist of multiple drone modules that are physically bound together to form a single structure for flight. Exploiting structural redundancy, MARS can be reconfigured into different formations to mitigate unit or rotor failures and maintain stable flight. Prior work on MARS self-reconfiguration has solely focused on maximizing controllability margins to tolerate a single rotor or unit fault for rectangular-shaped MARS. We propose TransforMARS, a general fault-tolerant reconfiguration framework that transforms arbitrarily shaped MARS under multiple rotor and unit faults while ensuring continuous in-air stability. Specifically, we develop algorithms to first identify and construct minimum controllable assemblies containing faulty units. We then plan feasible disassembly-assembly sequences to transport MARS units or subassemblies to form the target configuration. Our approach enables more flexible and practically feasible reconfiguration. We validate TransforMARS in challenging arbitrarily shaped MARS configurations, demonstrating substantial improvements over prior works in both the capacity of handling diverse configurations and the number of faults tolerated. The videos and source code of this work are available at https://github.com/RuiHuangNUS/TransforMARS

Abstract:
Vision-Language-Action (VLA) models like OpenVLA demonstrate impressive zero-shot generalization across robotic manipulation tasks but struggle to adapt to specific deployment environments where consistent high performance on a limited set of tasks is more valuable than broad generalization. We present EXPerience replayed, REtrieval augmented, Specialized VLA (ExpReS-VLA), a method that enables rapid on-device adaptation of pre-trained VLAs to target domains while preventing catastrophic forgetting through compressed experience replay and retrieval-augmented generation. Our approach maintains a memory-efficient buffer by storing extracted embeddings from OpenVLA's frozen vision backbone, reducing storage requirements by 97% compared to raw image-action pairs. During deployment, ExpReS-VLA retrieves the k most similar past experiences using cosine similarity to augment training batches, while a prioritized experience replay buffer preserves recently successful trajectories. To leverage failed attempts, we introduce Thresholded Hybrid Contrastive Loss (THCL), enabling the model to learn from both successful and unsuccessful demonstrations collected during deployment. Experiments on the LIBERO simulation benchmark show that ExpReS-VLA improves success rates from 82.6% to 93.1% on spatial reasoning tasks and from 61% to 72.3% on long-horizon tasks compared to base OpenVLA, with consistent gains across VLA architectures including π0 (+3.2 points) and OpenVLA-OFT (+1.7 points). Physical robot experiments across five manipulation tasks demonstrate that our approach achieves 98% success on both in-distribution and out-of-distribution tasks (with unseen backgrounds and objects), improving from 84.7% and 32% respectively for naive fine-tuning. ExpReS-VLA accomplishes this adaptation in 31 seconds using only 12 demonstrations on a single RTX 5090, making it practical for real-world deployment where robots must quickly specialize to their specific operating environment.
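
The retrieval step described above (selecting the k most similar past experiences by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the buffer layout as `(embedding, action_label)` pairs, the function names, and the toy 2D embeddings are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_top_k(query, buffer, k):
    """Return the k buffer entries whose embeddings are most similar to query.

    buffer is a list of (embedding, action_label) pairs; this layout is a
    hypothetical stand-in for a compressed experience buffer.
    """
    scored = [(cosine_similarity(query, emb), idx)
              for idx, (emb, _) in enumerate(buffer)]
    # Highest similarity first; ties broken by insertion order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [buffer[idx] for _, idx in scored[:k]]

# Toy buffer with made-up embeddings and labels.
buffer = [
    ([1.0, 0.0], "push_left"),
    ([0.0, 1.0], "lift"),
    ([0.9, 0.1], "push_left_slow"),
]
neighbors = retrieve_top_k([1.0, 0.05], buffer, k=2)
print([label for _, label in neighbors])
```

In practice the embeddings would come from the frozen vision backbone and the search would likely be vectorized; the linear scan here only shows the ranking logic.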

Abstract:
Local feature detection and description serve as the foundation for many 3D vision tasks. However, most existing algorithms rely on sharp images, resulting in degraded performance when motion blur occurs due to long exposure. To tackle this challenge, we propose an effective end-to-end model that jointly learns feature detection and description from blurred images in a self-supervised manner, without requiring any additional labeled data. Rather than simply mixing sharp and blurred samples during training, we design a student-teacher framework to explicitly transfer knowledge from sharp to blurred domains. The teacher model extracts local features from sharp images and enforces photometric consistency in feature space, which is then distilled to the student model trained on blurred inputs. To facilitate this knowledge transfer, we introduce two tailored loss functions, feature divergence loss and triplet knowledge distillation loss, both aimed at aligning feature representations under motion blur. Extensive experiments on homography estimation, relative pose estimation, and visual localization demonstrate that our method achieves state-of-the-art performance on blurred images, while maintaining competitive accuracy on sharp images.

Abstract:
Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO -- an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this setting, MANGO outperforms all other image translation methods we tested. In certain real-world tabletop manipulation tasks, MANGO augmentation increases shifted-view success rates by over 40 percentage points compared to policies trained without augmentation.

Abstract:
In the past decade, the adoption of compact 3D range sensors, such as LiDARs, has driven the development of robust state-estimation pipelines, making them a standard sensor for aerial, ground, and space autonomy. Unfortunately, poor propagation of electromagnetic waves underwater has limited the visibility-independent sensing options for underwater state estimation to acoustic range sensors, which provide only 2D and, at best, spatially ambiguous information. This paper, to the best of our knowledge, is the first study examining the performance, capacity, and opportunities arising from the recent introduction of the first compact 3D sonar. To that end, we introduce calibration procedures for extracting the extrinsics between the 3D sonar and a camera, and we provide a study of the acoustic response on different surfaces and materials. Moreover, we provide novel mapping and SLAM pipelines tested in deployments in underwater cave systems and other geometrically and acoustically challenging underwater environments. Our assessment showcases the unique capacity of 3D sonars to capture consistent spatial information, allowing for detailed reconstructions and localization in datasets extending to hundreds of meters. At the same time, it highlights remaining challenges related to acoustic propagation, as found also in other acoustic sensors. Datasets collected for our evaluations will be released and shared with the community to enable further research advancements.

Abstract:
We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: https://m80hz.github.io/kite/

Abstract:
Reasoning techniques such as Chain-of-Thought (CoT) have been widely adopted in Vision-Language-Action (VLA) models and demonstrate promising capabilities in end-to-end autonomous driving. However, recent efforts to integrate CoT reasoning often fall short in simple scenarios, introducing unnecessary computational overhead without improving decision quality. To address this, we propose AdaThinkDrive, a novel VLA framework with a dual-mode reasoning mechanism inspired by fast and slow thinking. First, our framework is pretrained on large-scale autonomous driving (AD) scenarios using both question-answering (QA) and trajectory datasets to acquire world knowledge and driving commonsense. During supervised fine-tuning (SFT), we introduce a two-mode dataset, fast answering (w/o CoT) and slow thinking (with CoT), enabling the model to distinguish scenarios that require reasoning from those that do not. Furthermore, an Adaptive Think Reward strategy is proposed in conjunction with Group Relative Policy Optimization (GRPO), which rewards the model for selectively applying CoT by comparing trajectory quality across different reasoning modes. Extensive experiments on the Navsim benchmark show that AdaThinkDrive achieves a PDMS of 90.3, surpassing the best vision-only baseline by 1.7 points. Moreover, ablations show that AdaThinkDrive surpasses both the never-Think and always-Think baselines, improving PDMS by 2.0 and 1.4, respectively. It also reduces inference time by 14% compared to the always-Think baseline, demonstrating its ability to balance accuracy and efficiency through adaptive reasoning.

Abstract:
Soft-rigid tendon-driven robotic hands are widely adopted due to their simple fabrication and effective compliance, enabling robust and adaptive grasping. However, achieving dexterity, such as in-hand manipulation, remains challenging because actuation systems typically constrain finger trajectories. This paper presents a novel parametric wave-structured 3-DoF compliant joint with tunable stiffness, designed to enhance dexterity while maintaining a compact form factor. The joint combines a compliant structure and a particular geometry with a Twisted String Actuation (TSA) system, allowing simultaneous modulation of joint stiffness and the mobility of a universal joint that can be used to resemble flexion/extension and abduction/adduction motion of the human hand fingers. Two tendons, independently actuated, control asymmetric bending and stiffness regulation, while a third tendon drives flexion/extension. Analytical modeling and numerical simulations are provided to characterize the kinematics, statics, and stiffness modulation properties of the joint. A functional prototype demonstrates significant improvements in workspace and dexterity when integrated as the base joint of a wearable robotic supernumerary finger. Experimental evaluations validate the proposed design and confirm its potential as a versatile building block for dexterous, lightweight, and adaptive robotic hands.

Abstract:
Path planning in obstacle-dense environments is a challenging problem, particularly for robots with asymmetric rectangular footprints. To address this problem, we propose a novel collision-checking approach, called a Rotatable Area, which represents a range of heading angles where the robot can rotate without colliding with obstacles. Based on the relationship between two rotatable areas, we define safe local motion and extend this concept to the RoA-Planner, a path planning framework in SE(2) dense space. We validate our planner through extensive simulations and real-world experiments in complex and narrow environments. The results demonstrate that our method achieves fast planning speed while ensuring safety and robustness, making it suitable for practical applications.

Abstract:
Despite the growing adoption of radar in robotics, the majority of research has been confined to homogeneous sensors, overlooking the integration and cross-modality challenges inherent in heterogeneous radar. This leads to significant difficulties in generalizing across diverse radar types, with modality-aware approaches that could leverage the complementary strengths of heterogeneous radar remaining unexplored. To bridge these gaps, we propose SHeRLoc, the first deep network tailored for heterogeneous radar, which utilizes radar cross-section polar matching to align multimodal radar data. Our hierarchical optimal transport-based feature aggregation generates rotationally robust multi-scale descriptors. By employing FFT-similarity-based data mining and adaptive margin-based triplet loss, SHeRLoc enables FOV-aware metric learning. SHeRLoc achieves an order of magnitude improvement in heterogeneous radar place recognition, increasing recall@1 from below 0.1 to 0.9, and paves the way for cross-modal localization.

Abstract:
Many robots are not equipped with a manipulator and many objects are not suitable for prehensile manipulation (such as boxes and large cylinders). In these cases, pushing is a simple yet effective non-prehensile skill for robots to interact with and further change the environment. Existing work often assumes a set of predefined pushing modes and fixed-shape objects. This work tackles the general problem of controlling a robotic fleet to push collaboratively numerous arbitrary objects to respective destinations, within complex environments of cluttered and movable obstacles. It incorporates several characteristic challenges for multi-robot systems such as online task coordination under large uncertainties of cost and duration, and for contact-rich tasks such as hybrid switching among different contact modes, and under-actuation due to constrained contact forces. The proposed method is based on combinatorial hybrid optimization over dynamic task assignments and hybrid execution via sequences of pushing modes and associated forces. It consists of three main components: (I) the decomposition, ordering and rolling assignment of pushing subtasks to robot subg

Abstract:
Narrow passage path planning is a prevalent problem in settings ranging from industrial to household sites, where planners often struggle to find feasible paths or require excessive computational resources. Given that deep penetration into the environment can cause optimization failure, we propose a framework that ensures feasibility throughout the process using a series of subproblems tailored to the narrow passage problem. We begin by decomposing the environment into convex objects and initializing collision constraints with a subset of these objects. By continuously interpolating the collision constraints while sequentially introducing the remaining objects, our proposed framework generates subproblems that guide the optimization toward solving the narrow passage problem. Several examples are presented to demonstrate how the proposed framework addresses narrow passage path planning problems.

Abstract:
This paper addresses the problem of robot navigation in mixed geometric/semantic 3D environments. Given a hierarchical representation of the environment, the objective is to navigate from a start position to a goal while satisfying task-specific safety constraints and minimizing computational cost. We introduce Hierarchical Class-ordered A* (HCOA*), an algorithm that leverages the environment's hierarchy for efficient and safe path-planning in mixed geometric/semantic graphs. We use a total order over the semantic classes and prove theoretical performance guarantees for the algorithm. We propose three approaches for higher-layer node classification based on the semantics of the lowest layer: a Graph Neural Network method, a k-Nearest Neighbors method, and a Majority-Class method. We evaluate our algorithm in simulations on two 3D Scene Graphs, comparing it to the state-of-the-art and assessing the performance of each classification approach. Results show that HCOA* reduces the computational time of navigation by up to 50%, while maintaining near-optimal performance across a wide range of scenarios.
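
As a rough illustration of the Majority-Class idea mentioned above, the sketch below labels a higher-layer node with the most frequent semantic class among its lowest-layer children. The abstract does not specify the details, so the tie-breaking rule (prefer the class that ranks later in the total order, i.e. the more conservative label), the function name, and the example class names are all assumptions.

```python
from collections import Counter

def majority_class(child_classes, class_order):
    """Label a parent node with the most frequent class among its children.

    class_order lists the semantic classes from most to least preferred
    (a hypothetical total order). Ties in frequency are resolved toward the
    less preferred class, which is an assumed conservative choice.
    """
    if not child_classes:
        raise ValueError("a parent node needs at least one child")
    counts = Counter(child_classes)
    rank = {cls: position for position, cls in enumerate(class_order)}
    return max(counts, key=lambda cls: (counts[cls], rank[cls]))

# Hypothetical classes, ordered from most to least preferred for traversal.
order = ["floor", "carpet", "hazard"]
print(majority_class(["floor", "floor", "carpet"], order))
print(majority_class(["floor", "hazard"], order))
```

A different tie-breaking rule would be equally easy to plug in by changing the second element of the sort key.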

Abstract:
This paper proposes a lightweight decentralized solution for multi-robot coordinated navigation with cooperative perception. First, we introduce a rapid way to process sensory data, thus obtaining safe directions and key environmental features. Then, an information flow is created to facilitate real-time perception sharing over wireless ad-hoc networks. Consequently, the environmental uncertainties of each robot are reduced by interaction fields that deliver complementary information. Finally, path optimization is achieved in a probabilistic way, enabling self-organized coordination with effective convergence, divergence, and collision avoidance. Our method is fully interpretable and ready for deployment without gaps. Comprehensive simulations and real-world experiments demonstrate reduced path redundancy, robust performance across various tasks, and minimal demands on computation and communication.

Abstract:
Industrial robotics demands significant energy to operate, making energy-reduction methodologies increasingly important. Strategies for planning minimum-energy trajectories typically involve solving nonlinear optimal control problems (OCPs), which rarely cope with real-time requirements. In this paper, we propose a paradigm for generating near minimum-energy trajectories for manipulators by learning from optimal solutions. Our paradigm leverages a residual learning approach, which embeds boundary conditions while focusing on learning only the adjustments needed to steer a standard solution to an optimal one. Compared to a computationally expensive OCP-based planner, our paradigm achieves 87.3% of the performance near the training dataset and 50.8% far from the dataset, while being two to three orders of magnitude faster.
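
One common way to embed boundary conditions in a residual formulation is to multiply the learned correction by an envelope that vanishes at both endpoints, so the start and goal are met regardless of what the network outputs. The sketch below shows this for a single joint; it is an assumed construction, not necessarily the one used in the paper. The linear interpolation used as the "standard solution", the envelope `tau * (1 - tau)`, and the stand-in `residual` callable are all hypothetical.

```python
def standard_trajectory(q0, qf, tau):
    """Baseline joint position at normalized time tau in [0, 1].

    Linear interpolation is an assumed stand-in for the standard solution.
    """
    return q0 + (qf - q0) * tau

def residual_trajectory(q0, qf, tau, residual):
    """Baseline plus a correction that is forced to zero at tau = 0 and tau = 1.

    residual is any callable tau -> float, e.g. a trained network; here it is
    a hypothetical placeholder.
    """
    envelope = tau * (1.0 - tau)  # vanishes at both boundaries
    return standard_trajectory(q0, qf, tau) + envelope * residual(tau)

# A constant residual stands in for a learned model.
learned = lambda tau: 3.0
for tau in (0.0, 0.5, 1.0):
    print(tau, residual_trajectory(0.0, 1.0, tau, learned))
```

With this structure the model only has to learn the adjustment in the interior of the motion, while the endpoints are satisfied by construction.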

Abstract:
Curb following is a critical technology for autonomous road sweeping vehicles. However, existing solutions face two primary challenges: unreliable curb detection and inefficient motion generation. Unreliable curb detection stems from the wide variability in curb dimensions and types, as well as interference from roadside features such as vegetation and infrastructure. Inefficient motion generation occurs when existing methods prioritize tracking accuracy while neglecting task completion efficiency, leading to prolonged operation times. To address these challenges, we propose Curb-Tracker, an integrated curb-following system designed for autonomous vehicles operating in diverse road environments. Firstly, we develop a robust and adaptive curb detection algorithm that leverages a 2.5D elevation map of the local environment and dynamically adjusts key parameters online to ensure reliable detection across varying scenarios. Secondly, to achieve accurate and efficient curb-aligned motion generation, we leverage Model Predictive Contouring Control (MPCC) as a tailored framework specifically designed for the curb-following task to generate an optimal control sequence for the vehicle to ma

Abstract:
High-definition (HD) map learning serves as an essential component of autonomous driving scene understanding, providing structured priors for planning and prediction. Recent transformer-based methods regress vectorized map elements via deformable attention over Bird's-Eye View (BEV) features. They typically employ a single-pass paradigm, starting from a set of initial queries. However, these queries struggle to precisely localize map elements within the large-scale BEV space. This difficulty is severely amplified when using lightweight backbones that produce less distinctive features. To address this, we propose RefDiffMap, which recasts map construction as a progressive refinement process driven by a diffusion model. We introduce a novel denoising query generator that, at each step, leverages the intermediate noisy geometry to sample relevant features from adaptive BEV RoIs. These features are distilled into context-aware queries that guide the decoder's next refinement. This creates a powerful geometry-feature co-evolution loop, allowing the model to iteratively correct localization errors. Comprehensive experiments show that RefDiffMap achieves competitive performance on the nuScenes and Argoverse 2 datasets. Notably, its robustness is highlighted with a ResNet-18 backbone, where it improves mAP by a significant 11.3% over our baseline MapTRv2. Further ablation studies validate the effectiveness of our approach.

Abstract:
Different Length Alignment Sewing (DLAS), which involves stretching the shorter fabric to match the longer one and sewing them together in a straight line, is a challenging task that needs to satisfy several requirements when automating the sewing process. To address these challenges, this research proposes a novel robotic sewing system, Different Length Robotic Sewing System (DLRoSS), which consists of a roller-type end-effector attached to a 6-DoF manipulator. The end-effector is composed of active shorter- and longer-fabric rollers, and a passive press roller attached to the shorter-fabric roller. Assuming that one end of the two fabric layers is initially positioned under the sewing machine's presser foot, the system automates DLAS by operating in four distinct phases. (P1) Fabric wrapping: Individual fabric layers are picked, held, and wrapped from the other end onto the feed rollers. (P2) Sewing: During sewing, the shorter fabric is stretched and aligned with the longer fabric in real time using roller velocity control based on the sewing speed and the a priori known length ratio. (P3) Sewing completion: In the final sewing round on the fabric rollers, the press roller is engaged to prevent the stretched fabric from slipping off due to internal tension. (P4) Sewn fabric release: At the end of sewing, the fabric edge moves past the press roller, and the fabric releases from the rollers. Experimental results demonstrate that DLRoSS achieves consistent, high-quality sewing of stretchable fabrics of different materials and lengths.
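
To make the role of the length ratio in phase (P2) concrete, the sketch below computes feed speeds under a simple kinematic assumption that is not stated in the abstract: both fabrics must finish at the same time, so the longer fabric is unwound at the sewing speed while the shorter fabric is unwound more slowly by the ratio of the unstretched lengths, which stretches it uniformly to the longer length. The function name, argument names, and units are hypothetical.

```python
def roller_feed_speeds(sewing_speed, short_length, long_length):
    """Return (short_roller_speed, long_roller_speed) as surface speeds.

    Assumed model: the longer fabric is fed at the sewing speed, and the
    shorter fabric is released at sewing_speed * (short / long), so that it
    is stretched uniformly and both fabrics run out simultaneously.
    """
    if short_length <= 0 or long_length <= 0:
        raise ValueError("fabric lengths must be positive")
    if short_length > long_length:
        raise ValueError("short_length must not exceed long_length")
    ratio = short_length / long_length
    return sewing_speed * ratio, sewing_speed

# Example: 20 mm/s sewing speed, 450 mm fabric sewn onto a 500 mm fabric.
short_speed, long_speed = roller_feed_speeds(20.0, 450.0, 500.0)
print(short_speed, long_speed)
```

A real controller would also need to account for roller radius changes as fabric unwinds and for slip, which this sketch ignores.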

Abstract:
Although quadcopters boast impressive traversal capabilities enabled by their omnidirectional maneuverability, the need for continuous pilot control in complex environments impedes their application in GNSS and telemetry-denied scenarios. To this end, we propose a novel sensorimotor policy that uses stereo-vision depth and visual-inertial odometry (VIO) to autonomously navigate through obstacles in an unknown environment to reach a goal point. The policy is comprised of a pre-trained autoencoder as the perception head followed by a planning and control LSTM network which outputs velocity commands that can be followed by an off-the-shelf commercial drone. We leverage reinforcement and privileged learning paradigms to train the policy in simulation through a two-stage process: 1) initial training with optimal trajectories generated by a global motion planner acting as a supervisory backbone, 2) further fine-tuning in a curriculum environment. To bridge the sim-to-real gap, we employ domain randomization and reward shaping to create a policy that is both robust to noise and domain shift. In outdoor experiments, our approach achieves successful zero-shot transfer to both obstacle environments and a drone platform that were never encountered during training.

Abstract:
This letter addresses the challenging problem of Semi-Constrained End-Effector Path Planning for robotic manipulators. This problem arises when complex specifications restrict the end-effector's motion during the execution of industrial tasks. Traditional path planning algorithms often struggle with such problems due to the difficulty of exploring the robot's valid configuration space, or constrained manifold, under these conditions. In this work, we propose a novel sampling-based approach that efficiently navigates the constrained manifold by exploring an alternative space representing the end-effector's degrees of freedom, such as process-related tolerances, throughout the task. This method retains the simplicity of sampling-based techniques. Building on this approach, we introduce the F-RRT algorithm, an adaptation of the renowned RRT planner (LaValle and Kuffner, 2001). F-RRT demonstrates enhanced speed and robustness compared to existing solutions, particularly in complex and cluttered environments.

Abstract:
In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation, and lack of controllability. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.

Abstract:
Isoperimetric robots can dramatically change shape to adapt to different tasks. They are built from triangle modules, each formed by a continuous structural member that passes through three roller units, one at each corner. The robot changes shape as the roller units drive along the structural member, changing the location of the joints. Previous designs used inflated fabric tubes as the structural member, but these systems are prone to leaking and changes in pressure due to temperature effects. We present an isoperimetric robot composed of tape-springs (curved spring steel tapes) as the primary structural member, and assemble an octahedron robot. We detail the design of the roller modules that can drive along the tape spring. We also show that with tape springs, all three roller units at the vertices of each triangle can drive along the tape spring. This increases the robot's speed moving between configurations and enables new types of behaviors, such as motion of the beam without motion of the rollers. We also present an optimization procedure for the tape spring isoperimetric robot that minimizes the time required to reach a desired configuration, assuming each roller is limited to a maximum speed.

Abstract:
Conventional robotic grippers relying on external force sensors or high gear-ratio actuators suffer from high mechanical impedance and limited control bandwidth. To address these limitations, this study proposes a novel 9-DOF, three-fingered Direct-Drive Differential (DDD) gripper that integrates DD motors with a low gear ratio (1:2) differential transmission. This mechanism centralizes the actuator mass at the base to achieve an ultra-low inertia design, while the differential architecture couples motors in parallel to amplify torque for flexion movements. Performance evaluations demonstrate that the prototype delivers a nominal grasping force of 15 N and a fingertip force of 3.1 N, while maintaining a remarkably low system inertia (motor contribution of 0.236%) and mechanical impedance (<700 N/m) within the typical human manipulation frequency range. The proposed hardware successfully resolves the trade-offs among torque, transparency, and kinematics, establishing a robust foundation for highly responsive, sensorless proprioceptive force estimation in dynamic environments.

Abstract:
Robotic coverage tasks often require teams of robots to not only survey regions of interest but also trace and interact with linear features such as cracks, seams, or pipelines. We term this the double coverage problem, where robots must balance two competing roles: wide-area exploration for inspection and precise trajectory following for servicing linear structures. This paper develops an optimal multi-robot planning framework that unifies area coverage and line servicing. We formulate a topological analysis in manifold space and introduce the hierarchical cyclic merging regulation (HCMR) method, for which optimality under a fixed sweep direction is proven. The framework is experimentally validated for a multi-robot crack survey and filling application. Benchmark comparisons demonstrate that HCMR reduces planned path length by at least 10.0%, shortens task completion time by at least 16.9%, and ensures complete crack coverage with virtually conflict-free operation, outperforming state-of-the-art coverage planners. These results highlight the feasibility and efficiency of deploying topology-informed multi-robot planning for practical inspection and repair scenarios.

Abstract:
Imitation learning for robotics often uses action chunking to mitigate the compounding errors associated with autoregressive policies. By predicting multiple future actions simultaneously, action chunking limits the accumulation of errors but introduces new difficulties. In particular, it relies on outdated observations to predict future actions, which can lead to inaccuracies. In this study, we propose Shifted Flow Policy (SFP), a simple yet effective alternative to action chunking. The SFP reparameterizes time by linearly shifting the time steps for future actions, thereby capturing the natural increase in uncertainty over time. This formulation allows each predicted action to be conditioned on up-to-date observations. Experimental results on the Push-T and MimicGen benchmarks demonstrate that SFP outperforms state-of-the-art action chunking methods across a variety of manipulation tasks by achieving higher success rates and faster inference. These findings suggest that shifted flow provides a robust and practical alternative to action chunking in visuomotor policy learning. Our code is available at https://shifted-flow-policy.github.io

Abstract:
Large-scale data has driven breakthroughs in robotics, from language models to vision-language-action models in bimanual manipulation. However, humanoid robotics lacks equally effective data collection frameworks. Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid teleoperation and data collection system that preserves full whole-body control while advancing scalability. Our system leverages PICO4U VR for obtaining real-time whole-body human motions, with a custom 2-DoF robot neck (cost around 250) for egocentric vision, enabling holistic human-to-humanoid control. We demonstrate long-horizon dexterous and mobile humanoid skills and we can collect 100 demonstrations in 15 minutes with an almost 100% success rate. Building on this pipeline, we propose a hierarchical visuomotor policy framework that autonomously controls the full humanoid body based on egocentric vision. Our visuomotor policy successfully demonstrates whole-body dexterous manipulation and dynamic kicking tasks. The entire system is fully reproducible and open-sourced at https://yanjieze.com/TWIST2/ . Our collected dataset is also open-sourced at https://twist-data.github.io/ .

Abstract:
In recent years, there has been a surge of interest in open-vocabulary 3D scene reconstruction facilitated by visual language models (VLMs), which showcase remarkable capabilities in open-set retrieval. However, existing methods face some limitations: they either focus on learning point-wise features, resulting in blurry semantic understanding, or solely tackle object-level reconstruction, thereby overlooking the intricate details of the object's interior. To address these challenges, we introduce OpenObj, an innovative approach to build open-vocabulary object-level Neural Radiance Fields (NeRF) with fine-grained understanding. In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object-level. Moreover, we incorporate part-level features into the neural fields, enabling a nuanced representation of object interiors. This approach captures object-level instances while maintaining a fine-grained understanding. The results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic segmentation and retrieval tasks. Additionally, OpenObj supports real-world robotics tasks at multiple scales, including global movement and local manipulation.

Abstract:
Recent advancements in visual odometry systems have improved autonomous navigation, yet challenges persist in complex environments like forests, where dense foliage, variable lighting, and repetitive textures compromise the accuracy of feature correspondences. To address these challenges, we introduce ForestGlue. ForestGlue enhances the SuperPoint feature detector through four configurations - grayscale, RGB, RGB-D, and stereo-vision inputs - optimised for various sensing modalities. For feature matching, we employ LightGlue or SuperGlue, both of which have been retrained using synthetic forest data. ForestGlue achieves pose estimation accuracy comparable to baseline LightGlue and SuperGlue models, yet requires only 512 keypoints, just 25% of the 2048 keypoints used by baseline models, to achieve an LO-RANSAC AUC score of 0.745 at a 10 degree threshold. With only a quarter of the keypoints required, ForestGlue has the potential to reduce computational overhead whilst being effective in dynamic forest environments, making it a promising candidate for real-time deployment on resource-constrained platforms such as drones or mobile robotic platforms. By combining ForestGlue with a novel transformer-based pose estimation model, we propose ForestVO, which estimates relative camera poses using the 2D pixel coordinates of matched features between frames. On challenging TartanAir forest sequences, ForestVO achieves an average relative pose error (RPE) of 1.09 m and kitti_score of 2.33%, outperforming direct-based methods such as DSO in dynamic scenes, while maintaining competitive performance with TartanVO despite being a significantly lighter model trained on only 10% of the dataset. This work establishes an end-to-end deep learning pipeline tailored for visual odometry in forested environments, leveraging forest-specific training data to optimise feature correspondence and pose estimation for improved accuracy and robustness in autonomous navigation systems.

Abstract:
Robotics has long sought to develop robots capable of completing previously unseen long-horizon tasks. Hierarchical approaches offer a pathway for achieving this goal by executing skill combinations arranged by a task planner, with each visuomotor skill pre-trained using a specific imitation learning (IL) algorithm. However, even in simple long-horizon tasks like skill chaining, hierarchical approaches often struggle due to a problem we identify as Observation Space Shift (OSS), where the sequential execution of preceding skills causes shifts in the observation space, disrupting the performance of subsequent individually trained skill policies. To understand OSS and evaluate its impact on long-horizon tasks, we introduce BOSS (a Benchmark for Observation Space Shift). BOSS comprises three distinct challenges: "Single Predicate Shift," "Accumulated Predicate Shift," and "Skill Chaining," each designed to assess a different aspect of OSS's negative effect. We evaluated several recent popular IL algorithms on BOSS, including three Behavioral Cloning methods and the Visual Language Action model OpenVLA. Even on the simplest challenge, we observed average performance drops of 67%, 35%, 34%, and 54%, respectively, when comparing skill performance with and without OSS. Additionally, we investigate three potential solutions, including using frozen robotics-specific vision encoders, switching to 3D pointcloud-based inputs, and applying data augmentation to expand visual diversity. Our results show that none of these approaches are sufficient to resolve OSS. The project page is: https://boss-benchmark.github.io/

Abstract:
This work proposes a method for learning features from a batch of 2D sonar images to predict a multi-view point cloud for achieving dense 3D reconstruction. In comparison to vision-based sensors, acoustics are considered a reliable sensing modality in underwater environments. The output of a sonar is a 2D image, which is unable to represent the scanned scene in all three dimensions. Estimation of this missing information, known as the elevation angle, is the key to performing 3D reconstruction from acoustic images. One approach is to predict a depth map from the 2D sonar image and transform it into a point cloud. In this paper, this idea is further improved by learning features from a batch of 2D acoustic images and predicting multiple depth maps of the scanned object that cover it from different viewpoints. Because of the lack of datasets from real environments, the data for training the deep learning model was generated synthetically. To reduce the simulation-to-real gap, a Cycle-GAN was trained on real images to transfer the realistic style into the synthetically generated images. The experiments conducted in simulation showed that the proposed method is able to perform dense 3D reconstruction. The approach was then further tested in a real environment using an underwater vehicle, which accurately reconstructed the scanned objects in 3D, achieving an average chamfer distance error of 0.06 meters when compared to a laser-scanned ground truth.
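
For readers unfamiliar with the metric reported above, the following is a minimal sketch of a chamfer distance between two point clouds. The abstract does not say which variant was used, so this sketch assumes the symmetric form that averages unsquared nearest-neighbour Euclidean distances in both directions; the function name `chamfer_distance` is illustrative only.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric chamfer distance between point clouds a (N, 3) and b (M, 3).

    Assumed variant: mean unsquared nearest-neighbour distance, averaged
    over both directions. Other definitions (squared, summed) also exist.
    """
    # Pairwise Euclidean distances, shape (N, M). Brute force: fine for
    # small clouds, but memory grows as N * M.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    a_to_b = d.min(axis=1).mean()  # each point in a to its nearest in b
    b_to_a = d.min(axis=0).mean()  # each point in b to its nearest in a
    return 0.5 * (a_to_b + b_to_a)


if __name__ == "__main__":
    predicted = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    reference = predicted + np.array([0.0, 0.0, 0.1])
    print(chamfer_distance(predicted, reference))
```

For large clouds a spatial index (for example a k-d tree) would replace the dense distance matrix.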

Abstract:
Online planning under uncertainty remains a critical challenge in robotics and autonomous systems. While tree search techniques are commonly employed to construct partial future trajectories within computational constraints, most existing methods for continuous spaces discard information from previous planning sessions. This study presents a novel, computationally efficient approach that leverages historical planning data in current decision-making processes. We provide theoretical foundations for our information reuse strategy and introduce an algorithm based on Monte Carlo Tree Search (MCTS) that implements this approach. Experimental results demonstrate that our method significantly reduces computation time while maintaining high performance levels. Our findings suggest that integrating historical planning information can substantially improve the efficiency of online decision-making in uncertain environments, paving the way for more responsive and adaptive autonomous systems.

Abstract:
When a robot autonomously performs a complex task, it frequently must balance competing objectives while maintaining safety. This becomes more difficult in uncertain environments with stochastic outcomes. Enhancing transparency in the robot's behavior and aligning with user preferences are also crucial. This paper introduces a novel framework for multi-objective reinforcement learning that ensures safe task execution, optimizes trade-offs between objectives, and adheres to user preferences. The framework has two main layers: a multi-objective task planner and a high-level selector. The planning layer generates a set of optimal trade-off plans that guarantee satisfaction of a temporal logic task. The selector uses active inference to decide which generated plan best complies with user preferences and aids learning. Operating iteratively, the framework updates a parameterized learning model based on collected data. Case studies and benchmarks on both manipulation and mobile robots show that our framework outperforms other methods and (i) learns multiple optimal trade-offs, (ii) adheres to a user preference, and (iii) allows the user to adjust the balance between (i) and (ii).

Abstract:
Diffusion-based planners have achieved generalization comparable to classical planners by leveraging inference-time optimization through guidance. However, their limited ability to capture environmental variations often constrains their responsiveness in unseen settings. In addition, the diversity-consistency trade-off inherent in guidance has remained unresolved. In this work, we propose Prior-Constrained Explorative Guidance (PCEG), a novel approach that gathers environmental information through local exploration and prevents guided samples from converging prematurely to similar solutions by leveraging a trajectory prior. The collected information is included in the guidance via stochastic gradient estimation, while a succinct parameter scheduling strategy enables latent optimization driven by environmental signals without significant computational overhead. Furthermore, during the modal-seeking stages of the reverse diffusion process, we employ a Gaussian Process (GP) to enforce dynamics-informed priors, effectively constraining the exploration region of each sample and thereby enhancing solution diversity. Across diverse benchmarks including 7-degree-of-freedom (7-DoF) robot-arm manipulation, PCEG substantially improves success rate by up to 30 percentage points compared to competitive diffusion planners without compromising trajectory quality, even in scenarios involving unseen obstacles. Real-world experiments further validate these findings, showcasing the generation of smooth, collision-free trajectories in novel environments. The project page is available at https://rml-unist.github.io/PCEG/.

Abstract:
We introduce Hand-Object(HO)GraspFlow, an affordance-centric approach that retargets a single RGB image with hand-object interaction (HOI) into multi-modal executable parallel jaw grasps without explicit geometric priors on target objects. Building on existing learning-based hand reconstruction and the vision foundation model, we synthesize SE(3) grasp poses with denoising flow matching (FM), conditioned on the following three complementary cues: RGB foundation features as visual semantics, HOI contact reconstruction, and taxonomy-aware prior on grasp types. Our approach demonstrates high fidelity in grasp synthesis without explicit HOI contact input or object geometry, while maintaining strong contact and taxonomy recognition. Another controlled comparison shows that HOGraspFlow consistently outperforms diffusion-based variants (HOGraspDiff), achieving high distributional fidelity and more stable optimization in SE(3). We demonstrate a reliable, object-agnostic grasp synthesis from human demonstrations in real-world experiments, where an average success rate of over 83% is achieved. Code: https://github.com/YitianShi/HOGraspFlow

Abstract:
Contact-rich bimanual manipulation involves precise coordination of two arms to change object states through strategically selected contacts and motions. Due to the inherent complexity of these tasks, acquiring sufficient demonstration data and training policies that generalize to unseen scenarios remains a largely unresolved challenge. Building on recent advances in planning through contacts, we introduce Planning-Guided Diffusion Policy Learning (LIDE), an approach that effectively learns to solve contact-rich bimanual manipulation tasks by leveraging model-based motion planners to generate demonstration data in high-fidelity physics simulation. Through efficient planning in randomized environments, our approach generates large-scale and high-quality synthetic motion trajectories for tasks involving diverse objects and transformations. We then train a task-conditioned diffusion policy via behavior cloning using these demonstrations. To reduce the sim-to-real gap, we propose a set of designs in feature extraction, action prediction, and data augmentation that enable learning robust prediction of smooth action sequences and generalization to unseen scenarios. Through experiments in both simulation and the real world, we demonstrate that our approach can enable a bimanual robotic system to effectively manipulate objects of diverse geometries, dimensions, and physical properties.

Abstract:
The integration of human expertise into reinforcement learning has gained increasing attention as a means to improve sample efficiency and stability. Current approaches often depend on pre-collected expert demonstrations or virtual reality setups, which are costly to generate and difficult to adapt to dynamic training conditions. In this work, a framework is introduced that augments deep reinforcement learning with real-time demonstrations provided through mixed reality interaction. A structured robotic pick-and-place task serves as the benchmark, where a robot must execute sequential phases of grasping, transporting, and releasing an object. Expert guidance is delivered via mixed reality annotations, which are converted into reference trajectories and injected into the learning process whenever performance falls below a predefined threshold. A modified replay buffer accommodates both agent-generated and expert-generated transitions, allowing controlled sampling with a dynamically adjusted expert-to-agent ratio. Training in the real workspace through mixed reality reduces the simulation-to-reality gap considerably, as confirmed by experiments on a physical robot platform. Experimental evaluation demonstrates that the proposed framework accelerates policy convergence, ensures stability under noisy feedback, and achieves strong generalization to unseen task configurations. These findings highlight the potential of demonstration-augmented reinforcement learning through mixed reality as a data-efficient and robust approach to robot training in real-world scenarios.
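
The replay buffer described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the class `MixedReplayBuffer`, the function `expert_ratio_schedule`, and the linear schedule tied to a success-rate threshold are all assumptions introduced here.

```python
import random
from collections import deque


class MixedReplayBuffer:
    """Keeps agent and expert transitions apart so each batch can mix them."""

    def __init__(self, capacity: int = 10_000, seed: int = 0):
        self.agent = deque(maxlen=capacity)
        self.expert = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition, from_expert: bool = False) -> None:
        (self.expert if from_expert else self.agent).append(transition)

    def sample(self, batch_size: int, expert_ratio: float) -> list:
        # Number of expert samples requested, capped by what is stored.
        n_expert = min(round(batch_size * expert_ratio), len(self.expert))
        n_agent = min(batch_size - n_expert, len(self.agent))
        batch = (self.rng.sample(list(self.expert), n_expert)
                 + self.rng.sample(list(self.agent), n_agent))
        self.rng.shuffle(batch)
        return batch


def expert_ratio_schedule(success_rate: float,
                          threshold: float = 0.5,
                          max_ratio: float = 0.5) -> float:
    """Hypothetical schedule: more expert data the further below threshold."""
    if success_rate >= threshold:
        return 0.0
    return max_ratio * (threshold - success_rate) / threshold


if __name__ == "__main__":
    buf = MixedReplayBuffer()
    for i in range(10):
        buf.add(("agent", i))
        buf.add(("expert", i), from_expert=True)
    ratio = expert_ratio_schedule(success_rate=0.25)
    print(ratio, buf.sample(8, ratio))
```

Keeping two separate queues makes the expert-to-agent ratio an explicit sampling parameter rather than a side effect of how many demonstrations happen to be stored.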

Abstract:
This paper focuses on scene reconstruction under nighttime conditions in autonomous driving simulation. Recent methods based on Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved photorealistic modeling in autonomous driving scene reconstruction, but they primarily focus on normal-light conditions. Low-light driving scenes are more challenging to model due to their complex lighting and appearance conditions, which often causes performance degradation of existing methods. To address this problem, this work presents a novel approach that integrates physically based rendering into 3DGS to enhance nighttime scene reconstruction for autonomous driving. Specifically, our approach integrates physically based rendering into composite scene Gaussian representations and jointly optimizes Bidirectional Reflectance Distribution Function (BRDF) based material properties. We explicitly model diffuse components through a global illumination module and specular components by anisotropic spherical Gaussians. As a result, our approach improves reconstruction quality for outdoor nighttime driving scenes, while maintaining real-time rendering. Extensive experiments across diverse nighttime scenarios on two real-world autonomous driving datasets, including nuScenes and Waymo, demonstrate that our approach outperforms the state-of-the-art methods both quantitatively and qualitatively.

Abstract:
While skill-centric approaches leverage foundation models to enhance generalization in compositional tasks, they often rely on fixed skill libraries, limiting adaptability to new tasks without manual intervention. To address this, we propose Uni-Skill, a Unified Skill-centric framework that supports skill-aware planning and facilitates automatic skill evolution. Unlike prior methods that restrict planning to predefined skills, Uni-Skill requests new skill implementations when existing ones are insufficient, ensuring adaptable planning with a self-augmented skill library. To support automatic implementation of diverse skills requested by the planning module, we construct SkillFolder, a VerbNet-inspired repository derived from large-scale unstructured robotic videos. SkillFolder introduces a hierarchical skill taxonomy that captures diverse skill descriptions at multiple levels of abstraction. By populating this taxonomy with large-scale, automatically annotated demonstrations, Uni-Skill shifts the paradigm of skill acquisition from inefficient manual annotation to efficient offline structural retrieval. Retrieved examples provide semantic supervision over behavior patterns and fine-grained references for spatial trajectories, enabling few-shot skill inference without deployment-time demonstrations. Comprehensive experiments in both simulation and real-world settings verify the state-of-the-art performance of Uni-Skill over existing VLM-based skill-centric approaches, highlighting its advanced reasoning capabilities and strong zero-shot generalization across a wide range of novel tasks.

Abstract:
The challenge of dynamic scenes has long been one of the core issues in the application and generalization of SLAM systems. Traditional visual SLAM systems often rely on depth sensors and prior camera parameters, making it difficult to correct dynamic challenges from arbitrary input images while simultaneously constructing dense maps. Recently, neural network-based methods for two-view point cloud prediction have gained attention, and SLAM systems such as DUST3R and MAST3R have emerged based on this approach. However, these systems face challenges when applied to dynamic scenes and cannot directly use traditional methods for correction, such as semantic masking or optical flow segmentation. To address this issue, we propose MASTD3R-SLAM, a SLAM method specifically designed for dynamic scenes that supports arbitrary video inputs. The method combines fused mask-based processing with coarse-to-fine pointmap alignment and optimization to achieve point cloud-to-pose re-mapping correction, and further performs Gaussian rendering to remove rendering artifacts and suppress dynamic mapping interference. Compared to the original baseline, our approach improves tracking ATE accuracy by more than 90% and successfully restores the correct 3D map.

Abstract:
Crown preparation aims to create an optimal foundation for durable and functional restoration by reshaping the tooth with a cutting tool. Robotic crown preparation has emerged as a promising approach to overcome the inherent limitations of manual procedures, yet challenges remain in achieving efficient cutting path generation, collision-free orientation adjustment and precise cutting path following, since the oral cavity is a confined space with the target tooth tightly surrounded by other teeth. This paper introduces a novel, in-situ automated robotic full crown preparation system comprising (1) Preoperative Path Planning: generating high-efficiency universal cutting paths based on tooth morphological features; (2) Intraoral Collision Avoidance: optimizing the cutting tool's orientation within the constrained oral cavity; (3) MPC-Based Adaptive Control: modulating the path-following feed rate using model predictive control (MPC) according to intraoperative force feedback. The proposed system was thoroughly validated on a human head phantom targeting a permanent tooth to simulate a real clinical scenario, yielding an average root-mean-square (RMS) error (tooth shape after preparation) of 0.17 mm and an overall mean execution time of 347.77 s, achieving a 74.2% improvement in cutting efficiency over state-of-the-art methods. A comparative evaluation against conventional dental guides further demonstrates its technical feasibility and significant potential for clinical translation.

Abstract:
Implicit representations for LiDAR-based Simultaneous Localization and Mapping (SLAM) offer significant advantages in storage efficiency and expressive power over traditional explicit maps. However, a critical limitation of implicit SLAM is its deterministic nature, which prevents the quantification of prediction uncertainty in sparse or noisy conditions. Furthermore, the accuracy of the underlying Signed Distance Field (SDF) is often compromised by systematic errors arising from the angular dependency of LiDAR measurements, where oblique incident angles lead to biased distance estimations and degrade map quality. To address these challenges, this paper introduces a framework that enhances the robustness and accuracy of implicit LiDAR SLAM by integrating uncertainty estimation and an adaptive sampling strategy. We propose a neural network-based approach to learn and predict SDF uncertainty, which is then effectively incorporated into both localization and mapping processes. Concurrently, to mitigate incident angle-induced errors, we develop an adaptive sampling scheme that weights LiDAR rays based on surface normal information. Validation on public datasets and a custom experimental platform demonstrates that our approach outperforms baseline methods in terms of localization, mapping accuracy, and robustness.
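
One simple way to weight LiDAR rays by surface normal information is sketched below. The abstract does not give the weighting function, so the cosine-of-incidence weight, the lower clamp `min_weight`, and the function name `incidence_weights` are assumptions made purely for illustration.

```python
import numpy as np


def incidence_weights(ray_dirs: np.ndarray,
                      normals: np.ndarray,
                      min_weight: float = 0.05) -> np.ndarray:
    """Per-ray weights in [min_weight, 1] from the incident angle.

    ray_dirs, normals: arrays of shape (N, 3); they need not be unit length.
    Assumed rule: weight = |cos(incident angle)|, so rays hitting a surface
    head-on count fully and grazing rays are down-weighted.
    """
    d = ray_dirs / np.linalg.norm(ray_dirs, axis=1, keepdims=True)
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)

    # The absolute value makes the weight independent of normal orientation.
    cos_inc = np.abs(np.sum(d * n, axis=1))

    # Clamp so that oblique rays are suppressed but never discarded outright.
    return np.clip(cos_inc, min_weight, 1.0)


if __name__ == "__main__":
    rays = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
    norms = np.array([[0.0, 0.0, -1.0]] * 3)
    print(incidence_weights(rays, norms))
```

Such weights could scale a per-ray loss term or serve as sampling probabilities; which of the two the paper uses is not stated in the abstract.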

Abstract:
Humanoid Whole-Body Controllers trained with reinforcement learning (RL) have recently achieved remarkable performance, yet many target a single robot embodiment. Variations in dynamics, degrees of freedom (DoFs), and kinematic topology still hinder a single policy from commanding diverse humanoids. Moreover, obtaining a generalist policy that not only transfers across embodiments but also supports richer behaviors (beyond simple walking to squatting, leaning) remains especially challenging. In this work, we tackle these obstacles by introducing EAGLE, an iterative generalist-specialist distillation framework that produces a single unified policy that controls multiple heterogeneous humanoids without per-robot reward tuning. During each cycle, embodiment-specific specialists are forked from the current generalist, refined on their respective robots, and new skills are distilled back into the generalist by training on the pooled embodiment set. Repeating this loop until performance convergence produces a robust Whole-Body Controller validated on robots such as Unitree H1, G1, and Fourier N1. We conducted experiments on five different robots in simulation environments and four in real-world settings. Through quantitative evaluations, EAGLE achieves high tracking accuracy and robustness compared to other methods, marking a step toward scalable, fleet-level humanoid control.

Abstract:
Human assistance in robotics spans several tasks such as navigation, object manipulation, and placement, where a key challenge is selecting target destinations that align with human intentions or preferences. We focus on this challenge in the context of Virtual Placement (VP), the task of identifying all plausible target locations given scene context and human-centric constraints. This differs from traditional placement tasks that typically focus on a single, predefined target location. The VP problem is complex, as it requires both global and local reasoning about the scene's geometry, semantics, and plausibility. To address this gap, we introduce Assistant Placement Aria, the first benchmark to explore diverse aspects of VP, including global, local, and human-centric constraints. It contains both synthetic and real indoor scenes annotated for three tasks: (i) 2D Panel Placement, (ii) Sitting Suggestion, and (iii) TV Placement. Each scene includes 2D images, a 3D point cloud, and a textual description of the objects within the scene. By contributing this benchmark, we aim to encourage further research in this underexplored and challenging field that is critically dependent on relevant data. We also evaluate several foundation models for object detection and segmentation on our benchmark.

Abstract:
Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training.
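
The step from 3D object flow to a relative rigid-object pose can be illustrated with a least-squares rigid alignment (the Kabsch algorithm). This is only a sketch under stated assumptions: the abstract does not say which estimator NovaFlow uses, and the function name `rigid_transform_from_flow` and the sample points are hypothetical.

```python
import numpy as np


def rigid_transform_from_flow(points_start, points_end):
    """Least-squares rigid transform (R, t) such that points_end ~= R @ p + t.

    Kabsch algorithm; an assumed stand-in, not the estimator from the paper.
    """
    p = np.asarray(points_start, dtype=float)
    q = np.asarray(points_end, dtype=float)
    p_mean = p.mean(axis=0)
    q_mean = q.mean(axis=0)
    # Cross-covariance of the centered point sets (3 x 3).
    h = (p - p_mean).T @ (q - q_mean)
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0.0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = q_mean - rotation @ p_mean
    return rotation, translation


# Hypothetical object flow: five tracked points before and after a known motion.
angle = np.pi / 6
true_rotation = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
true_translation = np.array([0.10, -0.20, 0.30])
start = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])
end = start @ true_rotation.T + true_translation

rotation, translation = rigid_transform_from_flow(start, end)
```

With noise-free correspondences the recovered `rotation` and `translation` match the motion used to generate `end`; with real flow, outlier rejection would normally be needed before this step.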

Abstract:
Efficient brain functional training with rehabilitation robots has been an important and challenging topic in the human-machine interaction (HMI) field. Adjusting the interaction and gaming behaviors between human and machine to effectively activate the brain's functional behavior is still a substantial challenge. In this paper, we take visual-attention training as an example and propose a novel human-machine co-gaming interaction framework that integrates a dual-task gaming paradigm and a human-machine gaming strategy. It exploits the gaming characteristics of HMI behaviors and tasks to effectively and precisely activate the human's active attention and passive attention for training. Specifically, we design a gaze-driven dual-task gaming paradigm to co-activate the active and passive attention-network competition for systematically engaging human visual-attention allocation and training. We further develop a reinforcement-learning-based human-machine gaming strategy to adjust the task parameters for improving the attention training efficiency. We then conduct an experimental study with 8 healthy participants, jointly analyzing participants' EEG and eye-tracking data throughout the training process. Results show that our method improves brain engagement by an average of 15.6% over the widely employed staircase strategy.

Abstract:
Wireless capsule endoscopy (WCE) has transformed gastrointestinal (GI) diagnostics by enabling noninvasive visualization of the digestive tract, yet its diagnostic yield remains constrained by the absence of biopsy capability, as histological analysis is still the gold standard for confirming disease. Conventional biopsy using forceps, needles, or rotating blades is invasive, limited in reach, and carries risks of perforation or mucosal trauma, while fluid- or microbiota-sampling capsules cannot provide structured tissue for pathology, leaving a critical gap in swallowable biopsy solutions. Here we present the Kiri-Capsule, a kirigami-inspired capsule robot that integrates deployable PI-film flaps actuated by a compact dual-cam mechanism to achieve minimally invasive and repeatable tissue collection. The kirigami surface remains flat during locomotion but transforms into sharp protrusions upon cam-driven stretching, enabling controlled penetration followed by rotary scraping, with specimens retained in internal fan-shaped cavities. Bench tests confirmed that PI films exhibit a Young's modulus of approximately 20 MPa and stable deployment angles (about 34° at 15% strain), while ex vivo porcine studies demonstrated shallow penetration depths (median ~0.61 mm, range 0.46--0.66 mm) and biopsy yields comparable to standard forceps (mean ~10.9 mg for stomach and ~18.9 mg for intestine), with forces within safe ranges reported for GI biopsy. These findings demonstrate that the Kiri-Capsule bridges passive imaging and functional biopsy, providing a swallowable, depth-controlled, and histology-ready solution that advances capsule-based diagnostics toward safe and effective clinical application.

Abstract:
Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although they are successful in bringing internet-scale semantic understanding to robotics, existing VLAs are primarily "stateless" and struggle with memory-dependent long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, as well as on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.

Abstract:
Robotic laparoscopic surgery has gained increasing attention in recent years for its potential to deliver more efficient and precise minimally invasive procedures. However, adoption of surgical robotic platforms remains largely confined to high-resource medical centers, exacerbating healthcare disparities in rural and low-resource regions. To close this gap, a range of solutions has been explored, from remote mentorship to fully remote telesurgery. Yet, the practical deployment of surgical robotic systems to underserved communities remains an unsolved challenge. Humanoid systems offer a promising path toward deployability, as they can directly operate in environments designed for humans -- including operating rooms -- without extensive infrastructure modifications. In this work, we introduce LapSurgie, the first humanoid-robot-based laparoscopic teleoperation framework. The system leverages an inverse-mapping strategy for manual-wristed laparoscopic instruments that abides by remote center-of-motion constraints, enabling precise hand-to-tool control of off-the-shelf surgical laparoscopic tools without additional setup requirements. A control console equipped with a stereo vision system provides real-time visual feedback. Finally, a comprehensive user study across platforms demonstrates the effectiveness of the proposed framework and provides initial evidence for the feasibility of deploying humanoid robots in laparoscopic procedures.

Abstract:
Current cross-view localization methods predominantly rely on satellite imagery as the aerial modality. Although recent work explores planimetric maps (e.g., OpenStreetMap tiles), these approaches often lag in performance. Yet both modalities are widely available and possess complementary properties. Satellite images are closer to ground-level camera imagery, offering finer detail, whereas planimetric maps contain annotated objects (e.g., streetlamps) and remain informative in areas where the ground is occluded, such as by foliage. Despite this, only one prior work provides an end-to-end method to fuse the two modalities, and it does not demonstrate their potential within state-of-the-art methods. To combine the strengths of both modalities, we propose a new fusion module that augments standard encoders and demonstrate that integrating satellite imagery with planimetric maps improves state-of-the-art single-modality methods. The module comprises (i) cross-modal conditioning, which processes each modality's encoding with awareness of the other, and (ii) a patch-level fusion rule that controls the granularity of information exchange. We achieve state-of-the-art results, reducing the mean localization error by 30.13%. Qualitatively, the fusion adaptively selects the more informative modality, improving overall accuracy.

Abstract:
The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency. We demonstrate concurrent support for 256 clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit cobalt-teleop.github.io for more details.

Abstract:
This paper examines the problem of coordinating the observations of multiple agents constrained to periodic trajectories that communicate asynchronously with a central planner. We are motivated by settings such as active monitoring missions tracking stochastic and spatially spreading events like wildfires or flooding, where a rapid response is essential and the spatial extent can be large. In such cases, "always-on" networking may be infeasible and continuous coordination may be prohibitively costly. Periodic trajectories are a natural constraint for relevant classes of systems, e.g., UAV swarms that cycle around recharging stations or Earth observation satellite constellations; moreover, these lead to recurring communication opportunities with compute-capable infrastructure. We introduce the Multi-Agent Asynchronous Periodic Partially Observable MDP (MA-APPOMDP), a new planning framework that formalizes asynchronous check-in times and centralized but delayed information flow. We propose two algorithms tailored to this new model: the Asynchronous Belief Branching Algorithm (ABBA), which performs exact belief branching over unknown observations, and SB-ABBA, a sampling-based approximation where scalability is prioritized over exactness. Empirical results on different wildfire event monitoring problems show that our methods consistently achieve higher event coverage and lower detection delay than several heuristic and planning baselines, with SB-ABBA scaling to larger problem instances.

Abstract:
Sub-30 g nano-sized aerial robots can leverage their agility and form factor to explore cluttered and narrow environments, as in industrial inspection and search and rescue missions. However, the price for their tiny size is a strong limit on their resources, i.e., sub-100 mW microcontroller units (MCUs) delivering ~100 GOps/s at best, and memory budgets well below 100 MB. Despite these strict constraints, we aim to enable complex vision-based tasks aboard nano-drones, such as dense 3D scene reconstruction: a key robotic task underlying fundamental capabilities like spatial awareness and motion planning. Top-performing 3D reconstruction methods leverage neural radiance fields (NeRF) models, which require GBs of memory and massive computation, usually delivered by high-end GPUs consuming 100s of Watts. Our work introduces Tiny-DroNeRF, a lightweight NeRF model, based on Instant-NGP, and optimized for running on a GAP9 ultra-low-power (ULP) MCU aboard our nano-drones. Then, we further empower our Tiny-DroNeRF by leveraging a collaborative federated learning scheme, which distributes the model training among multiple nano-drones. Our experimental results show a 95% reduction in Tiny-DroNeRF's memory footprint compared to Instant-NGP with only a 17% drop in reconstruction accuracy. Finally, our federated learning scheme allows Tiny-DroNeRF to train with an amount of data otherwise impossible to keep in a single drone's memory, increasing the overall reconstruction accuracy. Ultimately, our work combines, for the first time, NeRF training on a ULP MCU with federated learning on nano-drones.

Abstract:
Learning versatile whole-body skills by tracking various human motions is a fundamental step toward general-purpose humanoid robots. This task is particularly challenging because a single policy must master a broad repertoire of motion skills while ensuring stability over long-horizon sequences. To this end, we present VMS, a unified whole-body controller that enables humanoid robots to learn diverse and dynamic behaviors within a single policy. Our framework integrates a hybrid tracking objective that balances local motion fidelity with global trajectory consistency, and an Orthogonal Mixture-of-Experts (OMoE) architecture that encourages skill specialization while enhancing generalization across motions. A segment-level tracking reward is further introduced to relax rigid step-wise matching, enhancing robustness when handling global displacements and transient inaccuracies. We validate VMS extensively in both simulation and real-world experiments, demonstrating accurate imitation of dynamic skills, stable performance over minute-long sequences, and strong generalization to unseen motions. These results highlight the potential of VMS as a scalable foundation for versatile humanoid whole-body control. The project page is available at kungfubot2-humanoid.github.io.

Abstract:
We explore a navigation problem for a simple robot with extremely noisy sensing and significant movement uncertainty. We are particularly interested in environments containing large regions in which relatively little distinguishing sensor information is available to assist with localization. This paper proposes a navigation algorithm for this setting that strategically directs the robot through such regions when possible, but with a careful view of the need to regain relatively accurate localization at certain points in the execution. Reasoning directly about the robot's uncertainty, the approach utilizes a local entropy metric to identify regions where sensors have strong informative value. This metric informs the selection of coarse global paths that guide a more precise local planner. We discuss an implementation of this algorithm, and provide simulation results demonstrating its effectiveness in spite of large errors in both sensing and actuation.
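
As a rough illustration of how an entropy measure can separate informative regions from featureless ones, the sketch below compares the Shannon entropy of a discrete position belief before and after a Bayes update. The abstract does not define its local entropy metric, so the four-cell belief, the likelihood values, and the function names here are all assumptions.

```python
import math


def entropy(belief):
    """Shannon entropy (in bits) of a discrete belief over cells."""
    return -sum(p * math.log2(p) for p in belief if p > 0.0)


def bayes_update(belief, likelihood):
    """Posterior belief after one sensor reading with per-cell likelihoods."""
    unnormalized = [p * l for p, l in zip(belief, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical four-cell world with a uniform prior over the robot's cell.
prior = [0.25, 0.25, 0.25, 0.25]

# Assumed sensor models: a reading that is much more likely in cell 0,
# and a reading that looks the same everywhere (a featureless region).
informative_likelihood = [0.9, 0.1, 0.1, 0.1]
featureless_likelihood = [0.5, 0.5, 0.5, 0.5]

entropy_before = entropy(prior)
entropy_informative = entropy(bayes_update(prior, informative_likelihood))
entropy_featureless = entropy(bayes_update(prior, featureless_likelihood))
```

The informative reading lowers the entropy of the belief, while the featureless reading leaves it unchanged; a planner could use such a difference to decide where relocalization is likely to succeed.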

Abstract:
Robotic assembly presents a long-standing challenge due to its requirement for precise, contact-rich manipulation. While simulation-based learning has enabled the development of robust assembly policies, their performance often degrades when deployed in real-world settings due to the sim-to-real gap. Conversely, real-world reinforcement learning (RL) methods avoid the sim-to-real gap, but rely heavily on human supervision and lack generalization ability to environmental changes. In this work, we propose a hybrid approach that combines a simulation-trained base policy with a real-world residual policy to efficiently adapt to real-world variations. The base policy, trained in simulation using low-level state observations and dense rewards, provides strong priors for initial behavior. The residual policy, learned in the real world using visual observations and sparse rewards, compensates for discrepancies in dynamics and sensor noise. Extensive real-world experiments demonstrate that our method, SPARR, achieves near-perfect success rates across diverse two-part assembly tasks. Compared to the state-of-the-art zero-shot sim-to-real methods, SPARR improves success rates by 38.4% while reducing cycle time by 29.7%. Moreover, SPARR requires no human expertise, in contrast to the state-of-the-art real-world RL approaches that depend heavily on human supervision. Please visit the project webpage at https://research.nvidia.com/labs/srl/projects/sparr/
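
One common way to combine a base policy with a residual policy is to add a scaled residual to the base action and clip the result to the actuator limits. The sketch below assumes that additive form; the abstract does not state how SPARR composes the two policies, and `combine_actions`, `residual_scale`, and `limit` are hypothetical names.

```python
def combine_actions(base_action, residual_action, residual_scale=0.5, limit=1.0):
    """Add a scaled residual correction to the base action, then clip.

    Assumed additive composition; not confirmed by the source.
    """
    combined = []
    for base, residual in zip(base_action, residual_action):
        value = base + residual_scale * residual
        # Keep each action dimension inside [-limit, limit].
        combined.append(max(-limit, min(limit, value)))
    return combined


# Hypothetical 3-D action: the residual nudges the simulation-trained output.
action = combine_actions([0.5, -0.2, 0.9], [0.1, 0.1, 0.4])
```

Scaling the residual keeps early real-world exploration close to the base policy's behavior, which is one reason this composition is often chosen.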

Abstract:
Bimanual manipulation is essential for advanced robotic systems because it offers higher efficiency and flexibility compared to single-arm configurations. However, existing approaches either lack inter-arm interaction or ignore the need for a dynamic division of labor, treating the arms as functionally equivalent. To address these limitations, this paper draws inspiration from human bimanual manipulation, where one arm handles core operations and the other provides auxiliary support, and proposes PA-BiCoop, a new single-model bimanual cooperation framework with dynamic primary-auxiliary arm differentiation. PA-BiCoop categorizes robotic arms into primary and auxiliary arms with adaptively adjustable roles across task stages and employs two specialized decoders that share a global feature encoder: the primary decoder generates the primary arm's base-coordinate pose and core-task affordance heatmaps, and the auxiliary decoder outputs the auxiliary arm's relative pose in the primary arm's coordinate system. Moreover, we design a dynamic role assignment module to automatically map roles to left/right arms without manual pre-definition. This design facilitates inter-arm knowledge sharing and coordinated manipulation. Extensive experiments demonstrate that our PA-BiCoop achieves superior performance: it outperforms state-of-the-art baselines by 48% on average in RLBench2 simulation tasks and by over 50% on average in real-world tasks, thereby verifying its effectiveness and advancement in bimanual manipulation.
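
When one arm's pose is predicted relative to the other arm, a controller that works in the base frame has to compose the two transforms. The sketch below shows that composition with 4x4 homogeneous matrices; the pose values and helper names are hypothetical, and the abstract does not describe how PA-BiCoop performs this conversion.

```python
import math


def make_pose(yaw, x, y, z):
    """4x4 homogeneous transform: rotation about z by `yaw`, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, x],
        [s, c, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def compose(a, b):
    """Matrix product a @ b for 4x4 nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def auxiliary_pose_in_base(base_T_primary, primary_T_auxiliary):
    """Express the auxiliary pose in the base frame."""
    return compose(base_T_primary, primary_T_auxiliary)


# Hypothetical poses: primary gripper rotated 90 degrees about z in the base
# frame; auxiliary gripper 0.2 m along the primary gripper's x axis.
base_T_primary = make_pose(math.pi / 2, 0.5, 0.0, 0.3)
primary_T_auxiliary = make_pose(0.0, 0.2, 0.0, 0.0)
base_T_auxiliary = auxiliary_pose_in_base(base_T_primary, primary_T_auxiliary)
```

Here the 0.2 m offset along the primary x axis becomes a 0.2 m offset along the base y axis, so the auxiliary position in the base frame is approximately (0.5, 0.2, 0.3).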

Abstract:
Precision assembly tasks like peg-in-hole remain challenging for robotic manipulation. While visual servoing offers a robust framework, it depends heavily on accurate calibration and manual feature engineering. Learning-based methods, including vision-language models (VLMs), provide strong semantic understanding but often lack the precision needed for high-tolerance, contact-rich insertions. This paper introduces a novel framework that combines the semantic reasoning of large language models (LLMs) with adaptive visual servoing to bridge this gap. Our approach uses an LLM as a semantic feature extractor and correspondence engine for stereo visual servoing. The LLM processes generic point features from uncalibrated stereo images along with a task description in natural language, leveraging its spatial understanding to identify and correspond optimal features across views. These features drive a stereo adaptive visual servoing controller that estimates unknown calibration parameters online, enabling precise, calibration-free positioning. Extensive evaluations on cylindrical, square, and hexagonal peg-in-hole tasks across three trials demonstrate average success rates above 90% with steady-state errors of 1.8--2.8 pixels, closely comparable to calibrated methods (1.2--2.5 pixels). This is achieved without requiring prior models, calibration, or task-specific training, thereby advancing flexible and precise robotic assembly.

Abstract:
Robots operating in human-centric environments must be both robust to disturbances and provably safe from collisions. Achieving these properties simultaneously and efficiently remains a central challenge. While Dynamic Movement Primitives (DMPs) offer inherent stability and generalization from single demonstrations, they lack formal safety guarantees. Conversely, formal methods like Control Barrier Functions (CBFs) provide provable safety but often rely on computationally expensive, real-time optimization, hindering their use in high-frequency control. This paper introduces SafeDMPs, a novel framework that resolves this trade-off. We integrate the closed-form efficiency and dynamic robustness of DMPs with a provably safe, non-optimization-based control law derived from Spatio-Temporal Tubes (STTs). This synergy allows us to generate motions that are not only robust to perturbations and adaptable to new goals, but also guaranteed to avoid static and dynamic obstacles. Our approach achieves a closed-form solution for a problem that traditionally requires online optimization. Experimental results on a 7-DOF robot manipulator demonstrate that SafeDMPs is orders of magnitude faster and more accurate than optimization-based baselines, making it an ideal solution for real-time, safe, and collaborative robotics.

Abstract:
High-quality teleoperated demonstrations are a primary bottleneck for imitation learning (IL) in dexterous manipulation. Haptic feedback provides operators with real-time contact information, enabling real-time finger posture adjustments and thereby improving demonstration quality. However, existing dexterous teleoperation platforms typically omit haptic feedback and remain bulky and expensive. We introduce CDF-Glove, a lightweight and low-cost cable-driven force-feedback glove. The real-time state is available for 20 finger degrees of freedom (DoF), of which 16 are directly sensed and 4 are passively coupled (inferred from kinematic constraints). We develop a kinematic model and control stack for the glove, and validate them across multiple robotic hands with diverse kinematics and DoF. The CDF-Glove achieves distal joint repeatability of 0.4 degrees and delivers force feedback with about 200 ms latency, yielding a 4x improvement in task success rate relative to no-feedback teleoperation. We collect two bimanual teleoperation datasets, on which we train and evaluate Diffusion Policy baselines. Compared to kinesthetic teaching, the policies trained on our teleoperated demonstrations increase the average success rate by 55% and reduce the mean completion time by approximately 15.2 seconds (a 47.2% relative reduction). In particular, the CDF-Glove costs approximately US$230. The code and designs are released as open source at https://cdfglove.github.io/.

Abstract:
We propose Observer-Actor (ObAct), a novel framework for active vision imitation learning in which the observer moves to optimal visual observations for the actor. We study ObAct on a dual-arm robotic system equipped with wrist-mounted cameras. At test time, ObAct dynamically assigns observer and actor roles: the observer arm constructs a 3D Gaussian Splatting (3DGS) representation from three images, virtually explores this to find an optimal camera pose, then moves to this pose; the actor arm then executes a policy using the observer's observations. This formulation enhances the clarity and visibility of both the object and the gripper in the policy's observations. As a result, we enable the training of ambidextrous policies on observations that remain closer to the occlusion-free training distribution, leading to more robust policies. We study this formulation with two existing imitation learning methods -- trajectory transfer and behavior cloning -- and experiments show that ObAct significantly outperforms static-camera setups: trajectory transfer improves by 145% without occlusion and 233% with occlusion, while behavior cloning improves by 75% and 143%, respectively. Videos are available at: https://obact.github.io.

Abstract:
Due to the multi-node and multi-contact motion characteristics of tensegrity robots, existing methods fail to generate feasible reference rolling trajectories, and controllers are also limited to open-loop approaches. To address this issue, we utilize motion decomposition to extract the motion phase that should be the primary focus. Subsequently, we propose a method combining form-finding-based critical configuration search and polynomial trajectories to generate feasible trajectories. Then, an iLQR controller that accounts for reducing actuator load is designed for trajectory tracking control. A key distinction from existing methods is that our approach eliminates the need for reset operations after each rolling cycle. The results of simulations and physical experiments demonstrate that the robot achieves continuous rolling, with improvements of 18.3% in speed and 34.4% in actuation load compared to existing works.

Abstract:
Aerial manipulation (AM) promises to move Unmanned Aerial Vehicles (UAVs) beyond passive inspection to contact-rich tasks such as grasping, assembly, and in-situ maintenance. Most prior AM demonstrations rely on external motion capture (MoCap) and emphasize position control for coarse interactions, limiting deployability. We present a fully onboard perception-control pipeline for contact-rich AM that achieves accurate motion tracking and regulated contact wrenches without MoCap. The main components are (1) an augmented visual-inertial odometry (VIO) estimator with contact-consistency factors that activate only during interaction, tightening uncertainty around the contact frame and reducing drift, and (2) image-based visual servoing (IBVS) to mitigate perception-control coupling, together with a hybrid force-motion controller that regulates contact wrenches and lateral motion for stable contact. Experiments show that our approach closes the perception-to-wrench loop using only onboard sensing, yielding a velocity estimation improvement of 66.01% at contact, reliable target approach, and stable force holding, pointing toward deployable, in-the-wild aerial manipulation.

Abstract:
Deep stereo matching has advanced significantly on benchmark datasets through fine-tuning but falls short of the zero-shot generalization seen in foundation models in other vision tasks. We introduce CogStereo, a novel framework that addresses challenging regions, such as occlusions or weak textures, without relying on dataset-specific priors. CogStereo embeds implicit spatial cognition into the refinement process by using monocular depth features as priors, capturing holistic scene understanding beyond local correspondences. This approach ensures structurally coherent disparity estimation, even in areas where geometry alone is inadequate. CogStereo employs a dual-conditional refinement mechanism that combines pixel-wise uncertainty with cognition-guided features for consistent global correction of mismatches. Extensive experiments on Scene Flow, KITTI, Middlebury, ETH3D, EuRoC, and real-world scenes demonstrate that CogStereo not only achieves state-of-the-art results but also excels in cross-domain generalization, shifting stereo vision towards a cognition-driven approach. More details are available at https://github.com/lhfang228/CogStereo.

Abstract:
To provide a robot with the ability to understand and react to a user's natural language inputs, the natural language must be connected to the robot's underlying representations of the world. Recently, large language models (LLMs) and 3D scene graphs (3DSGs) have become popular choices for grounding natural language and representing the world, respectively. In this work, we address the challenge of using LLMs with 3DSGs to ground natural language. Existing methods encode the scene graph as serialized text and insert it into the LLM's context window, but this encoding does not scale to large or rich 3DSGs. Instead, we propose to use a form of Retrieval Augmented Generation to select a subset of the 3DSG relevant to the task. We encode a 3DSG in a graph database and provide a query language interface (Cypher) as a tool to the LLM with which it can retrieve relevant data for language grounding. Through evaluations on scene question-answering, instruction grounding, and scene graph updating tasks, we compare our approach to existing context window-based methods and a novel code generation method. Our results show that using Cypher as an interface to 3D scene graphs scales significantly better to large, rich graphs, leading to large improvements in grounded language tasks while also substantially reducing the token count of the scene graph content.

Abstract:
Manta rays achieve efficient and maneuverable swimming through flapping of their large pectoral fins, where stiffness plays a critical role in hydrodynamic performance. Most existing manta-ray robots employ fixed or one-dimensional compliance, limiting their ability to replicate the two-dimensional stiffness variation essential for traveling wave propulsion. This paper presents a manta ray-inspired robot equipped with an active stiffness control mechanism that enables reconfigurable, two-dimensional stiffness distributions in its pectoral fins. The design integrates a cable-driven actuation system with anisotropic disks, providing multiple distinct stiffness states that can be locked during operation. Mechanical characterization confirms periodic stiffness variation, with spanwise stiffness increasing by more than 30% and chordwise stiffness decreasing by about 10% as the disk rotates from 0° to 90°, then recovering from 90° to 180°. Robot experiments evaluate the influence of stiffness on fin kinematics, thrust generation, and free-swimming performance. Thrust tests demonstrate that stiffness substantially affects steady-state thrust; under certain conditions, the optimal setting produces up to five times more thrust than the least effective one. Free-swimming trials further reveal that stiffness alters swimming speed, with up to 20% variation observed in low-frequency, large-amplitude flapping. These results highlight the potential of active stiffness control to enhance the performance of bio-inspired underwater robots and provide new insights into the role of structural compliance in aquatic locomotion.

Abstract:
3D Gaussian Splatting (3DGS) has shown promising results for 3D scene modeling using mixtures of Gaussians, yet its existing simultaneous localization and mapping (SLAM) variants typically rely on direct, deterministic pose optimization against the splat map, making them sensitive to initialization and susceptible to catastrophic forgetting as the map evolves. We propose Variational Bayesian Gaussian Splatting SLAM (VBGS-SLAM), a novel framework that couples the splat map refinement and camera pose tracking in a generative probabilistic form. By leveraging conjugate properties of multivariate Gaussians and variational inference, our method admits efficient closed-form updates and explicitly maintains posterior uncertainty over both poses and scene parameters. This uncertainty-aware method mitigates drift and enhances robustness in challenging conditions, while preserving the efficiency and rendering quality of existing 3DGS. Our experiments demonstrate superior tracking performance and robustness in long sequence prediction, alongside efficient, high-quality novel view synthesis across diverse synthetic and real-world scenes.
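
To make the idea of a conjugate, closed-form update concrete, the sketch below performs the textbook Bayesian update of a scalar Gaussian mean with known observation noise. It is a deliberately simplified, assumed example: VBGS-SLAM works with multivariate Gaussians and variational inference, and none of the names or numbers below come from the paper.

```python
def gaussian_mean_posterior(prior_mean, prior_var, observations, noise_var):
    """Closed-form posterior over a Gaussian mean with known noise variance.

    Prior: mean ~ N(prior_mean, prior_var).
    Likelihood: each observation ~ N(mean, noise_var).
    Returns (posterior_mean, posterior_var); no iterative optimization needed.
    """
    n = len(observations)
    # Precisions (inverse variances) add under a conjugate Gaussian prior.
    posterior_precision = 1.0 / prior_var + n / noise_var
    posterior_var = 1.0 / posterior_precision
    posterior_mean = posterior_var * (
        prior_mean / prior_var + sum(observations) / noise_var
    )
    return posterior_mean, posterior_var


# Hypothetical numbers: a vague prior at 0 and two measurements at 2.0.
posterior_mean, posterior_var = gaussian_mean_posterior(
    prior_mean=0.0, prior_var=1.0, observations=[2.0, 2.0], noise_var=1.0
)
```

The posterior variance shrinks as observations accumulate, which is the sense in which such updates maintain explicit uncertainty rather than a single point estimate.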

Abstract:
We introduce RoboEval, a structured evaluation framework and benchmark for robotic manipulation that augments binary success with principled behavioral and outcome metrics. Existing evaluations often collapse performance into outcome counts, masking differences in execution quality and obscuring failure structure. RoboEval provides eight bimanual tasks with systematically controlled variations, more than three thousand expert demonstrations, and a modular simulation platform for reproducible experimentation. All tasks are instrumented with standardized metrics that quantify fluency, precision, and coordination, as well as outcome measures that trace stagewise progress and localize failure modes. Through extensive experiments with state-of-the-art visuomotor policies, we validate these metrics by analyzing their stability under variation, discriminative power across policies with similar success rates, and correlation with task success. Project Page: https://robo-eval.github.io

Abstract:
Automation of suturing subtasks, such as needle handover, has the potential to reduce surgeons' fatigue and improve surgical efficiency. Needle handover is particularly challenging due to the combinatorial nature of grasping and handover strategies, uncertainties in needle pose estimation, and inaccuracies inherent in cable-driven surgical robots such as the da Vinci system. In this work, we present a reinforcement learning framework for needle handover, spanning the process from initial pickup to a desired grasping state. We formulate the task as a goal-oriented planning problem and design a state-action representation that captures grasping and handover configurations. A DQN-based policy is trained with disturbances that reflect real-world kinematic errors to ensure robustness. The learned policy was validated on the da Vinci Research Kit (dVRK) and quantitatively compared with human teleoperation. Results demonstrate that our approach achieves human-level efficiency in terms of handover attempts (1.65 ± 0.50 vs. 1.62 ± 0.55), while improving consistency and joint-limit avoidance. The proposed framework demonstrates the potential of reinforcement learning for safe and reliable automation of surgical handover and points to opportunities for extending autonomy to more complex handover scenarios.

Abstract:
Autonomous mobile robots operating in novel environments depend critically on accurate state estimation, often utilizing visual and inertial measurements. Recent work has shown that an invariant formulation of the extended Kalman filter improves the convergence and robustness of visual-inertial odometry by utilizing the Lie group structure of a robot's position, velocity, and orientation states. However, inertial sensors also require measurement bias estimation, and introducing the bias in the filter state breaks the Lie group symmetry. In this paper, we design a neural network to predict the bias of an inertial measurement unit (IMU) from a sequence of previous IMU measurements. This allows us to use an invariant filter for visual-inertial odometry, relying on the learned bias prediction rather than introducing the bias in the filter state. We demonstrate that an invariant multi-state constraint Kalman filter (MSCKF) with learned bias predictions achieves robust visual-inertial odometry in real experiments, even when visual information is unavailable for extended periods and the system needs to rely solely on IMU measurements.

Abstract:
In the field of autonomous driving, constructing high-precision maps, typically represented as 3D point cloud maps or bird's-eye view (BEV) grid maps, is essential for both offline and online applications. However, the presence of dynamic objects within a scene can introduce artifacts and noise that significantly degrade the quality of these maps. To address this challenge, we propose a method in this paper that can accurately identify those dynamic objects in both online and offline settings. Our approach fully exploits the spatio-temporal attributes of BEV grid maps and utilizes a point-grid-point (PGP) scheme to identify moving objects at both the 3D point cloud level and the 2D BEV grid level. Experimental results from public datasets, as well as a self-collected dataset, demonstrate that our method consistently outperforms state-of-the-art approaches in dynamic object removal in both online and offline contexts. The code and the newly introduced dataset will be made publicly available at: https://anonymous.4open.science/r/PGP-DOR-0686.

Abstract:
Wearable devices like exoskeletons are designed to reduce excessive loads on specific joints of the body. Specifically, single- or two-degrees-of-freedom (DOF) upper-body industrial exoskeletons typically focus on compensating for the strain on the elbow and shoulder joints. However, during daily activities, there is no assurance that external loads are correctly aligned with the supported joints. Optimizing work processes to ensure that external loads are primarily (to the extent that they can be compensated by the exoskeleton) directed onto the supported joints can significantly enhance the overall usability of these devices and the ergonomics of their users. Collaborative robots (cobots) can play a role in this optimization, complementing the collaborative aspects of human work. In this study, we propose an adaptive and coordinated control system for the human-cobot-exoskeleton interaction. This system adjusts the task coordinates to maximize the utilization of the supported joints. When the torque limits of the exoskeleton are exceeded, the framework continuously adapts the task frame, redistributing excessive loads to non-supported body joints to prevent overloading the supported ones. We validated our approach in an equivalent industrial painting task involving a single-DOF elbow exoskeleton, a cobot, and four subjects, each tested in four different initial arm configurations with five distinct optimisation weight matrices and two different payloads.

Abstract:
Vector fields handle nonholonomic motion planning as they provide reference orientation for robots. However, additionally incorporating curvature constraints becomes challenging, due to the interconnection between the design of the curvature-bounded vector field and the tracking controller under underactuation. In this paper, we present a novel framework to co-develop the vector field and the control laws, guiding the nonholonomic robot to the target configuration along a curvature-bounded trajectory. First, we construct a curvature-constrained vector field (CVF) via blending and distributing basic flow fields to provide a curvature-bounded reference trajectory. Next, we propose saturated control laws with a dynamic gain. Under the control laws, kinematically constrained nonholonomic robots are guaranteed to track the reference CVF and converge to the target positive limit set with bounded trajectory curvature. Numerical simulations show that the proposed CVF method outperforms other vector-field-based algorithms. Experiments on Ackermann UGVs and semi-physical fixed-wing UAVs demonstrate that the method can be effectively implemented in real-world scenarios.

Abstract:
This work proposes a novel data-driven anomaly detection framework for robotic systems, grounded in statistical concentration inequalities. The method leverages the Matrix Chernoff Inequality to establish probabilistic bounds on the eigenvalues of cumulative error covariance matrices computed over a sliding window of robot state deviations. An anomaly is flagged when the eigenvalues, computed in real time, violate these theoretical bounds. The proposed approach is model independent, computationally efficient, and straightforward to implement, requiring only the numerical solution of two transcendental equations to determine the bounds. It further offers design flexibility via tunable parameters such as the confidence level and window size. The effectiveness of the detector is validated through both simulation and hardware experiments across distinct anomaly scenarios for different robots, including input delay, sensor corruption, and external perturbations. A comprehensive performance evaluation is also presented using standard metrics such as Detection Rate, False Positive Rate, Accuracy, and Receiver Operating Characteristics (ROC), along with a method for effective parameter selection and comparison with existing works.
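
The bound computation described above can be illustrated with a minimal sketch. It assumes the standard form of the Matrix Chernoff inequality for sums of independent positive semidefinite matrices, which may differ from the exact equations used in the paper. All names here (`eigenvalue_bounds`, `is_anomalous`, the dimension `d`, the expected extreme eigenvalues `mu_min` and `mu_max`, the per-sample eigenvalue cap `R`, and the failure probability `eta`) are hypothetical and chosen for illustration only.

```python
import math


def _bisect_decreasing(g, lo, hi, iters=200):
    """Root of a decreasing function g on [lo, hi], assuming g(lo) > 0 >= g(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def log_upper_tail(delta, d, mu_max, R):
    """log of d * [e^delta / (1+delta)^(1+delta)]^(mu_max / R)."""
    return math.log(d) + (mu_max / R) * (delta - (1.0 + delta) * math.log1p(delta))


def log_lower_tail(delta, d, mu_min, R):
    """log of d * [e^-delta / (1-delta)^(1-delta)]^(mu_min / R), for delta in [0, 1)."""
    return math.log(d) + (mu_min / R) * (-delta - (1.0 - delta) * math.log1p(-delta))


def eigenvalue_bounds(d, mu_min, mu_max, R, eta):
    """Solve the two transcendental equations tail(delta) = eta numerically.

    Returns (lower, upper, delta_lo, delta_up), where the smallest eigenvalue
    stays above `lower` and the largest stays below `upper`, each with
    probability at least 1 - eta under the assumed inequality.
    """
    log_eta = math.log(eta)

    # Upper bound: grow the bracket until the tail drops below eta, then bisect.
    hi = 1.0
    while log_upper_tail(hi, d, mu_max, R) > log_eta:
        hi *= 2.0
    delta_up = _bisect_decreasing(
        lambda t: log_upper_tail(t, d, mu_max, R) - log_eta, 0.0, hi
    )

    # Lower bound: a nontrivial root exists only if mu_min / R > log(d / eta).
    if mu_min / R <= math.log(d / eta):
        delta_lo = 1.0  # trivial bound: eigenvalues are only known to be >= 0
    else:
        delta_lo = _bisect_decreasing(
            lambda t: log_lower_tail(t, d, mu_min, R) - log_eta, 0.0, 1.0 - 1e-12
        )

    return (1.0 - delta_lo) * mu_min, (1.0 + delta_up) * mu_max, delta_lo, delta_up


def is_anomalous(lam_min, lam_max, lower, upper):
    """Flag a window whose extreme eigenvalues leave the theoretical band."""
    return lam_min < lower or lam_max > upper
```

In use, `lam_min` and `lam_max` would be the extreme eigenvalues of the cumulative error covariance over the current sliding window, and a smaller `eta` widens the band and lowers the false positive rate at the cost of sensitivity.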

Abstract:
Artificial muscles for soft robots and human-interactive machines must deliver high force, fast dynamics, and large stroke (tens of percent strain) while remaining mechanically compliant. Sliding electrostatic film actuators enable compact multilayer integration without lateral expansion and provide large stroke (�?0% contraction) by generating shear forces between ultrathin (�?0150 µm) slider and stator films. However, in multi-stack configurations, force additivity degrades under curvature or misalignment due to uneven electrode overlap. Here, we introduce an electrostatic linear film actuator with passive mechanical phase switching via patterned brush contacts. As the slider moves, each stator electrode is locally assigned HV or ground at the correct position, maintaining force direction under bending and stretch, and enabling operation from a single DC source.

Abstract:
Enabling robots to imitate artists who observe objects in the 3D world and physically draw them in a specific style is a challenging problem. Shuimohua (Sumi-e) is a typical non-photorealistic oriental ink painting art that uses simple ink, water and brush to paint and convey poetic imagery. Although digital rendering of non-photorealistic paintings has been extensively studied, physical rendering of stylized painting from 3D space is still a challenge because it requires considering the abstraction process of vectorized strokes with various styles in a physical environment. In this paper, we propose a robotic rendering approach for physically drawing oriental ink painting from 3D shapes, and aim to mimic the painting process of a real-world artist using binocular vision to estimate 3D scenes. Following artists' drawing habits, we first extract expressive contours as drawing outlines and vectorize the contours to polylines. Next, the polylines are converted to line strokes with varying thicknesses by considering clamped B-spline fitting and isophote distances. In order to generate a typical dot shading effect, an oriented Poisson disk sampling approach is proposed to create dot strokes to depict the internal features of the 3D models. Finally, we build an ink gradient model and map the coordinates of the strokes to a robotic arm, and a control method of guide rails is proposed for robotic painting on large canvases.

Abstract:
Precise light-dose delivery is essential for photodynamic therapy (PDT), yet current handheld systems remain operator-dependent and lose accuracy under motion. We present a SLAM-guided, closed-loop control framework that enables co-temporal and co-spatial photodynamic diagnosis (PDD) and PDT with a single handheld endomicroscopic probe, while enforcing pixel-level dose control. The probe integrates a fiber bundle that shares a common optical path for both PDD and PDT and is paired with a digital micromirror device (DMD) for µm-scale pattern projection. An extended Kalman filter fuses optical-tracking measurements with texture-limited endomicroscopic images at 30 Hz, providing robust six-degree-of-freedom pose estimates that expand the probe's effective field of view and drive real-time pattern updates. A dose-map SLAM algorithm accumulates light dose over the reconstructed lesion surface during handheld scanning, while pixel-level dose control is enforced by referencing previously accumulated light at each location. Quantitative evaluation shows a spatial registration error between diagnostic and therapeutic systems within 5.2 µm. Experiments on fluorescence phantoms achieved sub-millimeter localization accuracy (0.3 mm RMSE), significantly outperforming vision-only and tracker-only baselines. Finally, tests on targets with quadrant-specific dose limits confirmed SLAM-based dose control, achieving dose uniformity within ±0.186 mJ/cm² across millimeter-scale regions.

Abstract:
Event cameras capture asynchronous pixel-level brightness changes with microsecond temporal resolution, offering unique advantages for high-speed vision tasks. Existing methods often convert event streams into intermediate representations such as frames, voxel grids, or point clouds, which inevitably require predefined time windows and thus introduce window latency. Meanwhile, pointwise detection methods incur a high computational cost that prevents real-time operation. To overcome these limitations, we propose the Variable-Rate Spatial Event Mamba, a novel architecture that directly processes raw event streams without intermediate representations. Our method introduces a lightweight causal spatial neighborhood encoder to efficiently capture local geometric relations, followed by Mamba-based state space models for scalable temporal modeling with linear complexity. During inference, a controller adaptively adjusts the processing speed according to the event rate, achieving an optimal balance between window latency and inference latency.

Abstract:
Therapist-in-the-loop robotic rehabilitation has shown promise to enhance rehabilitation outcomes by integrating the strengths of therapists and robotic systems. However, its broader adoption is limited due to insufficient interaction and limited adaptation capability. This article proposes a novel telerobotics-mediated framework that enables therapists to deliver assist-as-needed (AAN) therapy based on two primary contributions. First, the reference motion for movement therapy is generated to encourage the active participation of patients based on their motion preferences encoded using a probabilistic model. Second, the telerobotics-mediated system enables the therapist to specify via-points, providing minimal but effective assistance for AAN therapy by partially deforming the reference motion. The effectiveness of the proposed strategy was validated on a telerobotic system through two representative rehabilitation tasks, demonstrating its potential for remote AAN therapy.

Abstract:
Navigating within narrow spaces is a fundamental challenge in robotics, requiring precise localisation, localisation error recovery, dynamic path planning, and adaptive control for effective manoeuvring. This paper presents a modular and perception-driven navigation framework designed for constrained environments, focusing primarily on agricultural applications. The proposed method integrates a multi-step point cloud processing pipeline for robust local perception, including pole detection, boundary line estimation, and trajectory refinement to ensure safe and precise traversal by refining initial trajectories based on detected environmental constraints and dynamically adapting to kinematic limitations. Experimental validation in a real strawberry polytunnel demonstrates superior trajectory accuracy and control stability compared to state-of-the-art navigators, achieving an average lateral deviation of 0.08 ± 0.01 m. The adaptive trajectory tracking and regulated pure pursuit control of the framework contribute to consistent navigation, even under increased velocity constraints, outperforming the resilient timed elastic band (RTEB) and model predictive path integral (MPPI) methods. This modular and generalisable framework offers significant potential for advancing autonomous navigation in narrow-space applications.
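
The regulated pure pursuit control mentioned above can be illustrated with a minimal sketch of the classic pure pursuit curvature law plus a simple curvature-based speed regulation. This is a simplified illustration under stated assumptions, not the controller implemented in the paper. The function name `regulated_pure_pursuit`, the `min_radius` threshold, and the `v_min` floor are hypothetical.

```python
def regulated_pure_pursuit(lookahead_xy, v_desired, min_radius=0.9, v_min=0.05):
    """Return (linear velocity, angular velocity) toward a lookahead point.

    lookahead_xy is the lookahead point in the robot frame
    (x forward, y to the left), in metres.
    """
    x, y = lookahead_xy
    dist_sq = x * x + y * y
    if dist_sq == 0.0:
        raise ValueError("lookahead point coincides with the robot")

    # Classic pure pursuit: curvature of the arc through the robot and the point.
    curvature = 2.0 * y / dist_sq

    # Regulation (assumed form): slow down linearly on turns tighter than min_radius.
    v = v_desired
    if curvature != 0.0:
        radius = 1.0 / abs(curvature)
        if radius < min_radius:
            v = max(v_min, v_desired * radius / min_radius)

    return v, v * curvature
```

A point straight ahead yields zero angular velocity at the desired speed, while a point that demands a turn tighter than `min_radius` lowers the linear speed before the angular velocity is computed, which is the behaviour that helps in narrow rows.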

Abstract:
VINGS-Mono is a monocular inertial Gaussian Splatting (GS) SLAM framework designed for large-scale scenes. It integrates four main components: VIO Front End, 2D Gaussian Map, NVS Loop Closure, and Dynamic Eraser. The VIO Front End processes RGB frames with dense bundle adjustment and uncertainty estimation to extract scene geometry and poses. The mapping module incrementally builds a 2D Gaussian map with up to 50 million Gaussian ellipsoids. Key components like a Sample-based Rasterizer, Score Manager, and Pose Refinement enhance mapping efficiency and localization accuracy for large-scale urban environments. To ensure global consistency, the NVS Loop Closure uses Novel View Synthesis for loop detection and map correction, while the Dynamic Eraser addresses dynamic objects in outdoor scenes. Evaluations demonstrate localization performance comparable to Visual-Inertial Odometry and show that VINGS-Mono surpasses GS/NeRF SLAM methods in mapping and rendering. A mobile app further verifies real-time capability, generating high-quality Gaussian maps using a smartphone camera and low-frequency IMU. VINGS-Mono is the first monocular Gaussian SLAM framework for outdoor, kilometer-scale scenes.

Abstract:
Accurate localization in GPS-denied environments remains a critical challenge for autonomous robot navigation. Animals exhibit remarkable navigational abilities in complex, dynamic environments by relying on mental cognitive maps. Inspired by neural representations such as head direction cells and grid cells, numerous robotic cognitive mapping systems can efficiently cover large areas; however, they often lack the precise metric information required for accurate localization. To address this challenge, we propose a neurodynamically driven monocular visual topometric localization approach based on road network constraints. We introduce the Roadnetwork-Constraint Hidden Markov Model (RC-HMM) to enhance semi-metric maps by incorporating road network constraints, forming a coherent topometric map that maintains vertex relationships and improves localization accuracy. Experimental results in the CARLA Town07 environment demonstrate the remarkable efficiency of our topometric cognitive map. Compared to semi-metric maps, our approach achieves a 95% reduction in Absolute Pose Error (APE) and an 81% reduction in Relative Pose Error (RPE). Compared to binocular ORB-SLAM3, our monocular approach reduces CPU usage by 96.7% and map storage by 77.7%, with an APE of 3.6 m and RPE of 1.4 m, closely matching ORB-SLAM3's 3.86 m APE and 0.96 m RPE. Furthermore, by leveraging neurodynamics of grid cells and head direction cells, our monocular topometric localization robustly delivers a localization accuracy of 3.86 meters, comparable to binocular ORB-SLAM3. This approach integrates road network metrics into topological maps, enhancing brain-inspired navigation with topometric maps in complex environments.

Abstract:
Eye gaze-based control interfaces provide a non-invasive means of enhancing human-robot collaboration for activities of daily living and can reduce the cognitive burden on operators performing complex tasks. Eye gaze has traditionally been used for "gaze triggering," where fixating on an object activates pre-programmed robotic movements. In this work, we propose a gaze-based robotic teleoperation approach that utilizes real-time gaze data to guide the freeform movement of robotic manipulators. The proposed approach incorporates a Gaussian Mixture Regression (GMR)-based intent inference model to capture the nonlinear relationship between gaze data and the operator's intended robotic movements. For benchmarking, we further implemented a Gaussian Hidden Markov Model (G-HMM) to provide a comparable probabilistic framework for intent inference. Experimental results demonstrate that the GMR-based approach achieves a statistically significant improvement over G-HMM in control efficiency and in trajectory smoothness against involuntary eye fluctuations, while also enhancing the user's sense of involvement and control.

Abstract:
Pre-defined 3D object templates are widely used in 3D reconstruction of hand-object interactions. However, they often require substantial manual effort to capture or source, and inherently restrict the adaptability of models to unconstrained interaction scenarios, e.g., heavily-occluded objects. To overcome this bottleneck, we propose a new Text-Instructed Generation and Refinement (TIGeR) framework, harnessing the power of intuitive text-driven priors to steer the object shape refinement and pose estimation. We use a two-stage framework: text-instructed prior generation followed by vision-guided refinement. As the name implies, we first leverage off-the-shelf models to generate shape priors according to the text description without tedious 3D crafting. Considering the geometric gap between the synthesized prototype and the real object that the hand interacts with, we further calibrate the synthesized prototype via 2D-3D collaborative attention. TIGeR achieves competitive performance, i.e., 1.979 and 5.468 object Chamfer distance on the widely-used Dex-YCB and Obman datasets, respectively, surpassing existing template-free methods. Notably, the proposed framework shows robustness to occlusion, while maintaining compatibility with heterogeneous prior sources, e.g., retrieved hand-crafted prototypes, in practical deployment scenarios. Our code will be available at https://github.com/huangyiyNUS/TIGeR.

Abstract:
Real-world autonomous driving, particularly in urban environments with numerous corner cases, requires rigorous testing to ensure product safety and robustness. However, few studies have explored integrating adversarial scenario generation with the training of safety agents in closed-loop testing, enabling efficient co-evolution and mutual enhancement of both. To address this challenge, an adversarial behavior knowledge repository is constructed by applying rule-based filtering to an open-source dataset, combined with knowledge retrieval modules tailored for simulation environments. A large language model (LLM) is employed to integrate knowledge-, data-, and adversarial-driven approaches, generating safety-critical traffic scenarios customized to user needs. Additionally, while evaluating the generated scenarios, we employ reinforcement learning models to train the behaviors of different types of vehicles, thereby enriching scenario diversity beyond existing datasets while preserving realism. Experimental results demonstrate that the proposed framework improves the accuracy of domain-specific language generation by 12%. Moreover, the success rate of newly generated scenario transformations increases by 8%, while obstacle-avoidance capability is enhanced by 30%. For the complete manuscript, please refer to: https://zhenhaooo.github.io/PCASim.github.io/

Abstract:
Reconstructing and decomposing dynamic urban scenes is crucial for autonomous driving, urban planning, and scene editing. However, existing methods fail to perform instance-aware decomposition without manual annotations, which is crucial for instance-level scene editing. We propose UnIRe, a 3D Gaussian Splatting (3DGS) based approach that decomposes a scene into a static background and individual dynamic instances using only RGB images and LiDAR point clouds. At its core, we introduce 4D superpoints, a novel representation that clusters multi-frame LiDAR points in 4D space, enabling unsupervised instance separation based on spatiotemporal correlations. These 4D superpoints serve as the foundation for our decomposed 4D initialization, i.e., providing spatial and temporal initialization to train a dynamic 3DGS for arbitrary dynamic classes without requiring bounding boxes or object templates. Furthermore, we introduce a smoothness regularization strategy in both 2D and 3D space, further improving the temporal stability. Experiments on benchmark datasets show that our method outperforms existing methods in decomposed dynamic scene reconstruction while enabling accurate and flexible instance-level editing, making it a practical solution for real-world applications.

Abstract:
Recently, multi-robot systems have gained significant attention for their promise of scalable efficiency, reliability, and cost savings. A crucial capability is collaborative transportation, where a team of robots works together to transport a payload. However, key challenges remain, such as potential conflicts between team-level decisions and individual-level robot controls, team kinematic constraints imposed by the robot-payload coupling, and diverse obstacles encountered in 3D terrain. We present Collaborative Quadruped Transportation with Constrained Diffusion (CQTD), enabling a team of closely coupled quadruped robots to collaboratively transport a payload across 3D terrain. A diffusion-based upper level learns terrain-aware team-level trajectories satisfying team kinematic constraints due to the payload coupling, while a lower level optimizes velocity controls of individual robots satisfying collision and anisotropic velocity constraints. Experiments in high-fidelity simulations and on real-world quadruped robot teams demonstrate that CQTD outperforms baseline methods in challenging 3D terrain scenarios requiring closely-coupled collaboration between the quadruped robots.

Abstract:
Mudskippers are unique amphibious fish capable of locomotion in diverse environments, including terrestrial surfaces, aquatic habitats, and highly viscous substrates such as mud. This versatile locomotion is largely enabled by their powerful tail, which stores and rapidly releases energy to produce impulsive jumps. Inspired by this biological mechanism, we present the design and development of a multi-terrain centimeter-scale skipping and crawling robot. The robot is predominantly 3D printed and features onboard sensing, computation, and power. It is equipped with two side fins for crawling, each integrated with a Hall effect sensor for gait control, while a rotary springtail driven by a 10 mm planetary gear motor enables continuous impulsive skipping across a range of substrates to achieve multi-terrain locomotion. We modeled and experimentally characterized the tail, identifying an optimal length of 25 mm that maximizes the mean propulsive force (4 N, peaks up to 6 N) for forward motion. In addition, we evaluated skipping on substrates where fin-based crawling alone fails, and varied the moisture content of uniform sand and bentonite clay powder to compare skipping with crawling. Skipping consistently produced higher mean velocities than crawling, particularly on viscous and granular media. Finally, outdoor tests on grass, loose sand, and hard ground confirmed that combining skipping on entangling and granular terrain with crawling on firm ground extends the operational range of the robot in real-world environments.

Abstract:
Autonomous manipulation systems have achieved remarkable capabilities, yet the integration of human expertise with diffusion-based policies in shared control remains relatively unexplored. In this paper, we propose Human-In-The-Loop Diffusion (HITL-D), a shared control framework that enhances user performance in multistep, insertion, and fine manipulation tasks. HITL-D leverages a novel combination of diffusion-based policies and human control to provide autonomous end effector orientation updates conditioned on a scene point cloud and the Cartesian position of the end effector. This approach reduces the number of joystick control axes required, thereby lowering mental workload. In a multi-task user study with 12 participants, HITL-D reduced average task completion times by 40%, decreased perceived workload by 37%, and improved Likert-scale ratings for independence, intuitiveness, and confidence compared to traditional teleoperation methods. These results demonstrate that HITL-D effectively integrates human expertise with autonomous assistance, improving both objective and subjective aspects of teleoperation.

Abstract:
Robotic manipulation commonly involves pick-and-place tasks in which regrasp may be necessary for low-dexterity manipulators. Many existing approaches rely on sampling, which becomes inefficient when repeated regrasp is required in high-dimensional configuration spaces. We propose a modular planning framework that comprises differentiable optimization-based modules: grasp generation, stable pose prediction, inverse kinematics solving, and path planning. The modular design yields a systematic pipeline, enabling direct pick-and-place, static or non-static release, and repeated regrasp by solving each module as needed. Each module leverages differentiable geometric features to efficiently solve its corresponding optimization problem. Our framework explicitly accounts for grasp constraints across both task scenes and predicts stable poses for regrasp planning via optimization rather than expensive physics simulations, thereby improving the feasibility and efficiency of planning. We validated the framework in pick-and-place simulations and real-world experiments.

Abstract:
In an era dominated by multi-sensor fusion, this paper explores the operational limits of LiDAR-only odometry. We introduce Onion-LO++, which is designed to overcome two practical limitations of Onion-LO: poor performance in geometrically degenerate environments and instability under high-motion conditions. In order to mitigate point cloud degradation, we propose a coarse-to-fine point cloud segmentation approach that extracts intensity and weak corner features from planar regions, while dynamically adjusting the downsampling rate based on the proportion of planar points to maximize geometric constraints. To handle high-motion scenarios, we integrate a continuous-time trajectory model into the backend optimization and introduce an adaptive onion factor that adjusts optimization parameters in real time. Extensive experiments on five challenging public datasets demonstrate that Onion-LO++ outperforms state-of-the-art methods and operates reliably across narrow spaces, degenerate scenes, high-speed motion, and high-altitude aerial mapping. We open-source the code on GitHub.

Abstract:
Transferring human skills to robots through learning from demonstrations has been an important topic in the robotics community, and many models have been developed for learning and adapting such skills. Among them, nonparametric representations are an appealing choice, since nonparametric solutions alleviate the explicit definition of basis functions, require fewer hyperparameters, and facilitate straightforward generalization for tasks involving high-dimensional inputs (e.g., human-robot collaboration and dual-arm manipulation). However, a commonly raised concern for nonparametric models is their computational complexity. In this paper, we propose a computationally efficient solution for nonparametric skill learning, whose computation time grows quadratically with the length of demonstrations, as opposed to the cubic growth in a standard nonparametric model. The solution is further improved by exploiting local models and fusing their predictions. We evaluate our approach in a 2-D writing task with time input, a 3-D human-guided obstacle avoidance task, and a dual-arm transportation task associated with 7-D input. The results show that our solution achieves comparable performance to the parametric method and enables instant adaptations in tasks associated with time or multi-dimensional inputs.

Abstract:
Accurate full-body motion prediction is essential for the safe, autonomous navigation of legged robots, enabling critical capabilities like limb-level collision checking in cluttered environments. Simplified kinematic models often fail to capture the complex, closed-loop dynamics of the robot and its low-level controller, limiting their predictions to simple planar motion. To address this, we present a learning-based observer-predictor framework that accurately predicts this motion. Our method features a neural observer with provable Uniformly Ultimately Bounded (UUB) guarantees that provides a reliable latent state estimate from a history of proprioceptive measurements. This stable estimate initializes a computationally efficient predictor, designed for the rapid, parallel evaluation of thousands of potential trajectories required by modern sampling-based planners. We validated the system by integrating our neural predictor into a Model Predictive Path Integral (MPPI)-based planner on a Vision 60 quadruped. Hardware experiments successfully demonstrated effective, limb-aware motion planning in a challenging, narrow passage and over small objects, highlighting our system's ability to provide a robust foundation for high-performance, collision-aware planning on dynamic robotic platforms.

Abstract:
Safety is of utmost importance in surgical robots, as they operate in a complex and dynamic environment where the patient's health depends directly on the surgical procedure's success. One of the main difficulties in the control of surgical manipulators is in efficiently encoding dynamic nonlinear safety constraints into trajectory planning and robot control strategies. Control Barrier Functions (CBFs) represent a valuable control method for safety-critical environments such as the surgical one since their rigorous formulation aims at ensuring safety in controlled dynamic systems. This work represents a step forward in autonomous surgical task execution since it defines Lipschitz-continuous critical and autonomously prioritized dynamic constraints enforced through a CBF framework for the safe execution of surgical robotic tasks. The proposed framework, moreover, leverages Dual Quaternion (DQ) algebra for a unified and computationally efficient representation of geometric tasks and constraints, allowing for the straightforward definition of complex, time-varying surgical constraints. The safety framework is tested in simulation on the da Vinci Research Kit (dVRK) CoppeliaSim simulator and with the real dVRK robot in several surgical sub-tasks.
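
The idea behind a CBF safety filter can be illustrated with a minimal scalar sketch. It assumes a one-dimensional single integrator with a position limit, which is far simpler than the dual-quaternion, prioritized constraints described above and is not the paper's formulation. The names `cbf_filter` and `simulate` and all parameter values are hypothetical.

```python
def cbf_filter(u_nom, x, x_max, alpha):
    """Minimally modify u_nom so that the barrier h(x) = x_max - x stays >= 0.

    For the single integrator x' = u, the CBF condition h' >= -alpha * h
    reads -u >= -alpha * (x_max - x), i.e. u <= alpha * (x_max - x).
    With one scalar constraint the usual QP has this closed-form solution.
    """
    h = x_max - x
    return min(u_nom, alpha * h)


def simulate(x0=0.0, x_max=1.0, u_nom=1.0, alpha=5.0, dt=0.01, steps=500):
    """Forward-Euler rollout of the filtered system; returns the visited states."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x += dt * cbf_filter(u_nom, x, x_max, alpha)
        trajectory.append(x)
    return trajectory
```

The nominal command drives the state toward the limit at full speed, and the filter only intervenes near the boundary, where it slows the approach so the state converges to `x_max` without crossing it. In this discrete-time rollout the guarantee relies on `alpha * dt <= 1`.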

Abstract:
Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are often evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned $\pi_0$ on a joint distribution of objects and initial conditions, and find that our approach saves over 20-25% of hardware evaluation effort to achieve similar bounds on policy performance.
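The bias-rectification idea can be sketched with the standard prediction-powered point estimate of a mean: the average of many simulation-only outcomes, corrected by the average real-minus-simulation gap measured on a small paired set. The sketch below covers only this point estimate and omits the non-asymptotic confidence intervals mentioned in the abstract; the function name and argument layout are assumptions for illustration.

```python
def ppi_mean(sim_only, paired_sim, paired_real):
    """Prediction-powered point estimate of mean real-world performance.

    sim_only:    outcomes from a large number of simulation-only trials
    paired_sim:  simulation outcomes for the small paired set
    paired_real: real-world outcomes for the same paired conditions
    """
    if not sim_only or not paired_real:
        raise ValueError("both data sets must be non-empty")
    if len(paired_sim) != len(paired_real):
        raise ValueError("paired sets must have equal length")

    sim_mean = sum(sim_only) / len(sim_only)
    # Rectifier: average sim-to-real gap observed on the paired trials.
    rectifier = sum(r - s for r, s in zip(paired_real, paired_sim)) / len(paired_real)
    return sim_mean + rectifier
```

For example, if simulation reports a 75% success rate but the paired trials show that simulation overestimates real success by 25 percentage points, the rectified estimate is 50%.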

Abstract:
Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven, chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces costly and slow physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image quality.

Abstract:
Neural surface reconstruction relies heavily on accurate camera poses as input. Despite utilizing advanced pose estimators like COLMAP or ARKit, camera poses can still be noisy. Existing pose-NeRF joint optimization methods handle poses with small noise (inliers) effectively but struggle with large noise (outliers), such as mirrored poses. In this work, we focus on mitigating the impact of outlier poses. Our method integrates an inlier-outlier confidence estimation scheme, leveraging scene graph information gathered during the data preparation phase. Unlike previous works directly using rendering metrics as the reference, we employ a detached color network that omits the viewing direction as input to minimize the impact caused by shape-radiance ambiguities. This enhanced confidence updating strategy effectively differentiates between inlier and outlier poses, allowing us to sample more rays from inlier poses to construct more reliable radiance fields. Additionally, we introduce a re-projection loss based on the current Signed Distance Function (SDF) and pose estimations, strengthening the constraints between matching image pairs. For outlier poses, we adopt a Monte Carlo re-localization method to find better solutions. We also devise a scene graph updating strategy to provide more accurate information throughout the training process. We validate our approach on the SG-NeRF and DTU datasets. Experimental results on various datasets demonstrate that our methods can consistently improve the reconstruction qualities and pose accuracies.

Abstract:
To address the challenge of achieving both low drag and high maneuverability in complex water-filled pipeline environments, this study proposes a novel dual-spherical pipeline robot with integrated leak detection and mapping capabilities. A multi-objective optimization framework was established to simultaneously improve hydrodynamic performance, motion stability, and internal spatial layout, while adopting a streamlined shell design to achieve both low-drag and sensor integration requirements. Based on a task-driven configuration optimization method, an energy-efficient propeller arrangement was derived under the constraint of maintaining maneuvering performance. The robot employs a helical differential propulsion system and integrates multiple sensors, including a vision module, an inertial navigation unit, and a pressure sensor, to enable leak detection and mapping. Its fully sealed spherical housing ensures stable operation in water-filled pipelines. Based on the proposed configuration, an experimental platform incorporating four representative pipeline environments was constructed, and a series of inspection, mapping, and environmental adaptability tests were conducted. The results demonstrate that the robot can achieve agile turning and stable locomotion in water-filled pipelines, showing strong potential for practical engineering applications.

Abstract:
Teleoperation is crucial for hazardous environment operations and serves as a key tool for collecting expert demonstrations in robot learning. However, existing methods face robotic hardware dependency and control frequency mismatches between teleoperation devices and robotic platforms. Our approach introduces a unified interface that automatically extracts kinematic parameters from Unified Robot Description Format (URDF) files, enabling plug-and-play deployment across diverse robotic systems. The proposed interpolation algorithm bridges the frequency gap between low-rate human inputs and high-frequency robotic control commands through online continuous trajectory generation, without requiring access to the closed, bottom-level control loop. To further reduce latency, a joint prediction module is incorporated to anticipate operator intent and compensate for delays. Moreover, we introduce a minimum-stretch spline to optimize motion smoothness and quality. The system supports both precision and rapid operation modes for different task requirements. Experiments on three robotic platforms, including dual-arm setups, demonstrate our framework's generality, smoothness, and responsiveness. Teleoperation latency remains below 50ms at 30Hz input and approaches 15ms at 200Hz input. The code is developed in C++ with a Python interface, and available at https://github.com/IRMV-Manipulation-Group/UTTG.

Abstract:
Traditional rigid actuators in soft robotics, particularly for bionic hands, suffer from structural complexity, bulkiness, and limited biomimetic motion. To address these limitations, we developed an electrospun composite fiber membrane composed of thermoplastic polyurethane (TPU) and liquid crystal elastomer (LCE), and demonstrated its feasibility as a tendon-like soft actuator in an artificial finger. TPU provides elasticity and mechanical robustness, while LCE contributes reversible thermal contraction as the actuation unit. The resulting TPU-LCE fibers exhibit high flexibility comparable to biological muscle and outstanding actuation performance. Under thermal stimulation, the actuator achieved a contraction strain of up to 44.4% and a load-bearing capacity exceeding 3,500 times its own weight, while maintaining durability over 120 actuation cycles without significant degradation. Integrated into a tendon-driven biomimetic finger, the actuator enabled smooth and natural joint motions, closely resembling human finger flexion-extension gestures. This work presents a reliable and scalable bio-inspired actuation strategy, offering promising potential for soft robotics applications.

Abstract:
In communicationless environments, multi-robot systems must operate without the constant information exchange that many coordination strategies typically assume. This paper presents a novel dynamic epistemic planning framework that enables implicit coordination and long-horizon planning through higher-order reasoning among robots. With our approach, robots form and propagate higher-order belief particles, update world beliefs using Bayesian inference, and select actions via a behavior tree that anticipates teammates' likely decisions. A temporally aware Model Predictive Path Integral (MPPI) controller integrates this reasoning into low-level execution, allowing robots to plan intercepts and adapt trajectories under partial observability. The proposed framework is evaluated in both simulations and physical experiments, where it consistently reduces task completion time compared to a first-order baseline, demonstrating that epistemic logic can serve as a robust foundation for resilient coordination in communication-restricted domains.

Abstract:
Fine-tolerance peg-in-hole manipulation demands high precision under contact-rich, nonsmooth dynamics, where irregular geometries, inclinations, and tight-clearance interference often cause model-free reinforcement learning (RL) to fail. We propose the Curriculum-Guided Temporal Haptic World Model (CG-THWM), which couples a world model with temporal haptic information and trains it via a staged curriculum. The world model supports efficient long-horizon planning with value estimation, while temporal haptic signals expose critical contact events; the curriculum stabilizes training and improves generalization. To enable rigorous evaluation, we construct a dataset for complex insertions that covers irregular, inclined, and interference-rich settings. In simulation, CG-THWM attains a 100% success rate on standard baselines and a 70% mean success rate in scenarios where conventional RL fails. These results highlight CG-THWM's potential for industrial and service applications.

Abstract:
Teleoperated ultrasound can improve diagnostic medical imaging access for remote communities. Having accurate force feedback is important for enabling sonographers to apply the appropriate probe contact force to optimize ultrasound image quality. However, large time delays in communication make direct force feedback impractical. Prior work investigated using point cloud-based model-mediated teleoperation and internal potential field models to estimate contact forces and torques. We expand on this by introducing a method to update the internal potential field model of the patient with measured positions, forces and torques for more transparent model-mediated tele-ultrasound. We first generate a point cloud model of the patient's surface and transmit this to the sonographer in a compact data structure. This is converted to a static voxelized volume where each voxel contains a potential field value. These values determine the forces and torques, which are rendered based on overlap between the voxelized volume and a point shell model of the ultrasound transducer. We solve for the potential field using a convex quadratic that combines the spatial Laplace operator with measured forces and torques. This was evaluated on volunteers (n=4) by assessing the accuracy of rendered forces and torques. Results showed the addition of measurements to the model reduced the force magnitude RMSE by an average of 7.42 N, the force vector angle error by an average of 3.71°, and the torque vector angle error by an average of 64.0° compared to using only Laplace's equation.

Abstract:
Navigating through dense human crowds remains a significant challenge for mobile robots. A key issue is the freezing robot problem, where the robot struggles to find safe motions and becomes stuck within the crowd. To address this, we propose HiCrowd, a hierarchical framework that integrates reinforcement learning (RL) with model predictive control (MPC). HiCrowd leverages surrounding pedestrian motion as guidance, enabling the robot to align with compatible crowd flows. A high-level RL policy generates a follow point to align the robot with a suitable pedestrian group, while a low-level MPC safely tracks this guidance with short-horizon planning. The method combines long-term, crowd-aware decision making with safe short-term execution. We evaluate HiCrowd against reactive and learning-based baselines in an offline setting (replaying recorded human trajectories) and an online setting (human trajectories are updated to react to the robot in simulation). Experiments on a real-world dataset and a synthetic crowd dataset show that our method outperforms the baselines in navigation efficiency and safety, while reducing freezing behaviors. We further validate HiCrowd through real-world deployment in a public museum and at Expo 2025 Osaka, where it navigates dense pedestrian flows without retraining, demonstrating robust and socially aware behavior. Our results suggest that leveraging human motion as guidance, rather than treating humans solely as dynamic obstacles, provides a powerful principle for safe and efficient robot navigation in crowds.

Abstract:
Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative assessments and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric information from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), TIGeR achieves state-of-the-art performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.

Abstract:
Existing works on controlling a concentric tube robot (CTR) mostly focus on the trajectory of its tip position or pose. To safely deploy CTRs in a confined lumen space, we propose to continuously steer a CTR so that its entire shape always attempts to approximate target curves over time. We focus on stiffness-dominant CTRs. Considering the differential geometry of such CTR shapes, we propose to work in the curvature domain to reduce the computational cost of searching for the CTR configuration. With our formulation, we model the curvature control of the CTR to find the optimal translation of each tube and then search for the rotation of the tubes to fit the target shapes. We demonstrate our method using sets of different target paths. The computational time per frame, ranging from 0.1 to 0.3 seconds across all experiments, highlights the efficiency of our approach in aligning the complete shape of the CTR with specified paths. Notably, for time-varying trajectories that could be reproduced by the CTR with its maximum deployment length reaching 150 mm, the root mean square error and median error were 0.98 mm and 0.46 mm, respectively.

Abstract:
X-band radar serves as the primary sensor on maritime vessels; however, its application in autonomous navigation has been limited by low sensor resolution and insufficient information content. To enable X-band radar-only autonomous navigation in maritime environments, this paper proposes a place recognition algorithm specifically tailored for X-band radar, incorporating an object density-based rule for efficient candidate selection and intentional degradation of radar detections to achieve robust retrieval performance. The proposed algorithm was evaluated on both public maritime radar datasets and our own collected dataset, and its performance was compared against state-of-the-art radar place recognition methods. An ablation study was conducted to assess the algorithm's performance sensitivity with respect to key parameters.

Abstract:
CoNi-MPC provides an efficient framework for UAV control in air-ground cooperative tasks by relying exclusively on relative states, eliminating the need for global state estimation. However, its lack of environmental information poses significant challenges for obstacle avoidance. To address this issue, we propose a novel obstacle avoidance algorithm, Cooperative Non-inertial frame-based Obstacle Avoidance (CoNi-OA), designed explicitly for UAV-UGV cooperative scenarios without reliance on global state estimation or obstacle prediction. CoNi-OA uniquely utilizes a single frame of raw LiDAR data from the UAV to generate a modulation matrix, which directly adjusts the quadrotor's velocity to achieve obstacle avoidance. This modulation-based method enables real-time generation of collision-free trajectories within the UGV's non-inertial frame, significantly reducing computational demands (less than 5 ms per iteration) while maintaining safety in dynamic and unpredictable environments. The key contributions of this work include: (1) a modulation-based obstacle avoidance algorithm specifically tailored for UAV-UGV cooperation in non-inertial frames without global states; (2) rapid, real-time trajectory generation based solely on single-frame LiDAR data, removing the need for obstacle modeling or prediction; and (3) adaptability to both static and dynamic environments, thus extending applicability to featureless or unknown scenarios.

Abstract:
Utilizing neural networks to predict potential regions containing optimal paths in advance and subsequently biasing the sampling probability towards these promising regions has been proven to effectively enhance the path planning efficiency of sampling-based algorithms. In complex scenarios, uniform sampling often leads to prolonged computation time, whereas the biased information provided by promising regions can guide the algorithm to reduce sampling in irrelevant areas, thereby significantly shortening the computation time. Undoubtedly, the accuracy of the promising regions is of paramount importance. Currently, the generalizability of many CNN- or Transformer-based promising region prediction models remains limited, often performing poorly in unknown environments. Incorrect region predictions may reduce the planning efficiency, sometimes even underperforming uniform sampling. This work aims to leverage diffusion models to generate more accurate promising regions, referred to as the DiffRP (Diffusion-based Region Prediction) model, thereby designing a non-uniform sampler to improve sampling efficiency and reduce computation time. We propose three paradigms for generating promising regions using diffusion models, among which we innovatively introduce a biased noise initialization method for the diffusion process. Specifically, we bias the mean of the noise distribution using obstacle maps and design a map-conditioned denoising model to progressively generate accurate promising regions from the biased noise. Experiments on public datasets demonstrate that our proposed DiffRP method outperforms existing state-of-the-art (SOTA) models by 30% in promising region prediction accuracy. Moreover, the non-uniform sampling alg

Abstract:
Path planning in large-scale, complex 3D environments is fundamentally constrained by a trade-off between path quality and computational speed. This paper presents RUSH (Recursive and Scalable 3D Coarse To Fine Path Planning), a hierarchical framework that resolves this trade-off. RUSH decomposes the long-range planning task into a coarse plan followed by fine-grained, independent subproblems that can be solved in parallel. These subproblems are addressed by a unified, diffusion-based network that refines an initial estimate path by learning its residual to an optimal path. This approach allows RUSH to leverage rich geometric information directly from 3D voxel maps without being bottlenecked by the full map's complexity. We validate our method on large-scale outdoor (KITTI, MulRan) and indoor (HM3D) datasets, each spanning a 200m×200m×6m map. Experimental results demonstrate that RUSH generates feasible, high-quality paths with remarkable efficiency, achieving up to a 12.59× speedup over a hierarchically accelerated A* baseline, while maintaining a path cost within 24% of the optimal solution. This performance gain positions RUSH as a powerful and practical solution for applications requiring rapid global path planning in large-scale 3D maps.

Abstract:
The Integrated Aerial Platform (IAP) uses multiple quadrotor sub-vehicles, acting as independent thrust generators, connected to a central platform via passive joints. This setup allows the sub-vehicles to collectively apply forces and torques to the central platform, achieving full six-degree-of-freedom (6-DoF) motion through coordinated thrust and posture adjustments. The IAP's modular design offers significant advantages in terms of mechanical simplicity, reconfigurability for diverse scenarios, and enhanced mission adaptability. This paper presents a comprehensive framework for IAP modeling and optimal design. We introduce a "design matrix" that encapsulates key architectural parameters, including the number of sub-vehicles, their spatial configuration, and the types of passive joints used. To improve control performance and ensure balanced wrench generation capabilities, we propose an optimized design strategy that minimizes the condition number of this design matrix. Two distinct IAP configurations were optimally designed based on two typical application scenarios. The efficacy of the proposed optimization methodology was subsequently validated through comparative analysis against unoptimized platforms. Moreover, the full actuation capability of the IAP was empirically confirmed via extensive simulations and real-world flight experiments, which also demonstrated its operational performance through a direct wrench control experiment.

Abstract:
Learned robot policies have consistently been shown to be versatile, but they typically have no built-in mechanism for handling the complexity of open environments, making them prone to execution failures; this implies that deploying policies without the ability to recognise and react to failures may lead to unreliable and unsafe robot behaviour. In this paper, we present a framework that couples a learned policy with a method to detect visual anomalies during policy deployment and to perform recovery behaviours when necessary, thereby aiming to prevent failures. Specifically, we train an anomaly detection model using data collected during nominal executions of a trained policy. This model is then integrated into the online policy execution process, so that deviations from the nominal execution can trigger a three-level sequential recovery process that consists of (i) pausing the execution temporarily, (ii) performing a local perturbation of the robot's state, and (iii) resetting the robot to a safe state by sampling from a learned execution success model. We verify our proposed method in two different scenarios: (i) a door handle reaching task with a Kinova Gen3 arm using a policy trained in simulation and transferred to the real robot, and (ii) an object placing task with a UFactory xArm 6 using a general-purpose policy model. Our results show that integrating policy execution with anomaly detection and recovery increases the execution success rate in environments with various anomalies, such as trajectory deviations and adversarial human interventions.

Abstract:
This paper presents a shear-based control scheme for grasping and manipulating delicate objects with a Pisa/IIT anthropomorphic SoftHand equipped with soft biomimetic tactile sensors on all five fingertips. These 'microTac' tactile sensors are miniature versions of the TacTip vision-based tactile sensor, and can extract precise contact geometry and force information at each fingertip for use as feedback into a controller to modulate the grasp while a held object is manipulated. Using a parallel processing pipeline, we asynchronously capture tactile images and predict contact pose and force from multiple tactile sensors. Consistent pose and force models across all sensors are developed using supervised deep learning with transfer learning techniques. We then develop a grasp control framework that uses contact force feedback from all fingertip sensors simultaneously, allowing the hand to safely handle delicate objects even under external disturbances. This control framework is applied to several grasp-manipulation experiments: first, retaining a flexible cup in a grasp without crushing it under changes in object weight; second, a pouring task where the center of mass of the cup chang

Abstract:
Estimating 3D geometry from monocular colonoscopy images is challenging due to non-Lambertian surfaces, moving light sources, and large textureless regions. While recent 3D geometric foundation models eliminate the need for multi-stage pipelines, their performance deteriorates in clinical scenes. These models are primarily trained on natural scene datasets and struggle with specularity and homogeneous textures typical in colonoscopy, leading to inaccurate geometry estimation. In this paper, we present ColonAdapter, a self-supervised fine-tuning framework that adapts geometric foundation models for colonoscopy geometry estimation. Our method leverages pretrained geometric priors while tailoring them to clinical data. To improve performance in low-texture regions and ensure scale consistency, we introduce a Detail Restoration Module (DRM) and a geometry consistency loss. Furthermore, a confidence-weighted photometric loss enhances training stability in clinical environments. Experiments on both synthetic and real datasets demonstrate that our approach achieves state-of-the-art performance in camera pose estimation, monocular depth prediction, and dense 3D point map reconstruction, without requiring ground-truth intrinsic parameters.

Abstract:
Robotic-assisted procedures offer enhanced precision. However, fully autonomous systems are limited by incomplete task knowledge, difficulties in modeling unstructured environments, and weak generalisation abilities, while fully manual teleoperated systems face challenges such as delay, stability issues, and reduced sensory information. To address these, we developed an interactive control strategy that assists the human operator by predicting their motion plan at both high and low levels. At the high level, a surgeme recognition system is employed through a Transformer-based real-time gesture classification model to dynamically adapt to the operator's actions, while at the low level, a Confidence-based Intention Assimilation Controller adjusts robot actions based on user intent and shared control paradigms. The system is built around a robotic suturing task, supported by sensors that capture the kinematics of the robot and task dynamics. Experiments across users with varying skill levels demonstrated the effectiveness of the proposed approach, showing statistically significant improvements in task completion time and user satisfaction compared to traditional teleoperation.

Abstract:
Vision-language models (VLMs) excel in visual understanding but often lack reliable grounding capabilities and actionable inference rates. Integrating them with open-vocabulary object detection (OVD), instance segmentation, and tracking leverages their strengths while mitigating these drawbacks. We utilize VLM-generated structured descriptions to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector to extract corresponding bounding boxes that are passed to a video segmentation model providing segmentation masks and tracking. Once initialized, this model directly extracts segmentation masks, processing image streams in real time with minimal computational overhead. Tracks can be updated online as needed by generating new structured descriptions and detections. This combines the descriptive power of VLMs with the grounding capability of OVD and the pixel-level understanding and speed of video segmentation. Our evaluation across datasets and robotics platforms demonstrates the broad applicability of this approach, showcasing its ability to extract task-specific attributes from non-standard objects in dynamic environments.

Abstract:
This work demonstrates a front-flip on bicycle robots via reinforcement learning, particularly by imitating reference motions that are infeasible and imperfect. To address this, we propose Iterative Motion Imitation (IMI), a method that iteratively imitates trajectories generated by prior policy rollouts. Starting from an initial reference that is kinematically or dynamically infeasible, IMI helps train policies that lead to feasible and agile behaviors. We demonstrate our method on Ultra-Mobility Vehicle (UMV), a bicycle robot that is designed to enable agile behaviors. From a self-colliding table-to-ground flip reference generated by a model-based controller, we are able to train policies that enable ground-to-ground and ground-to-table front-flips. We show that compared to a single-shot motion imitation, IMI results in policies with higher success rates and can transfer robustly to the real world. To our knowledge, this is the first unassisted acrobatic flip behavior on such a platform.

Abstract:
Robot-assisted dressing remains challenging due to the close physical human-robot interaction and the highly deformable nature of garments. This work presents a purely vision-based approach that transfers human-mastered dressing skills to robots while accommodating dynamic human arm movements. The proposed method adopts a hierarchical structure. At the high level, a diffusion model serves as the policy to learn action distributions conditioned on point cloud observations. During execution, a diffused scalar field is constructed to infer an object-centric axial distribution of the human arm from cluttered points. Local point cloud registration across consecutive frames further captures arm motion, enabling real-time adaptation of robot actions to user dynamics. Comprehensive evaluations have been conducted in both simulation and real-world dressing scenarios using a UR10e robot with human participants of diverse genders and body types.

Abstract:
Mobile manipulators must coordinate end-effector (EE) tracking and mobile base motion to perform manipulation tasks robustly. However, even when the same EE trajectory is feasible, different base poses can lead to substantially different manipulator configurations, manipulability levels, and proximity to singularities. Thus, accurate EE tracking does not guarantee kinematically suitable whole-body behavior. To address this issue, a hierarchical framework is proposed that combines 1) a manipulator controller for EE tracking considering base motion, 2) an inverse reachability map (IRM) that encodes kinematically feasible base regions for the current and predicted EE states, and 3) a model predictive controller (MPC) that optimizes base velocity using the IRM as a soft cost. In the proposed architecture, the manipulator executes the task, the IRM evaluates which base regions are more reachable for the task, and the MPC generates base motion accordingly. Simulation results demonstrate that the proposed method improves manipulability while maintaining accurate EE tracking, highlighting the importance of reachability-aware base behavior in mobile manipulation.
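The dependence of manipulability on base placement can be seen in a toy case. The sketch below assumes a planar two-link arm (not the manipulator used in the paper) and computes the Yoshikawa manipulability measure, which for this arm equals l1 * l2 * |sin(q2)|, as a function of the distance between the base and a fixed EE target. The function name and default link lengths are hypothetical.

```python
import math


def manipulability_at_distance(d, l1=1.0, l2=1.0):
    """Yoshikawa manipulability of a planar 2R arm reaching a target at
    distance d from its base (illustrative toy model).

    The elbow angle q2 follows from the law of cosines:
        d^2 = l1^2 + l2^2 + 2 * l1 * l2 * cos(q2)
    and the manipulability is |det J| = l1 * l2 * |sin(q2)|.
    """
    if d < abs(l1 - l2) or d > l1 + l2:
        raise ValueError("target is outside the reachable workspace")
    cos_q2 = (d * d - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    # Clamp against rounding error before taking the square root.
    cos_q2 = max(-1.0, min(1.0, cos_q2))
    return l1 * l2 * math.sqrt(1.0 - cos_q2 * cos_q2)
```

With unit links, the same EE target is reached with maximal manipulability when the base sits about 1.41 link lengths away (elbow at 90 degrees), and with zero manipulability when the arm is fully stretched at distance 2, which is the kind of difference a reachability-aware base controller can exploit.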

Abstract:
Accurate needle steering using bevel-tip needles remains challenging due to nonlinear needle-tissue interactions and structural limitations of conventional robotic insertion systems in imaging-guided environments. This paper presents a cable-driven parallel robot (CDPR)-based needle steering framework that enables curvature-induced steering through coordinated control of the needle base pose. The proposed system provides 6-DoF needle orientation control using eight cables and an additional Bowden cable mechanism for axial rotation. Phantom insertion experiments demonstrate that steering direction can be regulated by bevel-tip orientation and that obstacle-avoidance insertion toward a desired target location is achievable. These results confirm the feasibility of CDPR-based needle steering for imaging-compatible minimally invasive intervention scenarios.

Abstract:
Dropped head syndrome, caused by neck muscle weakness from neurological diseases, severely impairs an individual's ability to support and move their head, causing pain and making everyday tasks challenging. Our long-term goal is to develop an assistive powered neck exoskeleton that restores natural movement. However, predicting a user's intended head movement remains a key challenge. We leverage virtual reality (VR) to collect coupled eye and head movement data from healthy individuals to train models capable of predicting head movement based solely on eye gaze. We also propose a novel multi-layer controller selection framework, where head control strategies are evaluated across decreasing levels of abstraction, from simulation and VR to a physical neck exoskeleton. This pipeline effectively rejects poor-performing controllers early, identifying two novel gaze-driven models that achieve strong performance when deployed on the physical exoskeleton. Our results reveal that no single controller is universally preferred, highlighting the necessity for personalization in gaze-driven assistive control. Our work demonstrates the utility of VR-based evaluation for accelerating the development of intuitive, safe, and personalized assistive robots.

Abstract:
Collaborative robots must quickly adapt to their partners' intent and preferences to proactively identify helpful actions. This is especially true in situated settings where human partners can continually teach robots new high-level behaviors, visual concepts, and physical skills (e.g., through demonstration), growing the robot's capabilities as the human-robot pair work together to accomplish diverse tasks. In this work, we argue that robots should be able to infer their partners' goals from early interactions and use this information to proactively plan behaviors ahead of explicit instructions from the user. Building from the strong commonsense priors and steerability of large language models, we introduce ProVox (Proactive Voice), a novel framework that enables robots to efficiently personalize and adapt to individual collaborators. We design a meta-prompting protocol that empowers users to communicate their distinct preferences, intent, and expected robot behaviors ahead of starting a physical interaction. ProVox then uses the personalized prompt to condition a proactive language model task planner that anticipates a user's intent from the current interaction context and robot capabilities to suggest helpful actions; in doing so, we alleviate user burden, minimizing the amount of time partners spend explicitly instructing and supervising the robot. We evaluate ProVox through user studies grounded in household manipulation tasks (e.g., assembling lunch bags) that measure the efficiency of the collaboration, as well as features such as perceived helpfulness, ease of use, and reliability. Our analysis suggests that both meta-prompting and proactivity are critical, resulting in 38.7% faster task completion times and 31.9% less user burden relative to non-active baselines. Videos and Supplementary Material: https://provox-2025.github.io

Abstract:
Acoustic feedback is a critical indicator for assessing the contact condition between the tool and the workpiece when humans perform grinding tasks with rotary tools. In contrast, robotic grinding systems typically rely on force sensing, with acoustic information largely ignored. This reliance on force sensors is costly and difficult to adapt to different grinding tools, whereas audio sensors (microphones) are low-cost and can be mounted on any medium that conducts grinding sound. This paper introduces a low-cost Acoustic Feedback Robotic Grinding System (AFRG) that captures audio signals with a contact microphone, estimates grinding force from the audio in real time, and enables closed-loop force control of the grinding process. Compared with conventional force-sensing approaches, AFRG achieves a 4-fold improvement in consistency across different grinding disc conditions. AFRG relies solely on a low-cost microphone, which is approximately 200-fold cheaper than conventional force sensors, as the sensing modality, providing an easily deployable, cost-effective robotic grinding solution.

Abstract:
While navigating unknown environments, robots rely primarily on proximate features for guidance in decision making, such as depth information from lidar or stereo to build a costmap, or local semantic information from images. The limited range over which these features can be used may result in poor robot behavior when assumptions about the cost of the map beyond the range of proximate features misguide the robot. Integrating far-field image features that originate beyond these proximate features into the mapping pipeline has the promise of enabling more intelligent and aware navigation through unknown terrain. To navigate with far-field features, key challenges must be overcome. As far-field features are typically too distant to localize precisely, they are difficult to place in a map. Additionally, the large distance between the robot and these features makes connecting these features to their navigation implications more challenging. We propose FITAM, an approach that learns to use far-field features to predict costs to guide navigation through unknown environments from previous experience in a self-supervised manner. Unlike previous work, our approach does not rely on flat ground plane assumptions or range sensors to localize observations. We demonstrate the benefits of our approach through simulated trials and real-world deployment on a Clearpath Robotics Warthog navigating through a forest environment.

Abstract:
Perception algorithms are ubiquitous in modern autonomy stacks, providing necessary environmental information to operate in the real world. Many of these algorithms depend on the visibility of keypoints, which must remain within the robot's line-of-sight (LoS), for reliable operation. This paper tackles the challenge of maintaining LoS on such keypoints during robot movement. We propose a novel method that addresses these issues by ensuring applicability to various sensor footprints, adaptability to arbitrary nonlinear system dynamics, and constant enforcement of LoS throughout the robot's path. Our experiments show that the proposed approach achieves significantly reduced LoS violation and runtime compared to existing state-of-the-art methods in several representative and challenging scenarios.

Abstract:
Reliable, drift-free global localization presents significant challenges yet remains crucial for autonomous navigation in large-scale dynamic environments. In this paper, we introduce a tightly-coupled Semantic-LiDAR-Inertial-Wheel Odometry fusion framework, which is specifically designed to provide high-precision state estimation and robust localization in large-scale dynamic environments. Our framework leverages an efficient semantic-voxel map representation and employs an improved scan matching algorithm, which utilizes global semantic information to significantly reduce long-term trajectory drift. Furthermore, it seamlessly fuses data from LiDAR, IMU, and wheel odometry using a tightly-coupled multi-sensor fusion Iterative Error-State Kalman Filter (iESKF). This ensures reliable localization without experiencing abnormal drift. Moreover, to tackle the challenges posed by terrain variations and dynamic movements, we introduce a 3D adaptive scaling strategy that allows for flexible adjustments to wheel odometry measurement weights, thereby enhancing localization precision. This study presents extensive real-world experiments conducted in a one-million-square-meter automated port, encompassing 3,575 hours of operational data from 35 Intelligent Guided Vehicles (IGVs). The results consistently demonstrate that our system outperforms state-of-the-art LiDAR-based localization methods in large-scale dynamic environments, highlighting the framework's reliability and practical value.

Abstract:
Light Detection and Ranging (LiDAR) sensors have become a de-facto sensor for many robot state estimation tasks, spurring development of many LiDAR Odometry (LO) methods in recent years. While some smoothing-based LO methods have been proposed, most require matching against multiple scans, resulting in sub-real-time performance. Due to this, most prior works estimate a single state at a time and are "submap"-based. This architecture propagates any error in pose estimation to the fixed submap and can cause jittery trajectories and degrade future registrations. We propose Fixed-Lag Odometry with Reparative Mapping (FORM), a LO method that performs smoothing over a densely connected factor graph while utilizing a single iterative map for matching. This allows for both real-time performance and active correction of the local map as pose estimates are further refined. We evaluate on a wide variety of datasets to show that FORM is robust, accurate, real-time, and provides smooth trajectory estimates when compared to prior state-of-the-art LO methods.

Abstract:
Visual servoing is fundamental to robotic applications, enabling precise positioning and control. However, applying it to textureless objects remains a challenge due to the absence of reliable visual features. Moreover, adverse visual conditions, such as occlusions, often corrupt visual feedback, leading to reduced accuracy and instability in visual servoing. In this work, we build upon learning-based keypoint detection for textureless objects and propose a method that enhances robustness by tightly integrating perception and control in a closed loop. Specifically, we employ an Extended Kalman Filter (EKF) that integrates per-frame keypoint measurements to estimate 6D object pose, which drives pose-based visual servoing (PBVS) for control. The resulting camera motion, in turn, enhances the tracking of subsequent keypoints, effectively closing the perception-control loop. Additionally, unlike standard PBVS, we propose a probabilistic control law that computes both camera velocity and its associated uncertainty, enabling uncertainty-aware control for safe and reliable operation. We validate our approach on real-world robotic platforms using quantitative metrics and grasping experiments, demonstrating that our method outperforms traditional visual servoing techniques in both accuracy and practical application.

Abstract:
In this paper, we present a receding-horizon, sampling-based planner capable of reasoning over multimodal policy distributions. By using the cross-entropy method to optimize a multimodal policy under a common cost function, our approach increases robustness against local minima and promotes effective exploration of the solution space. We show that our approach naturally extends to multi-robot collision-free planning, enables agents to share diverse candidate policies to avoid deadlocks, and allows teams to minimize a global objective without incurring the computational complexity of centralized optimization. Numerical simulations demonstrate that employing multiple modes significantly improves success rates in trap environments and in multi-robot collision avoidance. Hardware experiments further validate the approach's real-time feasibility and practical performance.

Abstract:
This work introduces an analytical approach for detecting and estimating external forces acting on deformable linear objects (DLOs) using only their observed shapes. In many robot-wire interaction tasks, contact occurs not at the end-effector but at other points along the robot's body. Such scenarios arise when robots manipulate wires indirectly (e.g., by nudging) or when wires act as passive obstacles in the environment. Accurately identifying these interactions is crucial for safe and efficient trajectory planning, helping to prevent wire damage, avoid restricted robot motions, and mitigate potential hazards. Existing approaches often rely on expensive external force-torque sensors or assume that contacts occur at the end-effector for accurate force estimation. Using wire shape information acquired from a depth camera and under the assumption that the wire is in or near its static equilibrium, our method estimates both the location and magnitude of external forces without additional prior knowledge. This is achieved by exploiting derived consistency conditions and solving a system of linear equations based on force-torque balance along the wire. The approach was validated through simulation, where it achieved high accuracy, and through real-world experiments, where accurate estimation was demonstrated in selected interaction scenarios.

Abstract:
Transparent object depth perception remains a major challenge in robotics and logistics due to the limitations of standard 3D sensors in capturing accurate depth on transparent and reflective surfaces. This affects applications relying on depth maps and point clouds, particularly in robotic manipulation. To address this, we propose ClearDepth, a vision transformer-based algorithm for stereo depth recovery of transparent objects, enhanced by a novel feature post-fusion module that refines depth estimation using structural visual features. To mitigate the high costs of stereo dataset collection, we introduce a physically realistic, domain-adaptive Sim2Real framework for efficient data generation. Our method outperforms state-of-the-art stereo matching approaches on transparent depth recovery. Furthermore, in transparent object grasping experiments, ClearDepth improves transparent-scene perception and achieves at least an 18% higher grasp success rate compared to the state-of-the-art methods for transparent object manipulation. Our method demonstrates strong Sim2Real generalization, enabling precise depth perception of transparent objects for robotic applications in the real world. Dataset and project details are available at https://sites.google.com/view/cleardepth-anonymous.

Abstract:
Aerial-ground localization is difficult due to large viewpoint and modality gaps between ground-level LiDAR and overhead imagery. We propose TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context. LiDAR scans are projected into a bird's-eye-view representation and aligned with aerial features through bidirectional attention, followed by a likelihood map decoder that outputs spatial probability distributions over position and orientation. A contrastive learning module enforces a shared embedding space to improve cross-modal alignment. Experiments on CARLA and KITTI show that TransLocNet outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy. These results demonstrate that TransLocNet provides robust and generalizable aerial-ground localization in both synthetic and real-world settings.

Abstract:
We present a novel dual-end complementary method for online depth estimation and mesh reconstruction, termed DepthMesh. Unlike most existing state-of-the-art methods that produce either only depth online or surface mesh offline, our method tightly couples online multiview depth estimation and Truncated Signed Distance Function (TSDF) reconstruction to achieve fast online mesh reconstruction. For each keyframe from 6DoF tracking, we first obtain the prior depth and normal maps via ultra-fast raycasting from TSDF, which is incrementally fused from historical keyframe depths. Then, these priors, combined with segmentation results, are used to generate local planar hypotheses that optimize both depth accuracy and computational efficiency. Finally, the optimized depth estimates further enhance the accuracy of mesh reconstruction. Through this dual-end complementary mechanism, our system achieves high accuracy and efficiency. Experiments with qualitative and quantitative evaluations on the ScanNetV2 and self-collected datasets demonstrate the effectiveness of our method. Our method can generate depth and mesh online with accuracy (< 3 cm) on mobile devices, which is useful for robotic autonomous navigation and mixed reality applications such as real-time occlusion and collision handling.
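
The incremental TSDF fusion mentioned above can be sketched with the standard weighted running-average update, shown here for voxels along a single camera ray. This is a generic illustration and not the DepthMesh implementation. The function name, the unit per-observation weight, the truncation distance, and the sample depths are all assumptions.

```python
def fuse_depth_into_tsdf(tsdf, weights, voxel_depths, measured_depth,
                         trunc=0.15, max_weight=64.0):
    """Fuse one depth measurement into TSDF voxels lying along one camera ray.

    tsdf, weights: per-voxel lists updated in place.
    voxel_depths: depth of each voxel centre along the ray (metres).
    """
    for i, z in enumerate(voxel_depths):
        sdf = measured_depth - z          # positive in front of the surface
        if sdf < -trunc:
            continue                      # far behind the surface: occluded, skip
        d = min(1.0, sdf / trunc)         # truncate and normalise to [-1, 1]
        w_old = weights[i]
        # Running average with an assumed weight of 1 per new observation.
        tsdf[i] = (w_old * tsdf[i] + d) / (w_old + 1.0)
        weights[i] = min(w_old + 1.0, max_weight)
    return tsdf, weights

voxel_depths = [0.8, 0.9, 1.0, 1.1, 1.2]
tsdf = [1.0] * len(voxel_depths)          # unobserved voxels start as "free"
weights = [0.0] * len(voxel_depths)
for depth in (1.0, 1.04):                 # two illustrative keyframe depths
    fuse_depth_into_tsdf(tsdf, weights, voxel_depths, depth)
print(tsdf, weights)
```

After both updates the fused values change sign between the voxels at 1.0 m and 1.1 m, which is where a raycast or mesh extraction step would place the surface.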

Abstract:
3D occupancy prediction is critical for comprehensive scene understanding in vision-centric autonomous driving. Recent advances have explored utilizing 3D semantic Gaussians to model occupancy while reducing computational overhead, but they remain constrained by insufficient multi-view spatial interaction and limited multi-frame temporal consistency. To overcome these issues, in this paper, we propose a novel Spatial-Temporal Gaussian Splatting (ST-GS) framework to enhance both spatial and temporal modeling in existing Gaussian-based pipelines. Specifically, we develop a guidance-informed spatial aggregation strategy within a dual-mode attention mechanism to strengthen spatial interaction in Gaussian representations. Furthermore, we introduce a geometry-aware temporal fusion scheme that effectively leverages historical context to improve temporal continuity in scene completion. Extensive experiments on the large-scale nuScenes occupancy prediction benchmark showcase that our proposed approach not only achieves state-of-the-art performance but also delivers markedly better temporal consistency compared to existing Gaussian-based methods.

Abstract:
Recent evidence has shown that, contrary to expectations, it is difficult for novices to teach robots tasks through learning from demonstration (LfD). Novices often struggle with understanding the relationship between robot states and actions, leading to suboptimal demonstrations. This paper introduces a framework that leverages machine teaching algorithms to train novices in a controlled, ideal environment where optimal control parameters are predefined. The training enables participants to internalise fundamental control principles, preparing them to adapt to new skills that share similar properties. The study evaluates whether such teaching ability is (i) retained beyond the training period (including a long-term follow-up) and (ii) generalised so that novices teach robots more effectively in environments where control parameters are not predefined. It reports a series of between-subjects studies that demonstrate that trained novice teachers achieve a 75% improvement in teaching ability, with these gains retained even after guidance is removed, and exhibit a 71% enhancement in applying skills beyond the training content.

Abstract:
The pooling layer plays a vital role in aggregating local descriptors into a metrizable global descriptor in LiDAR Place Recognition (LPR). In particular, second-order pooling is capable of capturing higher-order interactions among local descriptors. However, existing second-order pooling methods in LPR adhere to conventional implementations and post-normalization, which render the descriptor unsuitable for Euclidean distance comparison. Based on the recent interpretation that associates NetVLAD with the second-order statistics, we propose to integrate second-order pooling with the inductive bias from Voronoi cells. Our novel pooling method aggregates local descriptors to form the second-order matrix and whitens the global descriptor to implicitly measure the Mahalanobis distance while conserving the cluster property from Voronoi cells, addressing its numerical instability during learning with diverse techniques. We demonstrate its performance gains through the experiments conducted on the Oxford Robotcar and Wild-Places benchmarks and analyze the numerical effect of the proposed whitening algorithm.

Abstract:
Autonomous inspection robots for monitoring industrial sites can reduce costs and risks associated with human-led inspection. However, accurate readings can be challenging due to occlusions, limited viewpoints, or unexpected environmental conditions. We propose a hybrid framework that combines supervised failure classification with anomaly detection, enabling classification of inspection tasks as a success, known failure, or anomaly (i.e., out-of-distribution) case. Our approach uses a world model backbone with compressed video inputs. This policy-agnostic, distribution-free framework determines classifications based on two decision functions set by conformal prediction (CP) thresholds before a human observer. We evaluate the framework on gauge inspection feeds collected from office and industrial sites and demonstrate real-time deployment on a Boston Dynamics Spot. Experiments show over 90% accuracy in distinguishing between successes, failures, and OOD cases, with classifications occurring earlier than a human observer. These results highlight the potential for robust, anticipatory failure detection in autonomous inspection tasks or as a feedback signal for model training to assess and improve the quality of training data. Project website: https://autoinspection-classification.github.io/
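
How a conformal prediction threshold can be set from calibration data is sketched below using generic split conformal prediction. This is not the paper's two decision functions. The function name, the integer scores, and the miscoverage level are illustrative assumptions.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold from calibration nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score. Under
    exchangeability, a new score falls at or below it with probability
    at least 1 - alpha. Returns infinity when n is too small for alpha.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]

# Illustrative calibration scores (higher means more atypical).
cal_scores = list(range(20, 0, -1))
threshold = conformal_threshold(cal_scores, alpha=0.1)

def is_anomalous(score):
    """Flag a new sample whose score exceeds the calibrated threshold."""
    return score > threshold

print(threshold, is_anomalous(25), is_anomalous(7))
```

With 20 calibration scores and alpha = 0.1, the rank is ceil(21 * 0.9) = 19, so the threshold is the 19th smallest score.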

Abstract:
Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations introduces critical failures in decision-making, posing significant safety and reliability risks in physical deployments. This challenge is exacerbated by the open-ended nature of real-world tasks, where questions vary vastly in difficulty and modality, demanding robust and adaptable reasoning strategies. To tackle this, we propose the Pseudocode-guided Structured Reasoning framework (PStar), which adaptively selects structured pseudocode reasoning paths to help VLMs perform flexible and step-by-step reasoning. We design a set of abstract reasoning functions and formulate a structured pseudocode library to represent modular reasoning strategies. Crucially, we design a Difficulty Feature Vector (DFV) that allows the model to assess question complexity and adaptively choose appropriate reasoning strategies, enhancing robustness and interpretability. Extensive experiments demonstrate that PStar significantly reduces hallucination rates, achieving state-of-the-art scores of 87.1% on POPE and 68.0% on MMStar, outperforming even GPT-4V. By providing a validated mechanism to reduce visual-language errors, PStar offers a critical step toward deploying more trustworthy and deterministic VLMs for real-world automated systems, where such errors can lead to catastrophic outcomes.

Abstract:
Currently, many manipulation tasks for deformable objects focus on activities like folding clothes, handling ropes, and manipulating bags. However, research on contact-rich tasks involving deformable objects remains relatively underdeveloped. When humans use cloth or sponges to wipe surfaces, they rely on both vision and tactile feedback. Yet, current algorithms still face challenges with issues like occlusion, while research on tactile perception for manipulation is still evolving and requires further development. Tasks such as covering surfaces with deformable objects demand not only perception but also precise robotic manipulation. To address this, we propose a method that leverages efficient and accessible simulators for task execution. Specifically, we train a reinforcement learning agent in a simulator to manipulate deformable objects for surface wiping tasks. We simplify the state representation of object surfaces using UV mapping, process contact feedback from the simulator on 2D feature maps, and use scaled grouped convolutions to extract features from these maps. The agent then outputs actions in a reduced-dimensional action space to generate coverage paths. Experiments demonstrate that our method outperforms previous approaches in key metrics, including total path length and coverage area. We deploy these paths on the Kinova Gen3 manipulator to perform wiping experiments on the back of a torso model, validating the feasibility of our approach.

Abstract:
This work proposes a 3D autoencoder architecture, named LiLa-Net, which encodes efficient features from real traffic environments, employing only the LiDAR's point clouds. For this purpose, we use a real semi-autonomous vehicle equipped with a Velodyne LiDAR. The system leverages the concept of skip connections to improve performance without requiring the extensive resources of state-of-the-art architectures. Key changes include reducing the number of encoder layers and simplifying the skip connections, while still producing an efficient and representative latent space that allows the original point cloud to be reconstructed accurately. Furthermore, an effective balance has been achieved between the information carried by the skip connections and the latent encoding, leading to improved reconstruction quality without compromising performance. Finally, the model successfully reconstructs objects unrelated to the original traffic environment.

Abstract:
Mobile robots operating in human-centered environments must generate not only collision-free paths but also trajectories that follow local behavioral conventions. Conventional costmap-based navigation emphasizes geometric feasibility and often overlooks such requirements, which can result in socially inappropriate behaviors. This paper presents NORM-Nav, a zero-shot framework that integrates natural language behavioral constraints into costmap-based planning. An LLM parses each instruction into structured constraints and grounds them using real-time vision-LiDAR perception. These constraints are encoded as multi-layer costmaps that represent geometric, semantic, directional, and velocity cues and are directly compatible with standard grid-based planners. Simulation and real-world experiments indicate that NORM-Nav improves task success rates and produces trajectories closer to human references than representative baselines. The project website is available at https://ei-nav.github.io/NORM-Nav.

Abstract:
We present a supervised autonomous robot-assisted ureteroscopy (ARA-URS) system for the treatment of kidney stones, integrated with a digital-twin (DT). A three degrees of freedom (3-DOF) robotic system was developed to actuate a Wolf disposable ureteroscope, enabling the ARA-URS to autonomously position the ureteroscope for laser lithotripsy procedures. The DT of the robotic system was developed to enable the generation of diverse synthetic intraoperative scenarios, which are used to train vision and control models for improved precision in ARA-URS. By mapping joint motion to virtual counterparts via precise physics simulation, the system ensures realistic representation and reliable validation. The framework's performance was assessed with particular focus on endoscopic tip positioning. Initial in-air simulation experiments demonstrated root mean square error (RMSE) values of (1.871, 1.725, 1.194) mm in the x-, y-, and z-directions, respectively, computed as the deviation between the desired laser target position and the achieved ureteroscope tip position. Corresponding real-world experiments yielded RMSE values of (5.029, 3.919, 6.681) mm. The comparison between simulated and physical experiments indicates that the DT is able to reproduce the motion behavior of the physical system with good agreement. Further benchtop and simulation experiments demonstrated the system's capacity for stone targeting (quantified by the percentage of the image occluded by stone). In digital simulation, the ureteroscope achieved 88.77% average stone area coverage, while in the benchtop model, coverage averaged 59.04%. Together, this proof of concept highlights the potential of DT technology in robotic-assisted URS, offering a scalable and interactive platform for refining surgical techniques.
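
The per-axis RMSE between desired target positions and achieved tip positions can be computed as in the sketch below. The function name and the sample positions are made up for illustration and are not data from the paper.

```python
import math

def per_axis_rmse(desired, achieved):
    """Per-axis RMSE between paired (x, y, z) positions, in the input units."""
    if not desired or len(desired) != len(achieved):
        raise ValueError("need two non-empty position lists of equal length")
    squared_error_sums = [0.0, 0.0, 0.0]
    for d, a in zip(desired, achieved):
        for axis in range(3):
            squared_error_sums[axis] += (a[axis] - d[axis]) ** 2
    n = len(desired)
    return tuple(math.sqrt(s / n) for s in squared_error_sums)

# Illustrative positions in millimetres (assumed values).
desired = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
achieved = [(3.0, 0.0, 1.0), (4.0, 0.0, -1.0)]
rmse_x, rmse_y, rmse_z = per_axis_rmse(desired, achieved)
print(rmse_x, rmse_y, rmse_z)
```

Reporting the three axes separately, as the abstract does, shows which direction dominates the positioning error instead of folding everything into one Euclidean figure.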

Abstract:
Autonomous navigation often requires the simultaneous optimization of multiple objectives. The most common approach scalarizes these into a single cost function using a weighted sum, but this method is unable to find all possible trade-offs and can therefore miss critical solutions. An alternative, the weighted maximum of objectives, can find all Pareto-optimal solutions, including those in non-convex regions of the trade-off space that weighted sum methods cannot find. However, the increased computational complexity of finding weighted maximum solutions in the discrete domain has limited its practical use. To address this challenge, we propose a novel search algorithm based on the Large Neighbourhood Search framework that efficiently solves the weighted maximum planning problem. Through extensive simulations, we demonstrate that our algorithm achieves comparable solution quality to existing weighted maximum planners with a runtime improvement of 1-2 orders of magnitude, making it a viable option for autonomous navigation.
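
The difference between the two scalarizations can be seen in a small sketch with three hypothetical candidate paths, each with two costs to minimize. Path C is Pareto-optimal but lies in a non-convex region of the trade-off space. The candidate costs, the function names, and the weight sweep are illustrative assumptions. The sketch enumerates candidates by brute force and does not implement the proposed Large Neighbourhood Search algorithm.

```python
# Hypothetical candidate paths with two costs to minimise, e.g. (length, risk).
CANDIDATES = {"A": (0.0, 10.0), "B": (10.0, 0.0), "C": (6.0, 6.0)}

def weighted_sum(cost, w):
    """Weighted sum scalarization with weights (w, 1 - w)."""
    return w * cost[0] + (1.0 - w) * cost[1]

def weighted_max(cost, w):
    """Weighted maximum (Chebyshev-style) scalarization with weights (w, 1 - w)."""
    return max(w * cost[0], (1.0 - w) * cost[1])

def reachable_solutions(candidates, scalarize, steps=101):
    """Set of candidates that minimise the scalarized cost for some weight."""
    found = set()
    for i in range(steps):
        w = i / (steps - 1)
        best = min(candidates, key=lambda name: scalarize(candidates[name], w))
        found.add(best)
    return found

sum_solutions = reachable_solutions(CANDIDATES, weighted_sum)
max_solutions = reachable_solutions(CANDIDATES, weighted_max)
print(sorted(sum_solutions), sorted(max_solutions))
```

For every weight, the weighted sum scores C at 6 while the better of A and B scores at most 5, so C is never returned. The weighted maximum returns C at equal weights, where A and B score 5 and C scores 3.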

Abstract:
Visual navigation is fundamental to autonomous systems, yet generating reliable trajectories in cluttered and uncertain environments remains a core challenge. Recent generative models promise end-to-end synthesis, but their reliance on unstructured noise priors often yields unsafe, inefficient, or unimodal plans that cannot meet real-time requirements. We propose StepNav, a novel framework that bridges this gap by introducing structured, multimodal trajectory priors derived from variational principles. StepNav first learns a geometry-aware success probability field to identify all feasible navigation corridors. These corridors are then used to construct an explicit, multi-modal mixture prior that initializes a conditional flow-matching process. This refinement is formulated as an optimal control problem with explicit smoothness and safety regularization. By replacing unstructured noise with physically-grounded candidates, StepNav generates safer and more efficient plans in significantly fewer steps. Experiments in both simulation and real-world benchmarks demonstrate consistent improvements in robustness, efficiency, and safety over state-of-the-art generative planners, advancing reliable trajectory generation for practical autonomous navigation. The code has been released at https://github.com/LuoXubo/StepNav.

Abstract:
Flight control for autonomous micro aerial vehicles (MAVs) is evolving from steady flight near equilibrium points toward more aggressive aerobatic maneuvers, such as flips, rolls, and Power Loop. Although reinforcement learning (RL) has shown great potential in these tasks, conventional RL methods often suffer from low data efficiency and limited generalization. This challenge becomes more pronounced in multi-task scenarios where a single policy is required to master multiple maneuvers. In this paper, we propose a novel end-to-end multi-task reinforcement learning framework, called GEAR (Geometric Equivariant Aerobatics Reinforcement), which fully exploits the inherent SO(2) rotational symmetry in MAV dynamics and explicitly incorporates this property into the policy network architecture. By integrating an equivariant actor network, FiLM-based task modulation, and a multi-head critic, GEAR achieves both efficiency and flexibility in learning diverse aerobatic maneuvers, enabling a data-efficient, robust, and unified framework for aerobatic control. GEAR attains a 98.85% success rate across various aerobatic tasks, significantly outperforming baseline methods. In real-world experiments, GEAR demonstrates stable execution of multiple maneuvers and the capability to combine basic motion primitives to complete complex aerobatics.

Abstract:
We present a closed-loop framework for autonomous raceline optimization that combines NURBS-based trajectory representation, CMA-ES global trajectory optimization, and controller-guided spatial feedback. Instead of treating tracking errors as transient disturbances, our method exploits them as informative signals of local track characteristics via a Kalman-inspired spatial update. This enables the construction of an adaptive, acceleration-based constraint map that iteratively refines trajectories toward near-optimal performance under spatially varying track and vehicle behavior. In simulation, our approach achieves a 17.38% lap time reduction compared to a controller parametrized with maximum static acceleration. On real hardware, tested with different tire compounds ranging from low to high friction, we obtain a 7.60% lap time improvement without explicitly parametrizing friction. This demonstrates robustness to changing grip conditions in real-world scenarios.
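
The abstract does not specify the form of the Kalman-inspired spatial update. As a rough sketch under stated assumptions, one could keep a scalar acceleration-limit estimate and variance per track segment and blend in a new observation with a Kalman gain. The segment indexing, the variable names, and the idea that an "observed achievable acceleration" is available as the measurement are all assumptions made for illustration.

```python
def spatial_update(estimate, variance, measurement, meas_variance):
    """One scalar Kalman-style measurement update.

    estimate, variance: current belief about the local acceleration limit
    measurement, meas_variance: new observation and its noise variance
    Returns the updated (estimate, variance).
    """
    gain = variance / (variance + meas_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance


def update_constraint_map(constraint_map, segment, measurement, meas_variance):
    """Update only the entry of the map for the given track segment.

    constraint_map: dict mapping segment index -> (estimate, variance)
    A new dict is returned; the input map is left unchanged.
    """
    updated = dict(constraint_map)
    estimate, variance = constraint_map[segment]
    updated[segment] = spatial_update(estimate, variance, measurement, meas_variance)
    return updated
```

With equal prior and measurement variances the gain is 0.5, so the estimate moves halfway toward the observation and the variance halves. Segments that are never observed keep their prior.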

Abstract:
Safe autonomy is a critical requirement and a key enabler for robots to operate in complex environments. Control barrier functions and safe motion corridors are two widely used but distinct safety methods, functional and geometric, respectively, for planning and control. Control barrier functions filter control inputs to limit the decay rate of safety, whereas safe motion corridors are geometrically constructed to define a local safe zone around the system state. This paper introduces a new notion of control barrier corridors, unifying these two approaches by converting control barrier functions into local safe goal regions for reference goal selection in feedback control systems. We show, with examples on fully actuated systems, kinematic unicycles, and linear output regulation systems, that individual state safety can be extended locally over control barrier corridors for convex barrier functions, provided the control convergence rate matches the barrier decay rate. Such safe control barrier corridors enable safely reachable persistent goal selection over continuously changing barrier corridors during motion, which we demonstrate for verifiably safe path following in autonomous exploration of unknown environments.
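
To make the phrase "filter control inputs to limit the decay rate of safety" concrete, here is a minimal sketch of a standard control barrier function filter. It is not the control barrier corridor construction proposed in the paper. It assumes single-integrator dynamics x_dot = u and a barrier h(x) = ||x - c||^2 - r^2 that is nonnegative outside a circular obstacle. The safety condition grad_h(x) . u >= -alpha * h(x) is then a single linear constraint on u, and the closest safe input has a closed form. All names are hypothetical.

```python
def circle_barrier(x, center, radius):
    """Return h(x) = ||x - center||^2 - radius^2 and its gradient."""
    dx = [xi - ci for xi, ci in zip(x, center)]
    h_value = sum(d * d for d in dx) - radius ** 2
    grad_h = [2.0 * d for d in dx]
    return h_value, grad_h


def cbf_filter(u_nom, grad_h, h_value, alpha):
    """Minimally modify u_nom so that grad_h . u >= -alpha * h_value.

    Assumes x_dot = u (single integrator). If the nominal input already
    satisfies the constraint it is returned unchanged; otherwise it is
    projected onto the constraint boundary.
    """
    a_dot_u = sum(a * u for a, u in zip(grad_h, u_nom))
    bound = -alpha * h_value
    if a_dot_u >= bound:
        return tuple(u_nom)
    norm_sq = sum(a * a for a in grad_h)
    if norm_sq == 0.0:
        # Zero gradient: no input can change h to first order.
        return tuple(u_nom)
    scale = (bound - a_dot_u) / norm_sq
    return tuple(u + scale * a for u, a in zip(u_nom, grad_h))
```

For a state at (2, 0), a unit-radius obstacle at the origin, and alpha = 1, a nominal input of (-5, 0) heading straight at the obstacle is slowed to (-0.75, 0), which sits exactly on the constraint boundary.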

Abstract:
The automation of warehouse operations is crucial for improving productivity and reducing human exposure to hazardous environments. One operation frequently performed in warehouses is bin-packing where items need to be placed into containers, either for delivery to a customer, or for temporary storage in the warehouse. Whilst prior bin-packing works have largely been focused on packing items into empty containers and have adopted collision-free strategies, it is often the case that containers will already be partially filled with items, often in suboptimal arrangements due to transportation about a warehouse. This paper presents a contact-aware packing approach that exploits purposeful interactions with previously placed objects to create free space and enable successful placement of new items. This is achieved by using a contact-based multi-object trajectory optimizer within a model predictive controller, integrated with a physics-aware perception system that estimates object poses even during inevitable occlusions, and a method that suggests physically-feasible locations to place the object inside the container.

Abstract:
High-mix low-volume (HMLV) industrial assembly, common in small and medium-sized enterprises (SMEs), requires the same precision, safety, and reliability as high-volume automation while remaining flexible to product variation and environmental uncertainty. Current robotic systems struggle to meet these demands. Manual programming is brittle and costly to adapt, while learning-based methods suffer from poor sample efficiency and unsafe exploration in contact-rich tasks. To address this, we present SHaRe-RL, a reinforcement learning framework that leverages multiple sources of prior knowledge. By (i) structuring skills into manipulation primitives, (ii) incorporating human demonstrations and online corrections, and (iii) bounding interaction forces with per-axis compliance, SHaRe-RL enables efficient and safe online learning for long-horizon, contact-rich industrial assembly tasks. Experiments on the insertion of industrial Harting connector modules with 0.2 to 0.4 mm clearance show reliable learning within a practical wall-clock budget and improved performance over an unstructured human-in-the-loop RL baseline. We further show that the learned policy generalizes to previously unseen connector variants. Overall, our results show that process expertise alone can effectively guide real-world RL, making deployment safer, more robust, and economically viable for industrial assembly.

Abstract:
Assistive teleoperation enhances efficiency via shared control, yet inter-operator variability, stemming from diverse habits and expertise, induces highly heterogeneous trajectory distributions that undermine intent recognition stability. We present Adaptor, a few-shot framework for robust cross-operator intent recognition. Adaptor bridges the domain gap through two stages: (i) preprocessing, which models intent uncertainty by synthesizing trajectory perturbations via noise injection and performs geometry-aware keyframe extraction; and (ii) policy learning, which encodes the processed trajectories with an Intention Expert and fuses them with the pre-trained vision-language model context to condition an Action Expert for action generation. Experiments on real-world and simulated benchmarks demonstrate that Adaptor achieves state-of-the-art performance, improving success rates and efficiency over baselines. Moreover, the method exhibits low variance across operators with varying expertise, demonstrating robust cross-operator generalization.

Abstract:
Articulation modeling enables robots to learn joint parameters of articulated objects for effective manipulation, which can then be used downstream for skill learning or planning. Existing approaches often rely on prior knowledge about the objects, such as the number or type of joints. Some of these approaches also fail to recover occluded joints that are only revealed during interaction. Others require large numbers of multi-view images for every object, which is impractical in real-world settings. Furthermore, prior works neglect the order of manipulations, which is essential for many multi-DoF objects where one joint must be operated before another, such as a dishwasher. We introduce PokeNet, an end-to-end framework that estimates articulation models from a single human demonstration without prior object knowledge. Given a sequence of point cloud observations of a human manipulating an unknown object, PokeNet predicts joint parameters, infers manipulation order, and tracks joint states over time. PokeNet outperforms existing state-of-the-art methods, improving joint axis and state estimation accuracy by an average of over 27% across diverse objects, including novel and unseen categories. We demonstrate these gains in both simulation and real-world environments.

Abstract:
Bronchoscopy is a critical procedure for diagnosing and treating pulmonary diseases, but its safe and effective execution demands substantial operator training. Insufficient experience is associated with higher complication rates, including bleeding, pneumothorax, and bronchospasm. Existing assessment tools provide structured evaluations, yet they remain heavily reliant on subjective expert judgment and limited sensory feedback. To address this limitation, we propose a model-based framework for objective performance evaluation in navigational bronchoscopy. Our approach leverages pose data from electromagnetic (EM) trackers, routinely used in clinical navigation, and embeds nonholonomic kinematic constraints that characterize expert-like trajectories. Using this model and a Model Predictive Path Integral (MPPI) controller, we generate optimal reference trajectories and define error metrics that quantify deviations between operator-executed and model-predicted motions. We hypothesize that these deviations provide robust discriminative features for distinguishing between expert and novice performance. Experiments on a phantom lung dataset comprising 11 operators and 98 procedures demonstrate that the proposed metrics significantly separate skill levels, enabling the construction of an effective classifier for operator proficiency. This framework offers an interpretable, data-driven alternative to supervisor-dependent assessments and represents a step toward scalable, objective skill evaluation and transfer in bronchoscopy training and robotic platforms.

Abstract:
Navigating to out-of-sight targets from human instructions in unfamiliar environments is a core capability for service robots. Despite substantial progress, most approaches underutilize reusable, persistent memory, constraining performance in lifelong settings. Many are additionally limited to single-modality inputs and employ myopic greedy policies, which often induce inefficient back-and-forth maneuvers (BFMs). To address such limitations, we introduce SSMG-Nav, a framework for object navigation built on a Semantic Skeleton Memory Graph (SSMG) that consolidates past observations into a spatially aligned, persistent memory anchored by topological keypoints (e.g., junctions, room centers). SSMG clusters nearby entities into subgraphs, unifying entity- and space-level semantics to yield a compact set of candidate destinations. To support multimodal targets (images, objects, and text), we integrate a vision-language model (VLM). For each subgraph, a multimodal prompt synthesized from memory guides the VLM to infer a target belief over destinations. A long-horizon planner then trades off this belief against traversability costs to produce a visit sequence that minimizes expected path length, thereby reducing backtracking. Extensive experiments on challenging lifelong benchmarks and standard ObjectNav benchmarks demonstrate that, compared to strong baselines, our method achieves higher success rates and greater path efficiency, validating the effectiveness of SSMG-Nav.

Abstract:
Controlling symmetric objects is an indispensable but challenging task in robotic manipulation. Mainstream perception-action frameworks rely on accurate 6D pose estimation to guide the controller. However, the majority of existing 6D pose estimation methods for symmetric objects are designed to output a single pose, which can flicker between multiple equivalent solutions across consecutive frames, leading to instability in the control loop. While some approaches can output multiple hypotheses to represent the ambiguity, these methods generally cannot be model-free and generalize strongly at the same time. In this paper, we reformulate the problem from a multi-solution task in pose space into an end-to-end visual servo task that admits a unique optimal solution. We propose a visual servo framework, Sym-Servo. Sym-Servo uses a joint learning mechanism where a deterministic policy is trained with a diffusion-based generator to encourage the shared vision encoder to learn a symmetry-aware representation, and the policy is then refined via reinforcement and self-imitation learning to produce an efficient and stable final policy. We validate Sym-Servo with simulations and real-world experiments, demonstrating its efficiency and generalization in controlling symmetric objects in a model-free manner.

Abstract:
Current 3D scene graph generation (3DSGG) approaches heavily rely on a single-agent assumption and small-scale environments, exhibiting limited scalability to real-world scenarios. In this work, we introduce the Multi-Agent 3D Scene Graph Generation (MA3DSG) model, the first framework designed to tackle this scalability challenge using multiple agents. We develop a training-free graph alignment algorithm that efficiently merges partial query graphs from individual agents into a unified global scene graph. Leveraging extensive analysis and empirical insights, our approach enables conventional single-agent systems to operate collaboratively without requiring any learnable parameters. To rigorously evaluate 3DSGG performance, we propose MA3DSG-Bench, a benchmark that supports diverse agent configurations, domain sizes, and environmental conditions, providing a more general and extensible evaluation framework. This work lays a solid foundation for scalable, multi-agent 3DSGG research.

Abstract:
Previous fast-slow system architectures demonstrated that pairing a reactive E2E planner with a deliberative vision-language model (VLM) can address long-tail scenarios. However, these dual-system models that query the slow module at fixed intervals are computationally inefficient and introduce unnecessary latency during normal operation. To bridge this gap, we introduce FASIONAD, an adaptive fast-slow framework for autonomous driving that selectively integrates E2E planning and VLM reasoning. A lightweight fast planner manages general control, while a slow reasoner is activated only when a Laplace-based uncertainty gate detects a change in uncertainty. Rather than overriding control, the VLM provides concise planning states and high-level plans. These inform the planner through an information bottleneck and high-level action guidance, enhancing interpretability and safety. Evaluated on the nuScenes, Bench2Drive, and CARLA Town05 closed-loop benchmarks, FASIONAD lowers the average trajectory error by 6.7% and the collision rate by 28.1% compared with strong E2E baselines, while also markedly reducing computational overhead relative to always-on fast-slow dual systems. These results demonstrate that adaptive fast-slow fusion is a practical route to safer, more reliable, and more efficient autonomous driving.

Abstract:
Robots must adapt to diverse human instructions and operate safely in unstructured, open-world environments. Recent vision-language models (VLMs) offer strong priors for grounding language and perception, but remain difficult to steer for navigation due to differences in action spaces and pretraining objectives that hamper transferability to robotics tasks. Towards addressing this, we introduce Ventura, a vision-language navigation system that finetunes internet-pretrained image diffusion models for path planning. Instead of directly predicting low-level actions, Ventura generates a path mask (i.e., a visual plan) in image space that captures fine-grained, context-aware navigation behaviors. A lightweight behavior-cloning policy grounds these visual plans into executable trajectories, yielding an interface that follows natural language instructions to generate diverse robot behaviors. To scale training, we supervise on path masks derived from self-supervised tracking models paired with VLM-augmented captions, avoiding manual pixel-level annotation or highly engineered data collection setups. In extensive real-world evaluations, Ventura outperforms state-of-the-art foundation model baselines on object reaching, obstacle avoidance, and terrain preference tasks, improving success rates by 33% and reducing collisions by 54% across both seen and unseen scenarios. Notably, we find that Ventura generalizes to unseen combinations of distinct tasks, revealing emergent compositional capabilities. Videos, code, and additional materials: https://venturapath.github.io.

Abstract:
Millimeter-wave radar provides perception robust to fog, smoke, dust, and low light, making it attractive for size, weight, and power constrained robotic platforms. Current radar imaging methods, however, rely on synthetic aperture or multi-frame aggregation to improve resolution, which is impractical for small aerial, inspection, or wearable systems. We present RadarSFD, a conditional latent diffusion framework that reconstructs dense LiDAR-like point clouds from a single radar frame without motion or SAR. Our approach transfers geometric priors from a pretrained monocular depth estimator into the diffusion backbone, anchors them to radar inputs via channel-wise latent concatenation, and regularizes outputs with a dual-space objective combining latent and pixel-space losses. On the RadarHD benchmark, RadarSFD achieves state-of-the-art performance against baseline models. Qualitative results show recovery of fine walls and narrow gaps, and experiments across new environments confirm strong generalization. Ablation studies highlight the importance of pretrained initialization, radar BEV conditioning, and the dual-space loss. Together, these results establish the first practical single-frame, no-SAR mmWave radar pipeline for dense point cloud perception in compact robotic systems. The project page is available at https://phi-lab-rice.github.io/RadarSFD/

Abstract:
Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, troublesome gaps remain between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, the high computational cost and massive scale of VLMs hinder inference speed and real-world deployment. To bridge these gaps, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing (e.g., LLaVA-1.5), FastDrive understands structured and concise descriptions and generates machine-friendly driving decisions with high efficiency. Extensive experiments show that FastDrive achieves competitive performance on the structured dataset, with approximately 20% accuracy improvement on decision-making tasks, while surpassing the massive-parameter baseline in inference speed with over 10x speedup. Additionally, ablation studies examine the impact of scene annotations (e.g., weather, time of day), demonstrating their importance for decision-making tasks in autonomous driving.

Abstract:
Localizing ground robots against aerial imagery provides a critical capability for autonomous navigation, especially in environments where GPS is unreliable or unavailable. This task is challenging due to large viewpoint differences and substantial environmental variability. Most prior methods localize each frame independently, using either global-descriptor retrieval or spatial feature alignment, which leaves them vulnerable to ambiguity and multi-modal pose hypotheses. While sequential reasoning can mitigate this uncertainty, adapting existing per-frame pipelines for sequential use introduces unfavorable trade-offs among accuracy, memory, and computation that limit their practical deployment. We propose BEV-Patch-PF, a vision-only, GPS-free sequential geo-localization system that integrates particle filtering with learned bird's-eye-view (BEV) and aerial feature maps. For each 3-DoF particle pose hypothesis, we crop the corresponding patch from an aerial feature map computed from a local aerial image centered on the predicted pose. The resulting BEV-aerial feature match defines a per-particle log-likelihood for particle-filter updates. In addition, we learn a frame-level uncertainty estimate that adaptively flattens the observation likelihood for unreliable observations, preventing overconfident particle collapse in ambiguous regions. On two real-world off-road datasets, our method achieves 9.7× lower absolute trajectory error (ATE) on seen routes and 6.6× lower ATE on unseen routes than a retrieval-based baseline, while remaining robust under partial canopy cover and shadowing. The system runs in real time at 10 Hz on an NVIDIA Tesla T4, enabling practical robot deployment.
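
The idea of flattening the observation likelihood can be sketched as a tempered particle-filter weight update. This is a minimal illustration, not the BEV-Patch-PF implementation: it assumes the learned frame-level uncertainty has already been mapped to a temperature in [0, 1] (that mapping is hypothetical) and that per-particle log-likelihoods are given.

```python
import math


def update_weights(log_weights, log_likelihoods, temperature):
    """Tempered particle-filter update in log space.

    temperature = 1.0 applies the full observation likelihood;
    values closer to 0.0 flatten it, so an unreliable frame
    changes the particle weights less.
    Returns normalized log-weights.
    """
    unnormalized = [lw + temperature * ll
                    for lw, ll in zip(log_weights, log_likelihoods)]
    # Log-sum-exp normalization for numerical stability.
    peak = max(unnormalized)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in unnormalized))
    return [v - log_norm for v in unnormalized]


def effective_sample_size(log_weights):
    """ESS = 1 / sum(w_i^2) for normalized weights w_i."""
    return 1.0 / sum(math.exp(2.0 * lw) for lw in log_weights)
```

Comparing the effective sample size after a full update and a tempered update on the same log-likelihoods shows the intended effect: the tempered update keeps more particles alive, which is one way to avoid premature collapse in ambiguous regions.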

Abstract:
Resolving vertical gradients of atmospheric variables in agroecosystems is essential for understanding surface-atmosphere exchange. It is also critical for emerging carbon monitoring frameworks. Existing methods, such as eddy covariance towers and satellite remote sensing, provide observations with limited spatial resolution, leaving fine-scale structure undersampled. This work introduces TetherBot, a tethered robotic profiler integrated into the Tethered Aircraft Uncrewed System. The robot traverses a hoisted power tether, enabling persistent vertical profiling with synchronized sensing and telemetry. Field experiments across a 40 m transect demonstrated reliable operation. Barometric pressure provided consistent altitude, temperature resolved subtle stratification, and relative humidity revealed surface-layer variability. Carbon dioxide measurements were dominated by sensor noise, highlighting the need for higher-fidelity analyzers. These results demonstrate the feasibility of tethered robotic profiling as a viable approach for atmospheric monitoring. They establish a foundation for future multi-robot arrays and high-precision flux applications.

Abstract:
Social robots have demonstrated great potential in various domains. Recent advancements in Large Language Models (LLMs) have expanded the conversational capabilities of these robots, enabling more personalized user interactions. However, current systems primarily focus on behavior or task personalization, or they require extensive pre-training and fine-tuning to achieve language personalization. This paper introduces adaptive prompting, a formal framework for real-time linguistic personalization in LLM-driven robots. By structuring interaction as a sequence of interdependent prompts, adaptive prompting enables controllable, efficient, and scalable personalization without additional model training. To validate our approach, we present a system that integrates adaptive prompting in a social robot to dynamically adapt to user attributes and preferences to provide personalized productivity coaching for college students with Attention Deficit Hyperactivity Disorder (ADHD). Our findings demonstrate that personalized coaching via adaptive prompting improves user engagement and overall coaching effectiveness compared to non-personalized coaching. This indicates the effectiveness of the proposed approach for user adaptation and personalization in social robots, particularly in the aforementioned contexts.

Abstract:
We propose an automated genomic interpretation module that transforms raw DNA sequences into actionable, interpretable decisions suitable for integration into medical automation and robotic systems. Our framework combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), enforcing predictions to flow through biologically meaningful concepts such as GC content, CpG density, and k-mer motifs. To enhance reliability, we incorporate concept fidelity supervision, prior-consistency alignment, KL distribution matching, and uncertainty calibration. Beyond accurate classification of HIV subtypes across both in-house and LANL datasets, our module delivers interpretable evidence that can be directly validated against biological priors. A cost-aware recommendation layer further translates predictive outputs into decision policies that balance accuracy, calibration, and clinical utility, reducing unnecessary retests and improving efficiency. Extensive experiments demonstrate that the proposed system achieves state-of-the-art classification performance, superior concept prediction fidelity, and more favorable cost-benefit trade-offs compared to existing baselines. By bridging the gap between interpretable genomic modeling and automated decision-making, this work establishes a reliable foundation for robotic and clinical automation in genomic medicine.

Abstract:
Vision-Language-Action (VLA) models have demonstrated robust performance across diverse robotic tasks. However, their high memory and computational demands often limit real-time deployment. While existing model compression techniques reduce the parameter footprint, they often degrade 3D spatial reasoning and scene layout understanding. This work introduces RetoVLA, an architecture designed to maintain spatial awareness in lightweight models by repurposing Register Tokens, learnable parameters originally introduced to mitigate attention artifacts in Vision Transformers. While these tokens are generally discarded once used, we repurpose them for their dense representation of global spatial context. RetoVLA integrates these recycled tokens directly into the action-planning module through a dedicated spatial context injection path. Our proposed design enables the recovery of global context without increasing the total parameter count. Real-world experiments using a 7-DOF manipulator show a 17.1 percentage-point improvement in average success rates over the baseline. Our results demonstrate that leveraging internal register tokens provides a highly effective mechanism for developing efficient, spatially-aware robotic agents. A video demonstration is available at: https://youtu.be/2CseBR-snZg

Abstract:
Recent advances in neural representations have shown great promise for enabling high-fidelity dense mapping in robotics. Given the inherently dynamic nature of real-world environments, many studies have attempted to learn static scene representations from dynamic observations. However, existing methods often fail to remove subtly moving objects and struggle to accurately recover occluded static backgrounds, which leads to critical limitations in practice. Furthermore, when static neural maps are used for localization, dynamic content in query images must be handled effectively. To overcome these challenges, we propose a static neural mapping framework that is robust to diverse dynamic environments and capable of processing dynamic content during localization. We evaluated our approach through extensive experiments on both public and in-house datasets. Our method improves both dynamic object removal and localization robustness under dynamic conditions, and constitutes a significant step toward resilient robot navigation in real-world environments.

Abstract:
We address the collaborative path planning problem for multi-agent systems with heterogeneous capabilities, subject to uncertainty and operating under complex task specifications. Conventional Probabilistic Signal Temporal Logic (PrSTL) frameworks exhibit significant limitations in describing multi-agent collaborative tasks with temporally cumulative properties. To address this challenge, we extend the PrSTL framework by introducing a Temporal Collective Counting Operator to characterize such spatio-temporal specifications. We then formulate the multi-agent collaborative planning problem under dynamics uncertainty as a Mixed-Integer Second-Order Cone Program. This formulation leverages PrSTL to specify tasks with cumulative temporal properties, while employing Polynomial Chaos Expansion to propagate uncertainty. Finally, we propose a constraint relaxation mechanism to address the conservatism introduced by formula transformations and the approximation of probabilistic constraints.

Abstract:
Visual simultaneous localization and mapping (SLAM) is of great significance for flapping-wing flying robots (FWFRs) to enhance their autonomous navigation capabilities in complex environments. However, during the motion of FWFRs, there are intense image vibrations accompanied by significant illumination changes, which would prevent existing visual SLAM algorithms from being directly applied to FWFRs. Therefore, this paper proposes a modified ORB-SLAM3 algorithm called FW-ORB-SLAM for FWFRs. First, we adopt the fast Fourier transform (FFT) method to map the original images to the frequency domain. Then, based on the characteristic flapping motion of the FWFR, we decompose the frequency domain jitter to obtain stabilized images. Moreover, to mitigate the impact of illumination variations on feature point tracking during outdoor flight, a local adaptive contrast enhancement method is proposed, which enhances the stability of feature point tracking and augments the robustness of the SLAM algorithm. Finally, flight experiments carried out using our self-developed FWFR named U-Dove demonstrate that FW-ORB-SLAM outperforms the state-of-the-art ORB-SLAM3 algorithm, which provides insights into performing vision-based SLAM tasks for the FWFR.

Abstract:
Learning control policies in simulation enables rapid, safe, and cost-effective development of advanced robotic capabilities. However, transferring these policies to the real world remains difficult due to the sim-to-real gap, where unmodeled dynamics and environmental disturbances can degrade policy performance. Existing approaches, such as domain randomization and Real2Sim2Real pipelines, can improve policy robustness, but either struggle under out-of-distribution conditions or require costly offline retraining. In this work, we approach these problems from a different perspective. Instead of relying on diverse training conditions before deployment, we focus on rapidly adapting the learned policy in the real world in an online fashion. To achieve this, we propose a novel online adaptive learning framework that unifies residual dynamics learning with real-time policy adaptation inside a differentiable simulation. Starting from a simple dynamics model, our framework continuously refines the model using real-world data to capture unmodeled effects and disturbances, such as payload changes and wind. The refined dynamics model is embedded in a differentiable simulation framework, enabling gradient backpropagation through the dynamics and thus rapid, sample-efficient policy updates. All components of our system are designed for rapid adaptation, enabling the policy to adjust to unseen disturbances within 5 seconds of training. We validate the approach on agile quadrotor control under various disturbances in both simulation and the real world. Our framework reduces hovering error by up to 81% compared to L1-MPC and 55% compared to DATT, while also demonstrating robustness in vision-based control without explicit state estimation.

Abstract:
In open-ended task settings, the ability of a robot to execute diverse tasks accurately by following language instructions is critical. Methods based on traditional imitation learning typically depend on extensive expert demonstrations and often struggle to generalize in the case of unseen scenarios or tasks. Recently, approaches leveraging large foundational models have demonstrated improved generalization by enhancing task comprehension in novel scenarios based on the intrinsic world knowledge embedded in these models. However, these methods rely on predefined motion primitives and lack a detailed understanding of the environment, which is essential for successful execution. Herein we introduce Task-Aware Robot Affordance-Centric Diffusion Policy (TARAD), a novel framework for robot manipulation. TARAD leverages LLMs and VLMs to perform high-level planning from natural language instructions and extract affordance information from the robot's observations. A heuristic motion planner is employed for low-level motion planning, enabling zero-shot trajectory synthesis and the fully automatic generation of a dataset with language labels and affordances. By incorporating affordances into the observation space, our approach integrates the intrinsic commonsense and reasoning capabilities of foundation models into imitation learning, enabling the training of an affordance-centric, multi-task 3D diffusion policy. Empirical evaluations in both the RLBench simulated environments and real-world experiments with UR5e demonstrate that TARAD effectively combines the precise control of imitation learning with the strong generalization capabilities of foundation models, all without relying on expert demonstrations or predefined motion primitives.

Abstract:
This paper explores the challenge of optimal routing for a mobile robot navigating a dynamic and shared human environment. The primary goal is to minimize the risk of performance degradation during motion, such as delays in completing tasks due to the need for safe or acceptable human-robot encounters. The problem is formulated as a graph whose edge costs become progressively known only as the robot moves through the environment. We model this problem as a Markov Decision Process (MDP), enabling an offline evaluation of the expected cost of alternative routes based on statistical information about human spatial distributions and possible observations at each intersection. This compact state representation scales linearly with the number of intersections in the map. Since the memoryless property of the MDP may induce loops during online execution, we compute an offline policy and introduce an online policy adaptation mechanism to prevent cyclic behaviors. Extensive simulations across environments of different complexity, and using data collected from real-world experiments, demonstrate that our approach outperforms reactive and advanced state-of-the-art planners in terms of either performance or scalability.

Abstract:
Multilink transformable rotorcraft demonstrate exceptional flexibility when navigating confined spaces, yet face critical challenges including time-varying center of gravity, body misalignment, and the absence of a unified control strategy during dynamic reconfiguration, which severely restrict motion continuity and operational capability. To address these limitations, we propose an H-shaped multi-modal transformable rotorcraft. Its novelty lies in the co-design of a specialized mechanical architecture with 6 controllable degrees of freedom (CDOF) and an integrated control allocation framework, enabling the aircraft to achieve stable and continuous aerial transitions between high-passability, high fault-tolerance, and high-torque configurations. A dynamic PID control law based on motion characteristic values ensures system robustness against uncertainties, while a novel competency-based power distribution strategy uniquely constrains propeller thrust and lever arms to generate feasible control commands for each configuration. Experimental results demonstrate that our platform successfully overcomes stability challenges, maintaining positional deviation within 0.04 m during traversal through constrained spaces. The aircraft can reduce its footprint by up to 55.8%, sustain flight under single-propeller failure, and switch to a fault-tolerant configuration within 1.2 seconds, while exhibiting high-torque output capability sufficient for rotational operations. This work provides a comprehensive ontological platform that effectively bridges the technological gap between transformable reconfiguration and fault-tolerant control, enabling multi-scenario operational capabilities.

Abstract:
LiDAR odometry, fused with inertial measurement units (IMUs), is an essential task in robot navigation. Unlike mainstream methods, which compensate for the motion distortion of LiDAR data using high-frequency inertial sensors, this paper handles the distortion with a continuous-time trajectory representation and achieves competitive performance against the state of the art. We propose a compact LiDAR odometry framework with an adaptive non-uniform B-spline trajectory representation that formulates odometry as a continuous-time estimation problem. We deploy point-to-plane registration and pseudo-velocity smoothing constraints to fully utilize the geometric and kinematic information of odometry. For faster convergence of the optimization, analytical Jacobians of the constraints are derived to solve the non-linear least-squares minimization. For a more efficient B-spline representation, an adaptive knot spacing technique is proposed to adjust the time interval between the control poses of the spline. Extensive experiments on public and realistic datasets demonstrate the validity and efficiency of our system compared with other LiDAR or LiDAR-inertial methods.

Abstract:
Learning from demonstration (LfD) enables robots to learn expert skills from human demonstrations. Recently, LfD has been developed for learning and performing skills in contact-rich tasks. However, task performance has not been generalized to unknown poses in contact-rich tasks. In this paper, we propose a teleoperation-based LfD framework for performing contact-rich tasks in unknown poses. Expert demonstrations are collected via a bilateral teleoperation system, with an orientation synchronization algorithm aiding intuitive manipulation. From demonstrations, position and wrench profiles are recorded. Task trajectories are learned using dynamic movement primitives (DMP), while strategy learning allocates input and compliance spaces based on affordance templates to adapt motion during contact. By combining trajectory and strategy learning, the framework successfully reproduces manipulation behaviors in novel configurations. Experiments on turning-valve and peg-in-hole insertion validate the method, showing improved success rates and robustness to pose variations.

Abstract:
Accurate traversability assessment is critical for mobile robot motion planning, yet sensor occlusions and model limitations often compromise cost map reliability. Therefore, analyzing spatial uncertainty is essential for robust risk management. We propose a novel Maximum Entropy Deep Inverse Reinforcement Learning (MEDIRL) framework that learns a traversability cost map while explicitly disentangling aleatoric and epistemic uncertainties. Aleatoric uncertainty is captured via latent sampling in a Conditional Variational Autoencoder, while epistemic uncertainty is estimated using a decoder ensemble. For kinematic fidelity, we introduce efficient continuous-state rollouts utilizing precomputed transition grids and bilinear interpolation. Fusing camera and LiDAR features, our model achieves stable convergence guided by a novel margin loss. Results demonstrate that learned state visitation frequencies match expert trajectories, and the decomposed uncertainties effectively identify high-risk terrains, providing a crucial foundation for safer autonomous navigation.

Abstract:
In robotic skill acquisition, rapid policy learning remains challenging due to high-dimensional state-action spaces and inefficient exploration in the early stage of training. Although the pre-trained OpenVLA model exhibits cross-task generalization and can generate goal-directed actions for unseen tasks under suitable prompts, its direct application to novel manipulation tasks remains limited, while full fine-tuning is computationally expensive. To address this issue, we propose a hierarchical framework that combines OpenVLA with reinforcement learning for efficient skill acquisition. Specifically, OpenVLA is used to generate diverse task-related prior trajectories through prompt engineering, and reinforcement learning leverages these priors to fit local dynamics and constrain policy exploration. In this way, the proposed method improves adaptation efficiency and accelerates policy learning on new tasks. We evaluate the framework on multiple manipulation tasks in the LIBERO environment.

Abstract:
In model-based control, dynamics models are typically trained by minimizing open-loop prediction errors uniformly across all states. However, due to finite model capacity, this misallocates representational power, as not all prediction errors impact the downstream closed-loop performance equally. In this extended abstract, we propose a task-aware training methodology for a prediction model used in the context of Model Predictive Control (MPC). By extracting analytical sensitivities via differentiable MPC, we construct a loss function that weights multi-step dynamics model prediction errors based on their impact on the closed-loop task cost. Experimental results on a simulated 7DoF manipulator demonstrate that our sensitivity-weighted loss significantly improves closed-loop tracking performance compared to standard Mean Squared Error (MSE) or variance-based state standardization.
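The weighting idea can be illustrated with a minimal sketch. It assumes the per-step, per-state sensitivity weights are already available; the abstract obtains them from differentiable MPC, which is not reproduced here. The function name, argument names, and array shapes are hypothetical, and the normalisation by the weight sum is an illustrative choice rather than the authors' stated formulation.

```python
import numpy as np

def sensitivity_weighted_loss(pred, target, weights):
    """Weighted multi-step prediction loss (illustrative sketch).

    pred, target: arrays of shape (horizon, state_dim)
    weights: non-negative array of the same shape; stands in for the
             task sensitivities, which are assumed to be given.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    weights = np.asarray(weights, dtype=float)
    sq_err = (pred - target) ** 2
    # Dividing by the weight sum means uniform weights recover the plain MSE.
    return float(np.sum(weights * sq_err) / np.sum(weights))
```

With uniform weights the function reduces to the standard MSE, which makes it easy to compare against the baseline loss the abstract mentions.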

Abstract:
We propose a multimodal integration framework to enhance the precision of Vision-Language-Action (VLA) models in contact-rich robotic tasks. Although visual perception is essential for task grounding, it often lacks the force awareness required for high-precision alignment and insertion. To address this limitation, we leverage Feature-wise Linear Modulation (FiLM) to condition intermediate visual representations on 6-axis Force/Torque (F/T) data. This lightweight fusion strategy allows the model to modulate its action predictions based on real-time physical resistance without incurring significant computational overhead. Experimental results on a UR5e manipulator demonstrate that the proposed F/T-Vision integration enhances contact stability and precision in demanding manipulation tasks compared with vision-only baselines.
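As a rough illustration of FiLM conditioning, the sketch below scales and shifts each channel of a visual feature map using parameters computed from a 6-axis F/T reading. The single linear layer that produces the scale and shift, and all names and shapes, are assumptions made for this example; the abstract does not specify how the modulation parameters are generated.

```python
import numpy as np

def film(features, ft_reading, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma * features + beta, per channel.

    features:   (channels, height, width) visual feature map
    ft_reading: (6,) force/torque vector
    w_gamma, w_beta: (channels, 6) hypothetical linear-layer weights
    b_gamma, b_beta: (channels,) hypothetical linear-layer biases
    """
    features = np.asarray(features, dtype=float)
    ft_reading = np.asarray(ft_reading, dtype=float)
    gamma = w_gamma @ ft_reading + b_gamma  # per-channel scale
    beta = w_beta @ ft_reading + b_beta     # per-channel shift
    # Broadcast the per-channel parameters over the spatial dimensions.
    return gamma[:, None, None] * features + beta[:, None, None]
```

Initialising the scale bias to one and everything else to zero makes the layer an identity map, so the conditioned model starts out behaving like the vision-only one.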

Abstract:
Long-horizon routing tasks of deformable linear objects (DLOs), such as cables and ropes, are common in industrial assembly lines and everyday life. These tasks are particularly challenging because they require robots to manipulate DLOs with long-horizon planning and reliable skill execution. Successfully completing such tasks demands adapting to their nonlinear dynamics, decomposing abstract routing goals, and generating multi-step plans composed of multiple skills, all of which require accurate high-level reasoning during execution. In this paper, we propose a fully autonomous hierarchical framework for solving challenging DLO routing tasks. Given an implicit or explicit routing goal expressed in language, our framework leverages vision-language models (VLMs) for in-context high-level reasoning to synthesize feasible plans, which are then executed by low-level skills trained via reinforcement learning. To improve robustness over long horizons, we further introduce a failure recovery mechanism that reorients the DLO into insertion-feasible states. Our approach generalizes to diverse scenes involving object attributes, spatial descriptions, implicit language commands, and extended 5-clip settings. It achieves an overall success rate of 92% across long-horizon routing scenarios. Please refer to our project page: https://icra2026-dloroute.github.io/DLORoute/

Abstract:
Navigational bronchoscopy is critical for pulmonary interventions, yet current platforms depend heavily on pre-operative CT or external sensors, limiting their use in critical care and resource-constrained settings. Vision-only navigation offers a scalable alternative, but conventional visual odometry (VO) struggles with texture-poor airway images, specularities, and the vanishing-point singularities of tubular anatomy, leading to frequent tracking failures and drift. We present a geometry-aware VO framework that explicitly leverages vanishing-point cues from airway lumens. Detected lumens are back-projected to 3D rays, whose weighted fusion yields a stable forward heading even when parallax cues are absent. This heading, together with looming-based velocity estimates, is fused with noisy VO outputs using a bespoke high-gain observer that enforces airway-following priors and rejects drift. We validate the method on ex-vivo mechanically ventilated human lungs using electromagnetic tracking as ground truth. Compared to state-of-the-art pipelines (ORB-SLAM2, LoFTR-VO, DPVO), our approach reduces absolute trajectory error by more than 50% and achieves the lowest relative pose error across all test sequences.
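The back-projection and weighted fusion step can be sketched as follows, assuming a pinhole camera with intrinsic matrix K and one pixel centre per detected lumen. The weighting scheme is not given in the abstract, so the weights here are a hypothetical input (for example, detection confidence), and the function name is illustrative.

```python
import numpy as np

def fuse_lumen_rays(pixels, weights, K):
    """Back-project lumen centres to unit rays and fuse them into one heading.

    pixels:  (n, 2) pixel coordinates (u, v) of detected lumen centres
    weights: (n,) non-negative weights (hypothetical, e.g. confidence)
    K:       (3, 3) pinhole intrinsic matrix
    """
    pixels = np.asarray(pixels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])  # (n, 3)
    # Each ray is K^-1 [u, v, 1]^T, expressed in the camera frame.
    rays = np.linalg.solve(K, homog.T).T
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)
    heading = weights @ rays  # weighted sum of unit rays
    return heading / np.linalg.norm(heading)
```

Two lumens placed symmetrically about the principal point with equal weights yield a heading along the optical axis, which is a quick sanity check on the geometry.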

Abstract:
In dynamic environments such as warehouses, hospitals, and homes, robots must seamlessly transition between gross motion and precise manipulations to complete complex tasks. However, current Vision-Language-Action (VLA) frameworks, largely adapted from pre-trained Vision-Language Models (VLMs), often struggle to reconcile general task adaptability with the specialized precision required for intricate manipulation. To address this challenge, we propose DAM-VLA, a dynamic action model-based VLA framework. DAM-VLA integrates VLM reasoning with diffusion-based action models specialized for arm and gripper control. Specifically, it introduces (i) an action routing mechanism, using task-specific visual and linguistic cues to select appropriate action models (e.g., arm movement or gripper manipulation), (ii) a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and (iii) a dual-scale action weighting mechanism that enables dynamic coordination between the arm-movement and gripper-manipulation models. Across extensive evaluations, DAM-VLA achieves superior success rates compared to state-of-the-art VLA methods in simulated (SIMPLER, FurnitureBench) and real-world settings, showing robust generalization from standard pick-and-place to demanding long-horizon and contact-rich tasks.

Abstract:
While vacuum-based bending actuation offers benefits such as safety and compactness in soft robotics, it is often overlooked due to its limited actuation pressure, which restricts both bending angle and force output. This study presents a crease-free, origami-inspired vacuum bending actuator that advances both state-of-the-art vacuum bending actuators and traditional origami deformation principles by introducing orderly self-folding through optimized stiffness distribution. Achieved through the finite element method (FEM), this design provides several advantages: (i) Self-folding allows for high bending angles (up to 138°) in a very compact form. (ii) The crease-free design facilitates 3D printing from a single soft material using a consumer-level fused filament fabrication (FFF) printer, specifically thermoplastic polyurethane (TPU) with a Shore hardness of 60A, potentially offering higher flexibility and durability. (iii) The compact configuration enables modular design, supporting reconfiguration as demonstrated in adaptable locomotion soft robots. (iv) The large bending angles allow the actuator to wrap around objects, offering extensive contact compared to other designs. This capability, c

Abstract:
Shape-morphing robots have shown benefits in industrial grasping. We propose form-flexible grippers for adaptive grasping. The design is based on the hybrid jamming and suction mechanism, which deforms to handle objects that vary significantly in size from the aperture, including both larger and smaller parts. Compared with traditional grippers, the gripper achieves self-closing to form an airtight seal. Under a vacuum, a wide range of grasping is realized through the passive morphing mechanism at the interface that harmonizes pressure and flow rate. This hybrid gripper showcases the capability to securely grasp an egg, as small as 54.5% of its aperture, while achieving a maximum load-to-mass ratio of 94.3.

Abstract:
Optical force-induced assembly is a promising yet scarcely explored approach for developing functional tools and objects at the microscale, with a wide range of potential applications. Our previous work was the first to investigate the manipulation of these assemblies in the XY plane. Here, we expand on these techniques by systematically exploring optical trap manipulation with the addition of Z-axis control. Manipulation of the Z-axis is referred to as axial displacement and is a viable approach for actively manipulating the assembly morphology. Experiments are conducted for the first time to explore and detail the response of the assembly during active 3D trap manipulation, informing the development of an autonomous control algorithm over the 2D area of the assembly during motion. This control presents techniques to increase assembly stability or alter the area of the assembly for tasks such as passing through constrictions. This work aims to develop the control techniques required to create a unique micromanufacturing approach inspired by the Kilobot thousand robot swarm.

Abstract:
Dielectric elastomer actuators (DEAs), also known as artificial muscles, have been widely developed for soft locomotion robots. With their compliant skeletons and miniaturized dimensions, they are well suited for narrow-space inspection. In this work, we propose a novel low-profile (1.1 mm) and lightweight (1.8 g) bi-stable in-plane DEA (Bi-DEA) constructed by supporting a dielectric elastomer on a flat bi-stable mechanism. It has an amplified displacement and output force compared with the in-plane DEA (I-DEA) without the bi-stable mechanism. The Bi-DEA is then applied to a thin soft robot, using three electrostatic adhesive pads (EA-Pads) as anchoring elements. This robot is capable of crawling and climbing to access millimetre-scale narrow gaps. Theoretical models of the bi-stable mechanism and the DEA are presented. The enhanced performance of the Bi-DEA induced by the mechanism is experimentally validated. The EA-Pads provide adhesion between the actuator and the locomotion substrate, allowing crawling and climbing on various surfaces, i.e., paper and acrylic. The thin soft robot has been demonstrated to be capable of crawling through a 4 mm narrow gap at a speed of up to 3.3 mm/s (0.07 body lengths per second and 2.78 body thicknesses per second).

Abstract:
This paper introduces a new redundantly actuated parallel robot capable of sensorless physical human-robot interaction. Three 3-DoF legs are attached to an end-effector platform through spherical joints. This architecture alleviates most parallel singularities, thus enabling a very large workspace. The use of quasi-direct-drive actuators yields a backdrivable robot with very low impedance, since all actuators are fixed to the base. Furthermore, since the actuators are force/torque controlled, internal antagonistic forces can be controlled without additional sensing devices. Experiments are carried out on a physical prototype and validate the large workspace and physical interaction capabilities of the robot.

Abstract:
Uncertainty quantification is crucial for autonomous systems, enabling safe and robust decision making in tasks ranging from active perception to robotic planning. This paper introduces a novel approach to quantify uncertainty for radiance fields by deriving pixel-wise moment expressions from the rendering equation. While radiance fields offer powerful scene representations, their high dimensionality and complexity have historically made uncertainty quantification computationally prohibitive for real-time applications. This paper demonstrates that the probabilistic nature of the rendering process enables efficient and differentiable computation of higher-order moments for radiance field outputs, including color, depth, and semantic predictions. The proposed method outperforms existing radiance field uncertainty estimation techniques while offering a more direct, computationally efficient, and differentiable formulation without the need for post-processing. Beyond uncertainty quantification, this paper also illustrates the utility of the proposed approach in downstream applications such as next-best-view (NBV) selection and active ray sampling for neural radiance field training. Extensive experiments on both synthetic and real-world scenes demonstrate state-of-the-art performance, confirming that principled uncertainty quantification can be seamlessly integrated into radiance field pipelines without sacrificing efficiency or accuracy.
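One way to read the moment-based idea is that the standard volume-rendering weights along a ray act as a discrete probability distribution over samples, so a per-pixel mean and variance follow directly. The sketch below illustrates that reading for a scalar quantity such as depth. It is not the paper's derivation: the renormalisation of the weights, the function name, and the inputs are assumptions made for this example.

```python
import numpy as np

def ray_moments(alphas, values):
    """First two moments of a scalar quantity along one ray (sketch).

    alphas: (n,) per-sample opacities in [0, 1], ordered near to far
    values: (n,) per-sample scalar values (e.g. sample depths)
    """
    alphas = np.asarray(alphas, dtype=float)
    values = np.asarray(values, dtype=float)
    # Transmittance reaching sample i: T_i = prod_{j<i} (1 - alpha_j).
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    w = trans * alphas
    # Assumption: renormalise so the weights sum to one; this requires
    # at least one sample with non-zero rendering weight.
    w = w / w.sum()
    mean = np.sum(w * values)
    var = np.sum(w * values ** 2) - mean ** 2
    return float(mean), float(var)
```

A ray that hits a fully opaque first sample returns zero variance, while a ray split evenly between two surfaces returns a variance that grows with their separation.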

Abstract:
Humanoid facial expression shadowing enables robots to realistically imitate human facial expressions in real time, which is critical for lifelike, facially expressive humanoid robots and affective human-robot interaction. Existing progress in humanoid facial expression imitation remains limited, often failing to achieve either real-time performance or realistic expressiveness due to offline video-based inference designs and insufficient ability to capture and transfer subtle facial details. To address these limitations, we present VividFace, a real-time and realistic facial expression shadowing system for humanoid robots. An optimized imitation framework X2CNet++ enhances expressiveness by fine-tuning the human-to-humanoid facial motion transfer module and introducing a feature-adaptation training strategy for better alignment across different image sources. Real-time shadowing is further enabled by a video-stream-compatible inference pipeline and a streamlined workflow based on asynchronous I/O for efficient communication across devices. VividFace produces vivid humanoid faces by mimicking human facial expressions within 0.05 seconds, while generalizing across diverse facial configurations. Extensive real-world demonstrations validate its practical utility. Videos are available at: https://lipzh5.github.io/VividFace/.

Abstract:
Humans subconsciously choose robust ways of selecting and using tools, for example, choosing a ladle over a flat spatula to serve meatballs. However, robustness under external disturbances remains underexplored in robotic tool-use planning. This paper presents a robustness-aware method that jointly selects tools and plans contact-rich manipulation trajectories, explicitly optimizing for robustness against disturbances. At the core of our method is an energy-based robustness metric that guides the planner toward robust manipulation behaviors. We formulate a hierarchical optimization pipeline that first identifies a tool and configuration that optimizes robustness, and then plans a corresponding manipulation trajectory that maintains robustness throughout execution. We evaluate our method across three representative tool-use tasks. Simulation and real-world results demonstrate that our method consistently selects robust tools and generates disturbance-resilient manipulation plans.

Abstract:
We introduce a novel framework that integrates Semantic Digital Twins (SDTs) with Large Language Models (LLMs) to enable adaptive and goal-driven robotic task execution in dynamic environments. The system decomposes natural language instructions into structured action triplets, which are grounded in contextual environmental data provided by the SDT. This semantic grounding allows the robot to interpret object affordances and interaction rules, enabling action planning and real-time adaptability. In case of execution failures, the LLM utilizes error feedback and SDT insights to generate recovery strategies and iteratively revise the action plan. We evaluate our approach using tasks from the ALFRED benchmark, demonstrating robust performance across various household scenarios. The proposed framework effectively combines high-level reasoning with semantic environment understanding, achieving reliable task completion in the face of uncertainty and failure.

Abstract:
Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver's activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically-guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human-robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.

Abstract:
3D semantic occupancy prediction is essential for achieving safe, reliable autonomous driving and robotic navigation. Compared to camera-only perception systems, multi-modal pipelines, especially LiDAR-camera fusion methods, can produce more accurate and fine-grained predictions. Although voxel-based scene representations are widely used for semantic occupancy prediction, 3D Gaussians have emerged as a continuous and significantly more compact alternative. In this work, we propose a multi-modal Gaussian-based semantic occupancy prediction framework utilizing 3D deformable attention, namely GaussianFormer3D. We introduce a voxel-to-Gaussian initialization strategy that provides 3D Gaussians with accurate geometry priors from LiDAR data, and design a LiDAR-guided 3D deformable attention mechanism to refine these Gaussians using LiDAR-camera fusion features in a lifted 3D space. Extensive experiments on real-world on-road and off-road autonomous driving datasets demonstrate that GaussianFormer3D achieves state-of-the-art prediction performance with reduced memory consumption and improved efficiency. Project website: https://lunarlab-gatech.github.io/GaussianFormer3D/.

Abstract:
Moving object segmentation (MOS) is foundational for autonomous vehicle safety. However, the increasing diversity of LiDAR sensors creates a significant domain shift problem, causing models trained on one sensor to perform poorly when deployed on another. A naive approach of training on combined data from heterogeneous sensors leads to a biased model that favors high-density sensors while failing on sparse, low-resolution sensors. To address this issue, we propose X-MOS, a novel generalization framework based on multi-teacher knowledge distillation. X-MOS generates sensor-specific expert teacher models and employs a sensor-aware knowledge distillation strategy. This strategy uses the sensor type as privileged information to activate the most appropriate teacher at each training step, providing unambiguous learning signals to a single student model. Extensive experiments on the HeLiMOS dataset, which comprises four different LiDAR sensors, demonstrate the effectiveness of our framework. X-MOS mitigates training bias and achieves an overall test mIoU of 0.717, outperforming both naive training and the best individual expert teacher. Notably, it more than doubles the performance on the most challenging low-channel sensor. Furthermore, our model exhibits strong zero-shot generalization to unseen datasets with similar sensor types. This work provides a robust and scalable methodology for achieving cross-sensor generalization, which is foundational for more practical and adaptable perception systems in autonomous driving.

Abstract:
Open-vocabulary 3D scene understanding is crucial for robotics applications, such as natural language-driven manipulation, human-robot interaction, and autonomous navigation. Existing methods for querying 3D Gaussian Splatting often struggle with inconsistent 2D mask supervision and lack a robust 3D point-level retrieval mechanism. In this work, (i) we present a novel point-level querying framework that performs tracking on segmentation masks to establish a semantically consistent ground-truth for distilling the language Gaussians; (ii) we introduce a GT-anchored querying approach that first retrieves the distilled ground-truth and subsequently uses the ground-truth to query the individual Gaussians. Extensive experiments on three benchmark datasets demonstrate that the proposed method outperforms state-of-the-art methods. Our method achieves an mIoU improvement of +4.14, +20.42, and +1.7 on the LERF, 3D-OVS, and Replica datasets, respectively. These results validate our framework as a promising step toward open-vocabulary understanding in real-world robotic systems.

Abstract:
This paper presents an integrated optoelectronic tweezers platform that unifies hydrogel microstructure fabrication with subsequent dynamic microsphere manipulation, enabled by a dual-wavelength optical strategy for seamless and programmable control. Initial tests in low-conductivity aqueous media confirmed high-fidelity OET patterning, with edge roughness below 20 μm. Using a biocompatible GelMA-LAP system, we achieved: (i) precise single-region hydrogel structures (hexagram patterns, average deviation 2.779 μm); (ii) scalable, customizable assemblies of complex topologies, including gear and maze arrays; and (iii) coordinated navigation of single, double, and triple microspheres within hydrogels, with triple-sphere velocity differences <0.9 μm/s. This approach overcomes the conventional OET limitation of decoupled photolithography and particle manipulation, integrating structure fabrication with dynamic control. It provides a versatile, reproducible platform for tissue engineering scaffolds, targeted drug delivery, and multi-particle coordination.

Abstract:
LiDAR place recognition is a critical component of LiDAR-based localization pipelines, tasked with identifying previously visited places across diverse environments and temporal conditions. A growing body of deep learning-based approaches has recently tackled this problem. However, their performance often degrades when the models are deployed in unseen environments. Although offline fine-tuning can partly recover performance, it is prone to catastrophic forgetting of previously acquired knowledge and cannot respond quickly enough to rapidly changing data distributions. In this paper, we introduce OCLPlace, an online continual learning framework that learns directly from highly temporally correlated LiDAR streams and strikes a trade-off between rapid domain adaptation and resistance to catastrophic forgetting. To the best of our knowledge, OCLPlace is the first LiDAR place-recognition approach enhanced by online continual learning that can automatically adapt to new environments while mitigating catastrophic forgetting. Experimental results on six large-scale datasets, which cover both ground-view and aerial-view scenarios, demonstrate the effectiveness and robustness of our method. The source code will be publicly available at: https://github.com/npu-ius-lab/OCLPlace.

Abstract:
This paper proposes a framework for integrating latent representations from multi-view images, using adaptive weighting based on situational context to facilitate the generation of robot actions. Specifically, we introduce the multi-view gating unit (MGU), which assigns context-dependent weights to each dimension of the latent representations extracted from different viewpoints. By summing the corresponding dimensions across all viewpoints, we construct a fused latent representation that serves as input to a policy model. To enhance the effectiveness of the MGU and improve the accuracy of action generation, we incorporate a Kullback-Leibler (KL)-based alignment objective that encourages consistency between individual viewpoint representations and the fused representation. We evaluate the proposed framework through imitation-learning experiments in a kitchen-like real-robot environment across five tasks. The experimental results show that the MGU dynamically adapts to different contexts, thereby enabling successful task execution. Additionally, we compare our approach with a modified Action Chunking with Transformers (ACT) baseline and conduct an ablation study to assess the contribution of each component. The results show that our method achieves a task success rate of 84%, outperforming all baseline methods and validating the effectiveness of both the individual components and their integration within the proposed framework.
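The per-dimension gating and summation described above can be sketched as follows. The abstract does not say how the gate weights are normalised, so the softmax over the view axis is an assumption made for illustration, as are the function name and the gate logits, which stand in for the output of a context-dependent gating network. The KL-based alignment objective is not shown.

```python
import numpy as np

def gated_fusion(latents, gate_logits):
    """Fuse per-view latents with per-dimension gate weights (sketch).

    latents:     (num_views, latent_dim) latent vectors, one per viewpoint
    gate_logits: (num_views, latent_dim) hypothetical gating-network outputs
    Returns the fused (latent_dim,) vector and the (num_views, latent_dim) weights.
    """
    latents = np.asarray(latents, dtype=float)
    gate_logits = np.asarray(gate_logits, dtype=float)
    # Assumption: softmax across views, computed independently per dimension.
    shifted = gate_logits - gate_logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    weights = exp / exp.sum(axis=0, keepdims=True)
    # Weighted sum of corresponding dimensions across all viewpoints.
    fused = (weights * latents).sum(axis=0)
    return fused, weights
```

Equal logits give a plain average of the views, while a dominant logit in one dimension lets a single viewpoint supply that dimension of the fused representation.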

Abstract:
Preference-based Reinforcement Learning (RL) enables humans to shape complex goals via preference comparisons between sequences of state-action pairs. Most of the existing approaches focus on a singular objective, overlooking the complex causal reasoning that underpins preferences. However, many real-world challenges are multi-dimensional, and individuals can have different reasons behind their preferences. In this work, we rethink preference-based RL from a multi-objective perspective by distilling human preferences into multiple components. We leverage the zero-shot capabilities of large language models (LLMs) to infer preferences and better align various objectives from text prompts. This allows us to train an ensemble of reward functions, each optimizing for a specific objective. We demonstrate that our approach can address a variety of multi-objective control tasks, improving on approaches that consider a single preference per objective. We show the effectiveness of our approach in better shaping reward functions by utilizing real human preferences and prompts. Our code for the benchmarks, along with additional supplementary details, is available at https://sites.google.com/view/multi-pref/.

Abstract:
This paper introduces a novel automatic coverage path planning algorithm for bathymetry surveying with unmanned surface vehicles. The detection range of the mapping sensor employed -- a multibeam echo sounder -- is heavily influenced by local seafloor depths. Hence, a path designed to uniformly cover the sea surface does not guarantee uniform coverage of the seafloor. Yet this is currently the typical process for bathymetric surveys, with the simplistic boustrophedon scheme along manually selected waypoints at constant depths being the most widespread planner used. The proposed scheme incorporates coarse prior depth information to pre-process the target region and adaptively guide path generation and sensing range configuration. By explicitly accounting for depth variations, the proposed algorithm designs a coverage path with optimised spacing between survey passes that adjusts the sensing beam density to achieve more consistent seafloor coverage. The proposed method is shown to offer significant improvements in both controlled and real-world scenarios. Validation in challenging synthetic terrains achieves coverage ratios beyond 99%, a marked improvement over traditional boustrophedon paths, which reach at most 75% coverage. The same trend appears in realistic simulations using real bathymetric data from a coastal harbour, with coverage reaching over 92% and significantly surpassing boustrophedon sweeps, whose coverage rates stay below 65%. Beyond improved performance, the scheme also brings a fully automated design, suitable for autonomous marine vehicles, thus offering practical utility for real-world applications.
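
The dependence of sensor footprint on depth can be made concrete with a small sketch. It is not the paper's planner: it assumes a locally flat seafloor, a fixed total swath angle, and a fixed overlap fraction between adjacent passes, and the function names and parameters are hypothetical. Under those assumptions the swath width is w = 2 d tan(θ/2), so the spacing between passes that keeps a constant overlap grows linearly with depth d.

```python
import math

def swath_width(depth_m, swath_angle_deg):
    """Across-track footprint on a flat seafloor at the given depth."""
    half_angle = math.radians(swath_angle_deg) / 2.0
    return 2.0 * depth_m * math.tan(half_angle)

def line_spacing(depth_m, swath_angle_deg, overlap=0.2):
    """Spacing between adjacent passes for a given overlap fraction."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    return (1.0 - overlap) * swath_width(depth_m, swath_angle_deg)
```

A fixed surface spacing tuned for deep water therefore leaves gaps in shallow water, which is the coverage loss that the abstract attributes to uniform boustrophedon paths.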

Abstract:
Whenever humans and robots work together, it is essential that unexpected robot behavior can be explained to the user. Especially in applications such as shared control, the user and the robot must share the same model of the objects in the world and of the actions that can be performed on these objects. In this paper, we achieve this with a so-called model reconciliation framework. We leverage a Large Language Model to predict and explain the difference between the robot's and the human's mental models, without the need for a formal mental model of the user. Furthermore, our framework aims to resolve the model divergence after the explanation by allowing the human to correct the robot. We provide an implementation in an assistive robotics domain, where we conduct a set of experiments with a real wheelchair-based mobile manipulator and its digital twin.

Abstract:
Visual inertial odometry (VIO) serves as a cornerstone of environmental perception and spatial localization, with broad applications in autonomous driving, robotic navigation, and embodied intelligence. Although recent deep learning based VIO methods have achieved impressive accuracy and computational efficiency, most approaches optimize errors within a maximum a posteriori (MAP) framework, often overlooking explicit prior modeling which constrains the upper bounds of achievable performance. To address this challenge, Diff-VIO is introduced, which is a VIO optimization framework grounded in diffusion models. An end-to-end coarse pose generator is first employed. It outputs an initial pose estimate and supplies priors for the diffusion refinement. To constrain the solution space, a diffusion-based refinement module injects pose priors during generation. This process is supported by a global context transformer encoder and a conditional decoder, which model long-range dependencies and predict residual noise for precise pose refinement. Experiments conducted on the KITTI benchmark demonstrate that the proposed method outperforms state-of-the-art VIO techniques in both accuracy and robustness. Additional evaluations on a dataset collected with an Intel RealSense D435i further validate the strong generalization capability of the proposed method across diverse hardware platforms. As the first diffusion-based VIO framework, Diff-VIO introduces a novel optimization paradigm for learning-based visual-inertial odometry systems.

Abstract:
Vision Language Models (VLMs) have demonstrated remarkable performance in open-world zero-shot visual recognition. However, their potential in space-related applications remains largely unexplored. In the space domain, accurate manual annotation is particularly challenging due to factors such as low visibility, illumination variations, and object blending with planetary backgrounds. Developing methods that can detect and segment spacecraft and orbital targets without requiring extensive manual labeling is therefore of critical importance. In this work, we propose an annotation-free detection and segmentation pipeline for space targets using VLMs. Our approach begins by automatically generating pseudo-labels for a small subset of unlabeled real data with a pre-trained VLM. These pseudo-labels are then leveraged in a teacher-student label distillation framework to train lightweight models. Despite the inherent noise in the pseudo-labels, the distillation process leads to substantial performance gains over direct zero-shot VLM inference. Experimental evaluations on the SPARK-2024, SPEED+, and TANGO datasets on segmentation tasks demonstrate consistent improvements in average precision (AP) by up to 10 points. Code and models are available at https://github.com/giddyyupp/annotation-free-spacecraft-segmentation.

Abstract:
Cross-modal Visual Geo-localization often aims to retrieve a satellite visible-light image of the same geographic location from a large-scale database using an infrared image captured by an unmanned aerial vehicle (UAV), thereby achieving precise localization. This capability is crucial for autonomous drone localization and navigation in low-light conditions such as nighttime or smoky environments. However, research in this field is still in its nascent stage, with existing methods being few in number and limited in precision. To address these issues, this paper proposes a structure-aware and fusion-loss constrained cross-modal geo-localization network (SAFL-Geo), which enhances the accuracy of cross-modal image retrieval. Specifically, we design a structure-aware module embedded into the network backbone, substantially enhancing the model's ability to perceive and extract cross-modally consistent structural features (such as road and building contours). Furthermore, we propose a feature enhancement and aggregation module that projects the refined multi-modal representations into a unified embedding space, effectively reducing the cross-modal representation gap while preserving discriminative semantic structures. Finally, we propose a fusion loss constraint strategy that constructs intermediate fused features as a bridge to constrain the distribution distances between infrared and fused features, as well as between visible and fused features, thereby indirectly mitigating the modality gap. Extensive experiments on the Boson datasets show that our SAFL-Geo achieves superior state-of-the-art performance.

Abstract:
As of today, automating contact-rich industrial manipulation processes, such as insertion, plugging, and screw-driving, is tedious and requires expert knowledge. The processes consist of programmable, common action units, like moving to a pose and establishing contact. However, the user still has to decide on fixed transition conditions to successfully complete each sub-action. Instead, we introduce a tactile memory-driven policy blending framework based on unified force-impedance control to transition autonomously. At the core of our approach lies a structured representation of manipulation as a sequence of basic operations combined into any relevant process, each governed by real-time sensory feedback and annotated with process quality metrics (PQMs), capturing motion, force, and energy-level interactions. A bidirectional long short-term memory (BiLSTM) model encodes the recent PQM histories to determine basic operation success. Soft blending weights are then generated, allowing smooth, adaptive transitions between operations without manual phase definition. To ensure functional safety during contact, we integrate an energy tank mechanism that enforces passivity by regulating energy exchange. The resulting control scheme enables robust and continuous tactile manipulation across variations in object geometry and spatial configurations. Experimental validation across four processes over five objects and two position variants demonstrates successful transfer and resilience to position disturbances. Our findings highlight that learned tactile memory and quality feedback embedded in the control loop serve as a principled foundation for intelligent and transferable manipulation, allowing fully autonomous process planning and execution in the future.

Abstract:
Sampling-based model predictive control (MPC) is experiencing a resurgence in robotics following both recent hardware successes and advancements in parallelized physics simulation. However, to build on this progress, the robotics community needs to develop shared tools for prototyping, benchmarking, and deploying sampling-based controllers. We introduce judo, a software package designed to address this need. To facilitate rapid prototyping and evaluation, judo provides robust implementations of common sampling-based MPC algorithms and a comprehensive suite of benchmark tasks. It emphasizes usability with simple but extensible interfaces for controller and task definitions, asynchronous execution for straightforward simulation-to-hardware transfer, and a highly customizable interactive GUI for tuning controllers interactively. While the high-level library is written in Python, judo leverages MuJoCo as its physics backend to achieve real-time performance. We present example benchmarking results using judo to compare standard sampling-based controllers across its tasks. We also provide real-world case studies in deploying judo on hardware for two contact-rich tasks: in-hand cube rotation and quadrupedal loco-manipulation. Code at https://github.com/bdaiinstitute/judo.

Abstract:
Reliable dual-arm grasping is essential for manipulating large and complex objects but remains a challenging problem due to stability, collision, and generalization requirements. Prior methods typically decompose the task into two independent grasp proposals, relying on region priors or heuristics that limit generalization and provide no principled guarantee of stability. We propose DAGDiff, an end-to-end framework that directly denoises to grasp pairs in the SE(3) × SE(3) space. Our key insight is that stability and collision can be enforced more effectively by guiding the diffusion process with classifier signals, rather than relying on explicit region detection or object priors. To this end, DAGDiff integrates geometry-, stability-, and collision-aware guidance terms that steer the generative process toward grasps that are physically valid and force-closure compliant. We comprehensively evaluate DAGDiff through analytical force-closure checks, collision analysis, and large-scale physics-based simulations, showing consistent improvements over previous work on these metrics. Finally, we demonstrate that our framework generates dual-arm grasps directly on real-world point clouds of previously unseen objects, which are executed on a heterogeneous dual-arm setup where two manipulators reliably grasp and lift them.

Abstract:
Fluid-driven soft robots promise high-dimensional motion and cost-effective scalability, but their performance is constrained by the limited flow capacity of compact solenoid valves and the integration challenges of large membrane valves. This paper introduces a modular hybrid valve architecture that couples a high-flow membrane valve with an integrated miniature solenoid pilot. The resulting composite element achieves both high mass flow rates and precise electronic control while maintaining a compact, lightweight, and fabrication-friendly design. We present the design, modeling, and control strategies for these valves, and evaluate their performance through three experiments: tank pressure regulation, actuation of a weight-curling robot, and integration into a planar two-stage tossing robot. Across all cases, the hybrid membrane valves significantly outperformed solenoid valves, exhibiting faster pressurization and venting, higher bandwidth, and over a five-fold increase in mechanical power output. These results demonstrate that membrane-solenoid hybrid valves provide a scalable and integrable solution for overcoming the piping-problem, enabling hyper-actuated soft robotic systems.

Abstract:
We propose a decentralized, learning-based framework for dynamic coalition formation in Multi-Robot Task Allocation (MRTA). Our approach extends MAPPO by integrating spatial action maps, robot motion planning, intention sharing, and task allocation revision to enable effective and adaptive coalition formation. Extensive simulation studies confirm the effectiveness of our model, enabling each robot to rely solely on local information to learn timely revisions of task selections and form coalitions with other robots to complete collaborative tasks. The results also highlight the proposed framework's ability to handle large robot populations and adapt to scenarios with diverse task sets.

Abstract:
Sampling-based motion planning algorithms, like the Rapidly-Exploring Random Tree (RRT) and its widely used variant, RRT-Connect, provide efficient solutions for high-dimensional planning problems faced by real-world robots. However, these methods remain computationally intensive, particularly in complex environments that require many collision checks. To improve performance, recent efforts have explored parallelizing specific components of RRT, such as collision checking, or running multiple planners independently. However, little has been done to develop an integrated approach co-designed for large-scale parallelism. In this work, we present pRRTC, an RRT-Connect-based planner co-designed for GPU acceleration across the entire algorithm through parallel expansion and SIMT-optimized collision checking. We evaluate the effectiveness of pRRTC on the MotionBenchMaker dataset using robots with 7, 8, and 14 degrees-of-freedom (DoF). Compared to the state-of-the-art, pRRTC achieves as much as a 10× speedup on constrained reaching tasks with a 5.4× reduction in standard deviation. pRRTC also achieves a 1.4× reduction in average initial path cost. Finally, we deploy pRRTC on a 14-DoF dual Franka Panda arm setup and demonstrate real-time, collision-free motion planning with dynamic obstacles. We open-source our planner to support the wider community.

Abstract:
Autonomous surgical systems must adapt to highly dynamic environments where tissue properties and visual cues evolve rapidly. Central to such adaptability is feedback: the ability to sense, interpret, and respond to changes during execution. While feedback mechanisms have been explored in surgical robotics, ranging from tool and tissue tracking to error detection, existing methods remain limited in handling the topological and perceptual challenges of tissue dissection. In this work, we propose a feedback-enabled framework for autonomous tissue dissection that explicitly reasons about topological changes from endoscopic images after each dissection action. This structured feedback guides subsequent actions, enabling the system to localize dissection progress and adapt policies online. To improve the reliability of such feedback, we introduce visibility metrics that quantify tissue exposure and formulate optimal controller designs that actively manipulate tissue to maximize visibility. Finally, we integrate these feedback mechanisms with both control-based and learning-based dissection methods, and demonstrate experimentally that they significantly enhance autonomy, reduce errors, and improve robustness in complex surgical scenarios.

Abstract:
Trajectory optimization depends heavily on initialization. In particular, sampling-based approaches are highly sensitive to initial solutions, and limited exploration frequently leads them to converge to local minima in complex environments. We present Uncertainty Guided Exploratory Trajectory Optimization (UGE-TO), a trajectory optimization algorithm that generates well-separated samples to achieve better coverage of the configuration space. UGE-TO represents trajectories as probability distributions induced by uncertainty ellipsoids. Unlike sampling-based approaches that explore only in the action space, this representation captures the effects of both system dynamics and action selection. By incorporating the impact of dynamics, in addition to the action space, into our distributions, our method enhances trajectory diversity by enforcing distributional separation via the Hellinger distance between them. It enables a systematic exploration of the configuration space and improves robustness against local minima. Further, we present UGE-MPC, which integrates UGE-TO into sampling-based model predictive control methods. Experiments demonstrate that UGE-MPC achieves higher exploration and faster convergence in trajectory optimization compared to baselines under the same sampling budget, achieving 72.1% faster convergence in obstacle-free environments and 66% faster convergence with a 6.7% higher success rate in the cluttered environment compared to the best-performing baseline. Additionally, we validate the approach through a range of simulation scenarios and real-world experiments. Our results indicate that UGE-MPC has higher success rates and faster convergence, especially in environments that demand significant deviations from nominal trajectories to avoid failures. The project and code are available at https://ogpoyrazoglu.github.io/cuniform_sampling/.
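
For readers unfamiliar with the separation measure named in this abstract, the sketch below evaluates the standard closed-form Hellinger distance between two multivariate Gaussians. Treating the distributions induced by uncertainty ellipsoids as Gaussians is an assumption made here for illustration; the abstract does not specify the distribution family, and the function name `hellinger_gaussians` is hypothetical.

```python
import numpy as np

def hellinger_gaussians(mu1, cov1, mu2, cov2):
    """Hellinger distance in [0, 1] between N(mu1, cov1) and N(mu2, cov2)."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    cov_avg = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Bhattacharyya coefficient: determinant ratio times a Mahalanobis term.
    det_term = (np.linalg.det(cov1) ** 0.25 * np.linalg.det(cov2) ** 0.25
                / np.sqrt(np.linalg.det(cov_avg)))
    exp_term = np.exp(-0.125 * diff @ np.linalg.solve(cov_avg, diff))
    bc = det_term * exp_term
    # Clamp to guard against tiny negative values from round-off.
    return float(np.sqrt(max(0.0, 1.0 - bc)))
```

The distance is 0 for identical distributions and approaches 1 as they stop overlapping, so a sampler that rewards larger pairwise values pushes trajectory distributions apart.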

Abstract:
Designing robot morphologies and kinematics has traditionally relied on human intuition, with little systematic foundation. Motion-design co-optimization offers a promising path toward automation, but two major challenges remain: (i) the vast, unstructured design space and (ii) the difficulty of constructing task-specific loss functions. We propose a new paradigm that minimizes human involvement by (i) learning the design search space from existing mechanical designs, rather than hand-crafting it, and (ii) defining the loss directly from human motion data via motion retargeting and Procrustes analysis. Using screw-theory-based joint axis representation and isometric manifold learning, we construct a compact, geometry-preserving latent space of robot designs in which optimization is tractable. We then solve design optimization in this latent space using gradient-free optimization. Our approach establishes a principled framework for data-driven robot design and demonstrates that leveraging existing designs and human motion can effectively guide the automated discovery of novel robot designs.

Abstract:
Reinforcement learning (RL) offers a powerful approach for robots to learn complex, collaborative skills by combining Dynamic Movement Primitives (DMPs) for motion and Variable Impedance Control (VIC) for compliant interaction. However, this model-free paradigm often risks instability and unsafe exploration due to the time-varying nature of impedance gains. This work introduces Certified Gaussian-Manifold Sampling (C-GMS), a novel trajectory-centric RL framework that learns combined DMP and VIC policies while guaranteeing Lyapunov stability and actuator feasibility by construction. Our approach reframes policy exploration as sampling from a mathematically defined manifold of stable gain schedules. This ensures every policy rollout is guaranteed to be stable and physically realizable, thereby eliminating the need for reward penalties or post-hoc validation. Furthermore, we provide a theoretical guarantee that our approach ensures bounded tracking error even in the presence of bounded model errors and deployment-time uncertainties. We demonstrate the effectiveness of C-GMS in simulation and verify its efficacy on a real robot, paving the way for reliable autonomous interaction in complex environments.

Abstract:
We present a reinforcement learning framework for quadrupedal wall-climbing locomotion that explicitly addresses uncertainty in magnetic foot adhesion. A physics-based adhesion model of a quadrupedal magnetic climbing robot is incorporated into simulation to capture partial contact, air-gap sensitivity, and probabilistic attachment failures. To stabilize learning and enable reliable transfer, we design a three-phase curriculum: (1) acquire a crawl gait on flat ground without adhesion, (2) gradually rotate the gravity vector to vertical while activating the adhesion model, and (3) inject stochastic adhesion failures to encourage slip recovery. The learned policy achieves a high success rate, strong adhesion retention, and rapid recovery from detachment in simulation under degraded adhesion. Compared with a model predictive control (MPC) baseline that assumes perfect adhesion, our controller maintains locomotion when attachment is intermittently lost. Hardware experiments with the untethered robot further confirm robust vertical crawling on steel surfaces, maintaining stability despite transient misalignment and incomplete attachment. These results show that combining curriculum learning with realistic adhesion modeling provides a resilient sim-to-real framework for magnetic climbing robots in complex environments.

Abstract:
From locomotion to dextrous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dextrous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data spanning 260 tasks in 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website: https://github.com/anonymouse5202077/Humanoid-Everyday

Abstract:
Inspection of underwater structures with tethered underwater vehicles is often hindered by the risk of tether entanglement. We propose REACT (real-time entanglement-aware coverage path planning for tethered underwater vehicles), a framework designed to overcome this limitation. REACT comprises a computationally efficient geometry-based tether model using the signed distance field (SDF) map for accurate, real-time simulation of taut tether configurations around arbitrary structures in 3D. This model enables an efficient online replanning strategy by enforcing a maximum tether length constraint, thereby actively preventing entanglement. By integrating REACT into a coverage path planning framework, we achieve safe and entanglement-free inspection paths, previously challenging due to tether constraints. The complete REACT framework's efficacy is validated in a pipe inspection scenario, demonstrating safe navigation and full coverage inspection. Simulation results show that REACT achieves complete coverage while maintaining tether constraints and completing the total mission 20% faster than conventional planners, despite a longer inspection time due to proactive avoidance of entanglement that eliminates extensive post-mission disentanglement. Real-world experiments confirm these benefits, where REACT completes the full mission, while the baseline planner fails due to physical tether entanglement.

Abstract:
Automation in underground mining has the potential to significantly enhance safety, operational efficiency, and sustainability. However, effectively coordinating fleets of autonomous vehicles in dynamic mine environments introduces substantial challenges in both optimization and motion planning. To address these challenges, we introduce and formalize the Block Cave Mining (BCM) problem, which focuses on computing a transport plan that maximizes ore throughput while satisfying draw ratio constraints. To solve this problem, we propose SAMM, an eventually optimal anytime solver that jointly integrates task assignment, scheduling, and path planning via a mixed-integer linear programming formulation. To improve scalability, we also introduce SAMMS, a variant of SAMM that trades optimality guarantees for efficiency by decomposing the problem into shorter planning subcycles. Experimental evaluations using realistic industrial mine scenarios demonstrate that SAMMS achieves near-optimal throughput and scales effectively to larger fleets and mine layouts.

Abstract:
Robot planning in partially observable environments, where not all objects are known or visible, is a challenging problem, as it requires reasoning under uncertainty through partially observable Markov decision processes. During the execution of a computed plan, a robot may unexpectedly observe task-irrelevant objects, which are typically ignored by naive planners. In this work, we propose incorporating two types of common-sense knowledge: (1) certain objects are more likely to be found in specific locations; and (2) similar objects are likely to be co-located, while dissimilar objects are less likely to be found together. Manually engineering such knowledge is complex, so we explore leveraging the powerful common-sense reasoning capabilities of large language models (LLMs). Our planning and execution framework, CoCo-TAMP, introduces a hierarchical state estimation that uses LLM-guided information to shape the belief over task-relevant objects, enabling efficient solutions to long-horizon task and motion planning problems. In experiments, CoCo-TAMP achieves an average reduction of 62.7% in planning and execution time in simulation, and 72.6% in real-world demonstrations, compared to a baseline that does not incorporate either type of common-sense knowledge.

Abstract:
Soft grippers exhibit exceptional adaptability to novel objects and tasks, making them suitable for safe and effective operation in human-centered applications. To improve their stiffness and gripping force, jamming techniques have frequently been used in manipulating objects of diverse shapes and weights. However, existing jamming-based grippers suffer from significant limitations, including complex and expensive fabrication, excessive weight, slow recovery response, and bending instability due to requiring a high level of vacuum to achieve jamming. This paper presents a novel design of a pneumatically actuated, flexible-rod-jamming-based soft gripper. It consists of zigzag-based driving chambers that allow the actuator to bend upon pressurization. Additionally, it includes a zigzag jamming chamber filled with flexible rods, which are fabricated by activating the internal support. The design is fabricated entirely from Elastic 50A resin using the Stereolithography (SLA) process, without the need for additional fabrication procedures. The prototype's stiffness is controlled by regulating the vacuum inside the jamming chamber. A nonlinear static analysis based on the third-order Yeoh model is conducted to investigate the actuator performance in terms of safety and deflection under various operating conditions. The performance of the prototype is evaluated against a conventional actuator with respect to bending repeatability and payload capacity. The experimental results show that the proposed design achieves a bending angle of 178° and carries an external load of 200 g. Additionally, it exhibits low deflection during bending compared to a traditional zigzag actuator.

Abstract:
Standard imitation learning (IL) methods have achieved considerable success in robotics, yet often rely on the Markov assumption, which falters in long-horizon tasks where history is crucial for resolving perceptual ambiguity. This limitation stems not only from a conceptual gap but also from a fundamental computational barrier: prevailing architectures like Transformers are often constrained by quadratic complexity, rendering the processing of long, high-dimensional observation sequences infeasible. To overcome this dual challenge, we introduce Mamba Temporal Imitation Learning (MTIL). Our approach represents a new paradigm for robotic learning, which we frame as a practical synthesis of World Model and Dynamical System concepts. By leveraging the linear-time recurrent dynamics of State Space Models (SSMs), MTIL learns an implicit, action-oriented world model that efficiently encodes the entire trajectory history into a compressed, evolving state. This allows the policy to be conditioned on a comprehensive temporal context, transcending the confines of Markovian approaches. Through extensive experiments on simulated benchmarks (ACT, Robomimic, LIBERO) and on challenging real-world tasks, MTIL demonstrates superior performance against SOTA methods like ACT and Diffusion Policy, particularly in resolving long-term temporal ambiguities. Our findings not only affirm the necessity of full temporal context but also validate MTIL as a powerful and computationally feasible approach for learning long-horizon, non-Markovian behaviors from high-dimensional observations.

Abstract:
As autonomous mobile robots increasingly operate in real-world environments, safety has emerged as a critical challenge, particularly regarding obstacle and pedestrian detection in building blind spots and reliable traffic signal recognition. While traditional Vehicle-to-Infrastructure (V2I) systems adopt high-capacity communication through 5G networks or via Optical Wireless Communication (OWC), these approaches require dedicated communication hardware that proves impractical for small, low-cost robots. Additionally, the communication bandwidth required for robot-oriented V2I, such as blind spot object detection and traffic signal states, is relatively limited; the high-capacity communication of 5G is often unnecessary. To address these challenges, we propose a novel optical communication system named Optical LiDAR Communication (OLC), which repurposes existing LiDAR sensors as communication devices. By integrating LiDAR Injection with 2D Code technology, OLC achieves cost-effectiveness through V2I communication without requiring additional hardware on robots. Real-world experiments confirmed that the proposed method achieves a communication success rate of over 76% at distances up to 30 meters. Furthermore, as a proof-of-concept, we develop two key V2I systems utilizing OLC: traffic signal information transmission and blind-spot obstacle detection, and real-time communication performance was demonstrated. These results indicate that the proposed method has potential as a V2I platform for next-generation robotics infrastructure.

Abstract:
In this letter, we propose a tiered systematic framework to enhance the overall efficiency and environmental coverage of autonomous exploration for Autonomous Ground Vehicles (AGVs) in complex environments with narrow regions. At the local level, we introduce a novel Multi-cause Triggering Sensor Model (MTSM) to improve informative observation acquisition in narrow regions. Furthermore, the Frontier set is defined from a probabilistic distribution perspective and utilized to optimize the initial training pool of Bayesian optimization, thereby accelerating convergence toward the optimal navigation target point. At the global level, we incrementally maintain an Information-Rich Sparse Roadmap (IRSR) by leveraging accumulated historical exploration knowledge. When a dead zone situation is detected, the heuristic guidance is activated and realized by graph search considering information content and distance between IRSR vertices, enabling the AGV to maintain a continuous and sustained exploration process. Three simulation scenarios with increasing complexity are designed, in which comprehensive comparisons and evaluations against different types of state-of-the-art approaches are conducted. The results demonstrate that our framework achieves a favorable balance between algorithm runtime, exploration efficiency and coverage completeness, with superior performance in narrow regions. Subsequent real-world experiments further validate the strong potential of our proposed method for practical applications.

Abstract:
Learning from demonstration is a promising way to teach robots new skills. However, a central challenge in executing acquired skills is the ability to recognize faults and prevent failures. This is essential since the demonstrations usually cover only a limited number of mostly successful cases. During task execution, unexpected situations that were not encountered during demonstrations may occur. Examples include changes in the robot's environment or interaction with human operators. To recognize such situations, this paper focuses on teaching the robot situational awareness by using a camera input and labeling frames as safe or risky. We train a Gaussian Process regression model fed by a low-dimensional latent space representation of the input images. The model outputs a continuous risk score ranging from zero to one, quantifying the level of risk evidence at each timestep. This allows for pausing task execution in unsafe situations and directly adding new training data, labeled by the human user. Our experiments on a robotic manipulator show that our proposed method can reliably detect both known and novel faults using only a small amount of user-provided data. In contrast, a standard Multi-Layer Perceptron performs well only on faults it has encountered during training. Our method enables the next generation of cobots to be rapidly deployed with easy-to-set-up, vision-based risk assessment, proactively safeguarding humans and detecting misaligned parts or missing objects before failures occur.

Abstract:
Dual-arm cooperative manipulation holds great promise for tackling complex real-world tasks that demand seamless coordination and adaptive dynamics. Despite substantial progress in learning-based motion planning, most approaches struggle to generalize across diverse manipulation tasks and adapt to dynamic, unstructured environments, particularly in scenarios involving interactions between two objects such as assembly, tool use, and bimanual grasping. To address these challenges, we introduce a novel VLM-Assisted Siamese Flow Diffusion (VLM-SFD) framework for efficient imitation learning in dual-arm cooperative manipulation. The proposed VLM-SFD framework exhibits outstanding adaptability, significantly enhancing the ability to rapidly adapt and generalize to diverse real-world tasks from only a minimal number of human demonstrations. Specifically, we propose a Siamese Flow Diffusion Network (SFDNet) that employs a dual-encoder-decoder Siamese architecture to embed two target objects into a shared latent space, while a diffusion-based conditioning process, conditioned by task instructions, generates two-stream object-centric motion flows that guide dual-arm coordination. We further design a dynamic task assignment strategy that seamlessly maps the predicted 2D motion flows into 3D space and incorporates a pre-trained vision-language model (VLM) to adaptively assign the optimal motion to each robotic arm over time. Experiments validate the effectiveness of the proposed method, demonstrating its ability to generalize to diverse manipulation tasks while maintaining high efficiency and adaptability. The code and demo videos are publicly available on our project website: https://sites.google.com/view/vlm-sfd/.

Abstract:
Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme, traditionally used in classification, with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.
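
To make the depth-binning idea concrete, the sketch below uses a common formulation in which a pixel's depth is the probability-weighted sum of bin centers, and a KL divergence compares teacher and student bin probabilities. Both choices are assumptions for illustration; the abstract does not state the exact losses, and all names and numbers here are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over per-bin logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def depth_from_bins(logits: List[float], centers_m: List[float]) -> float:
    """Expected depth in meters: sum_i p_i * c_i over the depth bins."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers_m))


def kl_divergence(teacher: List[float], student: List[float]) -> float:
    """KL(teacher || student) between two bin-probability vectors."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)


# Hypothetical pixel with three bins centered at 1 m, 2 m, and 3 m.
uniform_depth = depth_from_bins([0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
teacher_probs = softmax([2.0, 1.0, 0.0])
self_kl = kl_divergence(teacher_probs, teacher_probs)
```

Because the probabilities carry no scale, they can come from a relative-depth teacher, while the bin centers carry the metric scale.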

Abstract:
Humans possess the capability to seamlessly integrate tools into their body schema, enabling precise and adaptive interactions with the environment. This touch-mediated ability allows us to dexterously use tools in everyday tasks, an ability currently lacking in robotic systems. In this work, we propose a novel method for indirect force estimation in robotic tool use, a prerequisite for advanced tool use, leveraging vision-based tactile sensing (VTS) and deep learning techniques. By capturing high-resolution spatial deformations from tactile images, our model implicitly infers force transmission dynamics without requiring explicit knowledge of tool properties or material characteristics. We validate our approach across multiple tool types using a single trained machine learning model, demonstrating its generalization capability. This work represents the first demonstration of indirect force estimation for tool-mediated robotic interactions, offering a pathway toward more dexterous and adaptive robotic tool use in real-world applications.

Abstract:
Data-driven methods provide effective solutions for robot trajectory generation in dynamic environments. Many physical constraints exist in the real world, and learning these constraints well enough to generate kinematically or dynamically feasible trajectories demands large amounts of data. Because data-driven models are black boxes, it is also challenging to ensure the safety of the trajectories they plan. In this paper, we propose an end-to-end model (D2MFusion) that fuses data-driven components and a model-based optimizer. D2MFusion uses a differentiable optimization layer (dLQR) that forms a backpropagation loop with a perception network. With the input BEV image, the perception network outputs the environmental feature vector to adjust the optimizer parameters to adapt to the dynamic environment. We train this fusion planner to imitate expert trajectories on a real self-driving dataset and demonstrate the planner's explainability, data efficiency, and safe reactivity through closed-loop simulations. We also conduct experiments on a real quadrupedal robot (Unitree Go2) in three different scenarios to demonstrate the ability of our method to navigate in dynamic environments.

Abstract:
Global localization is critical for autonomous navigation, particularly in scenarios where an agent must localize within a map generated in a different session or by another agent, as agents often have no prior knowledge about the correlation between reference frames. However, this task remains challenging in unstructured environments due to appearance changes induced by viewpoint variation, seasonal changes, occlusions, and perceptual aliasing in homogeneous environments, all known failure modes for traditional place recognition methods. To address these challenges, we propose VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment), a novel open-set, monocular global localization framework that combines: 1) a front-end, object-based, segmentation and tracking pipeline, followed by 2) a submap correspondence search, which exploits geometric consistencies between environment maps to align vehicle reference frames. VISTA enables consistent localization across diverse camera viewpoints and seasonal changes, without requiring any domain-specific training or finetuning. We evaluate VISTA on seasonal and oblique-angle aerial datasets, achieving up to a 69% improvement in recall over baseline methods. Furthermore, we maintain a compact object-based map that is only 0.6% of the size of the most memory-conservative baseline, making our approach capable of real-time implementation on resource-constrained platforms.

Abstract:
Robots executing iterative tasks in complex, uncertain environments require control strategies that balance robustness, safety, and high performance. This paper introduces a safe information-theoretic learning model predictive control (SIT-LMPC) algorithm for iterative tasks. Specifically, we design an iterative control framework based on an information-theoretic model predictive control algorithm to address a constrained infinite-horizon optimal control problem for discrete-time nonlinear stochastic systems. An adaptive penalty method is developed to ensure safety while balancing optimality. Trajectories from previous iterations are utilized to learn a value function using normalizing flows, which enables richer uncertainty modeling compared to Gaussian priors. SIT-LMPC is designed for highly parallel execution on graphics processing units, allowing efficient real-time optimization. Benchmark simulations and hardware experiments demonstrate that SIT-LMPC iteratively improves system performance while robustly satisfying system constraints.

Abstract:
In this paper, we present a whole-body control framework that allows a wheeled-bipedal robot to achieve robust locomotion across diverse environments without relying on terrain perception. The proposed approach consists of a whole-body motion planner and an optimization-based torque computation module. By considering the floating-base dynamics of the robot, the motion planner produces terrain-adaptive behaviors using the zero moment point (ZMP) to preserve balance without prior knowledge of the terrain. In addition, the torque computation module combines a linear quadratic regulator (LQR) with a quadratic programming (QP)-based controller. The LQR computes wheel torques to regulate the body angle while addressing the inherent non-minimum phase characteristics. Using these wheel torques, the QP-based controller allocates optimal joint torques to achieve the desired motion and maintain stable balance. The proposed framework is validated on a wheeled-bipedal robot, demonstrating locomotion over various terrains, including slopes and stairs, as well as robustness against external disturbances.
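
As background on the LQR component, the sketch below solves a scalar discrete-time LQR problem by iterating the Riccati recursion. It is a textbook toy, not the paper's controller: the plant x_{k+1} = a x_k + b u_k, the cost weights `q` and `r`, and all numbers are hypothetical.

```python
def scalar_dlqr(a: float, b: float, q: float, r: float, iters: int = 500):
    """Iterate the scalar discrete Riccati recursion to a fixed point.

    Plant: x_{k+1} = a x_k + b u_k, cost: sum of q x_k^2 + r u_k^2.
    Returns (P, K), where u_k = -K x_k is the resulting feedback law.
    """
    P = q
    for _ in range(iters):
        P = q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)
    K = a * b * P / (r + b * b * P)
    return P, K


# Hypothetical open-loop unstable plant (|a| > 1).
a, b = 1.2, 1.0
P, K = scalar_dlqr(a, b, q=1.0, r=1.0)
closed_loop_pole = a - b * K
```

With these numbers the open-loop pole at 1.2 is moved inside the unit circle, so the regulated state decays instead of growing.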

Abstract:
Soft robots offer an opportunity to embed intelligence directly into morphology, potentially reducing the need for continuous feedback regulation. We present an autonomous, minimally actuated multi-stable soft robot for exploration in confined and cluttered environments. The robot is composed of a serial chain of multi-stable elastic elements whose energy landscape encodes discrete, passively stable configurations, enabling reversible shape transformation and shape retention without sustained actuation. A single mobile pneumatic actuator triggers transitions between these stable states, producing complex three-dimensional configurations with minimal hardware complexity. Autonomy is achieved through the integration of nonlinear hybrid modeling, visual pose estimation, and sampling-based motion planning within a ROS2 framework. Rather than regulating continuous deformation, computation in our system selects and sequences mechanically admissible state transitions, while structural multi-stability provides inherent stabilization and memory. Experimental results demonstrate closed-loop navigation in cluttered environments using this distributed balance between mechanics and control.

Abstract:
Autonomous unmanned aerial vehicles (UAVs) are traditionally controlled using behavior trees or state machines, which provide deterministic execution but limited adaptability in dynamic environments. Extending these conventional systems to handle new tasks requires manual specification of additional nodes or transitions, creating a scalability challenge as mission complexity increases. This work introduces a high-level mission manager leveraging local Large Language Models (LLMs) for autonomous UAV control. The system allows operators to issue high-level commands in natural language, which the LLM interprets and decomposes into sequences of ROS 2 actions, such as takeoff, navigation, object localization, and landing, without mission-specific programming. The LLM does not directly control the UAV but selects from a constrained set of tools mapped to ROS 2 actions or services. Real-time robot state is injected into the model context, ensuring that decisions are based on actual system status and environment perception.

Abstract:
Close-proximity inspection of underwater cylindrical structures is challenging due to nonlinear vehicle dynamics, flow disturbances, and payload uncertainty. Fixed-weight MPC provides structured constraint handling but lacks adaptivity, while model-free RL is adaptive but often unstable and unsafe under disturbances. We propose Marine AC-MPC, which combines a differentiable iLQR-based MPC layer with an actor-critic framework that learns time-varying MPC cost weights online. In MarineGym, the proposed method achieves more reliable orbit tracking and higher success rates than fixed-weight MPC and PPO baselines under disturbed conditions.

Abstract:
Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.

Abstract:
Tracking dynamic endpoint trajectories of deformable linear objects (DLOs) with a robotic manipulator remains challenging due to their complex non-linear behavior. While closed-loop Model Predictive Control (MPC) can account for these non-linearities, it requires an accurate dynamic model and precise state estimation. This paper introduces a closed-loop approach for controlling a DLO's endpoint to track dynamic 2D shapes. We model the DLO as a floating-base kinematic chain and present a new perspective on learning its dynamics using a data-driven approximation of its hybrid dynamics. Based on this model, we formulate an Optimal Control Problem (OCP), which we solve within the control loop using both linear MPC and DDP. We validate our approach with simulation and hardware experiments, demonstrating its ability to track dynamic endpoint motions.

Abstract:
While recent research has focused heavily on dexterous grasp pose generation, less attention has been devoted to the execution of planned grasps. Under shape and position uncertainty, open-loop execution often yields uncoordinated contacts, causing undesired in-hand object motion and even grasp failures. To address this, this paper proposes a tactile-driven model predictive controller for adaptive and delicate execution of diverse dexterous grasps. Our approach emphasizes multi-contact coordination across both approaching and grasping phases, with three key novelties: (i) coordination-aware phase separation, (ii) arm-hand coordination to compensate for position errors, and (iii) adaptive force coordination to increase contact forces in a balanced manner. An analytical model is employed to relate contact forces to robot joint motions for predictive control. Our formulation imposes no restrictions on grasp types or contact configurations and integrates seamlessly with state-of-the-art grasp pose generation methods. We validate the approach through large-scale simulations involving 15k grasps across 400 objects on three robotic hands, and real-world experiments on eight objects. Results demonstrate that our method achieves higher grasp success rates and reduced undesired object movements. Supplementary materials are available at https://ada-grasp-ctrl.github.io/.

Abstract:
From stacking a tower of blocks to serving a cup of coffee, stable object placement is a crucial skill for future robots. It becomes particularly challenging under geometric uncertainties, e.g., when the object pose or shape is not known accurately. This work leverages a differentiable simulation model of contact dynamics to tackle this challenge. We derive a novel gradient that relates force-torque sensor readings to geometric uncertainties, thus enabling uncertainty estimation by minimizing discrepancies between sensor data and model predictions via gradient descent. Gradient-based methods are sensitive to initialization. To mitigate this effect, we maintain a belief over multiple estimates and choose the robot action based on the current belief at each time step. In experiments on a Franka robot arm, our method achieved promising results on multiple objects under various geometric uncertainties, including the in-hand pose uncertainty of a grasped object, the object shape uncertainty, and the environment uncertainty.

Abstract:
We present a distributed first-order and second-order adaptive hybrid optimization algorithm (DAHO) for multi-robot systems. A team of robots collaboratively trains a shared deep neural network using only local data while exchanging model updates via peer-to-peer robot communication. Raw data never leaves the device, which preserves privacy and conserves communication bandwidth. The method blends a second-order Limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) method with an alternating direction method of multipliers (ADMM) based first-order method to obtain both the fast convergence of second-order methods and the robustness of first-order schemes. An automatic switching policy, guided by a convergence analysis rooted in trust region theory, selects the update type at each round. A soft switch mechanism derived from the same analysis mitigates oscillations during mode changes. Compared with four single-method baselines that range from first-order to second-order optimization, the proposed hybrid approach achieves faster convergence, superior accuracy, and near-centralized performance on robotics-related deep learning tasks.

Abstract:
Tensegrity robots excel in tasks requiring extreme levels of deformability and robustness. However, there are challenges in state estimation and payload versatility due to their high number of degrees of freedom and unconventional shape. This paper introduces a modular three-bar tensegrity robot featuring a customizable payload design. Our tensegrity robot employs a novel Quasi-Direct Drive (QDD) cable actuator with low-stretch polymer cables to achieve accurate proprioception without needing external force or torque sensors. The design allows for on-the-fly stiffness tuning for better environment and payload adaptability. In this paper, we present the robot's design, fabrication, assembly, and experimental results. Experimental data demonstrate high-accuracy cable length estimation (<1% error relative to bar length) and variable stiffness control of the cable actuator up to 7 times the minimum stiffness for self support. The shape morphing and stiffness tuning capabilities are leveraged in two realistic demonstrations. The presented tensegrity robot is a platform for future advancements in autonomous operation and open-source module design. Open source design files are available at (Redacted URL).

Abstract:
This work focuses on the problem of 6D pose estimation for novel objects when a reference 3D model or posed reference images are not available. While existing methods can estimate the precise 6D pose of objects, they heavily rely on curated CAD models or reference images, the preparation of which is a time-consuming and labor-intensive process. Moreover, in real-world scenarios, 3D models or reference images may not be available in advance and instant robot reaction is desired. In this work, we propose a novel framework named HIPPo, which eliminates the need for curated CAD models and reference images by harnessing image-to-3D priors from Diffusion Models, enabling model-free zero-shot 6D pose estimation. Specifically, we construct HIPPo Dreamer, a rapid image-to-mesh model built on a multiview Diffusion Model and a 3D reconstruction foundation model. Our HIPPo Dreamer can generate a 3D mesh of any unseen object from a single glance in just a few seconds. Then, as more observations are acquired, we propose to continuously refine the diffusion prior mesh model by joint optimization of object geometry and appearance. This is achieved by a measurement-guided scheme that gradually replaces the plausible diffusion priors with more reliable online observations. Consequently, HIPPo can instantly estimate and track the 6D pose of a novel object and maintain a complete mesh for immediate robotic applications. Thorough experiments on various benchmarks show that HIPPo outperforms state-of-the-art methods in 6D object pose estimation when prior reference images are limited.

Abstract:
To navigate crowds without collisions, robots must interact with humans by forecasting their future motion and reacting accordingly. While learning-based prediction models have shown success in generating likely human trajectory predictions, integrating these stochastic models into a robot controller presents several challenges. The controller needs to account for interactive coupling between planned robot motion and human predictions while ensuring both predictions and robot actions are safe (i.e. collision-free). To address these challenges, we present a receding horizon crowd navigation method for single-robot multi-human environments. We first propose a diffusion model to generate joint trajectory predictions for all humans in the scene. We then incorporate these multi-modal predictions into a SICNav Bilevel MPC problem that simultaneously solves for a robot plan (upper-level) and acts as a safety filter to refine the predictions for non-collision (lower-level). Combining planning and prediction refinement into one bilevel problem ensures that the robot plan and human predictions are coupled. We validate the open-loop trajectory prediction performance of our diffusion model on the commonly used ETH/UCY benchmark and evaluate the closed-loop performance of our robot navigation method in simulation and extensive real-robot experiments demonstrating safe, efficient, and reactive robot motion.

Abstract:
In this article, we present a perception-based full-stack system for autonomous vehicle following that does not rely on accurate global localization or map data. Our architecture consists of modules for vehicle communication, localization, object tracking, waypoint management, static environment modeling, trajectory planning, and control, all of which are covered in the article. To test our system, we conducted several practical experiments in various scenarios on our two autonomous vehicles. Those experiments include the handling of static and dynamic obstacles, driving on- and off-road under different light and weather conditions with distances between the vehicles ranging from 5 m to 100 m and with speeds of up to 20 m/s. Furthermore, we showcased our system's performance during the 12th European Land Robot Trial 2024, where our institute participated in the convoying scenario. The tests from the trial and our own experiments showed satisfactory results. Our system achieves a high path-following accuracy and is able to cope with various challenging scenarios.

Abstract:
One of the primary reasons robotic apple harvesting is a challenging manipulation problem is the cluttered tree canopy. An effective harvesting gripper should i) be compact to minimize collisions with the canopy, ii) offer a compliant grasp to prevent bruising, and iii) hold the fruit securely to counteract forces during picking. Much of the prior work has used single-mode grippers (suction or fingers), which are often compliant but have low grasp strength (suction), or have a strong grasp but a large form factor (fingers). We present a compact robotic gripper that combines the benefits of both. It first uses an array of soft suction cups to gently attach to the fruit, then deploys three telescoping fingers that sweep away obstacles and pivot inward to secure the grasp. We analyze the finger design for its ability to sweep clutter and maintain a tight grasp, and we measure grasp strength across suction-only, fingers-only, and combined (tandem) actuation modes. Tandem mode consistently provides a grasp that can counter typically observed fruit detachment forces. Using an apple proxy, we test the gripper's performance in cluttered scenarios, achieving over 96% pick success with an ideal controller. Finally, we validate the gripper in a commercial apple orchard, achieving an 81% pick success rate.

Abstract:
Training robust bimanual manipulation policies via imitation learning requires demonstration data with broad coverage over robot poses, contacts, and scene contexts. However, collecting diverse and precise real-world demonstrations is costly and time-consuming, which hinders scalability. Prior works have addressed this with data augmentation, typically for either eye-in-hand (wrist camera) setups with RGB inputs or for generating novel images without paired actions, leaving augmentation for eye-to-hand (third-person) RGB-D training with new action labels less explored. In this paper, we propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA), an offline imitation learning data augmentation method that fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses with up to four camera viewpoints. Our approach simultaneously generates corresponding joint-space action labels while employing constrained optimization to enforce physical consistency through appropriate gripper-to-object contact constraints in bimanual scenarios. We evaluate our method on 5 simulated and 3 real-world tasks. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations, showing its potential for scalable RGB and RGB-D data augmentation in eye-to-hand bimanual manipulation. Our project website is available at: https://ropaaug.github.io/.

Abstract:
Human drivers adeptly navigate complex scenarios by utilizing rich attentional semantics, but current autonomous systems struggle to replicate this ability, as they often lose critical semantic information when converting 2D observations into 3D space. This loss hinders their effective deployment in dynamic and complex environments. Leveraging the superior scene understanding and reasoning abilities of Vision-Language Models (VLMs), we propose VLM-E2E, a novel framework that uses VLMs to enhance training by providing attentional cues. Our method integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision, which enables the model to learn richer feature representations that explicitly capture the driver's attentional semantics. By focusing on attentional semantics, VLM-E2E better aligns with human-like driving behavior, which is critical for navigating dynamic and complex environments. Furthermore, we introduce a BEV-Text learnable weighted fusion strategy to address the issue of modality importance imbalance in fusing multimodal information. This approach dynamically balances the contributions of BEV and text features, ensuring that the complementary information from visual and textual modalities is effectively utilized. By explicitly addressing the imbalance in multimodal fusion, our method facilitates a more holistic and robust representation of driving environments. We evaluate VLM-E2E on the nuScenes dataset and achieve significant improvements in perception, prediction, and planning over the baseline end-to-end model, showcasing the effectiveness of our attention-enhanced BEV representation in enabling more accurate and reliable autonomous driving tasks.
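
One simple way to realize a learnable weighted fusion of two modalities is to keep one scalar parameter per modality and normalize the pair with a softmax. The sketch below assumes that form; the abstract does not specify VLM-E2E's actual fusion layer, and in training the two scalars would be updated by gradient descent rather than fixed as they are here. All names are hypothetical.

```python
import math
from typing import List, Tuple


def fusion_weights(w_bev: float, w_text: float) -> Tuple[float, float]:
    """Softmax over two scalar parameters, giving weights that sum to 1."""
    m = max(w_bev, w_text)
    e_bev, e_text = math.exp(w_bev - m), math.exp(w_text - m)
    total = e_bev + e_text
    return e_bev / total, e_text / total


def fuse(bev: List[float], text: List[float], w_bev: float, w_text: float) -> List[float]:
    """Element-wise convex combination of BEV and text feature vectors."""
    a_bev, a_text = fusion_weights(w_bev, w_text)
    return [a_bev * b + a_text * t for b, t in zip(bev, text)]


# With equal parameters the fusion reduces to a plain average.
fused = fuse([1.0, 3.0], [3.0, 5.0], w_bev=0.0, w_text=0.0)
```

Raising one parameter relative to the other shifts the output toward that modality, which is how such a layer can correct an imbalance between the two inputs.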

Abstract:
Visual Place Recognition (VPR) localizes a query image by matching it against a database of geo-tagged reference images, making it essential for navigation and mapping in robotics. Although Vision Transformer (ViT) solutions deliver high accuracy, their large models often exceed the memory and compute budgets of resource-constrained platforms such as drones and mobile robots. To address this issue, we propose TeTRA, a ternary transformer approach that progressively quantizes the ViT backbone to 2-bit precision and binarizes its final embedding layer, offering substantial reductions in model size and latency. A carefully designed progressive distillation strategy preserves the representational power of a full-precision teacher, allowing TeTRA to retain or even surpass the accuracy of uncompressed convolutional counterparts, despite using fewer resources. Experiments on standard VPR benchmarks demonstrate that TeTRA reduces memory consumption by up to 69% compared to efficient baselines, while lowering inference latency by 35%, with either no loss or a slight improvement in recall@1. These gains enable high-accuracy VPR on power-constrained, memory-limited robotic platforms, making TeTRA an appealing solution for real-world deployment.
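
The two compression steps named above can be sketched in a few lines. Ternary weights take values in {-1, 0, +1}, which fit in 2 bits, and binarized embeddings can be compared with a Hamming distance. The threshold rule below (0.7 times the mean absolute weight) is borrowed from the ternary weight network literature as an assumption; TeTRA's own quantizer and distillation schedule are not described in the abstract, and all names are hypothetical.

```python
from typing import List


def ternarize(weights: List[float], ratio: float = 0.7) -> List[int]:
    """Map each weight to -1, 0, or +1 using a magnitude threshold."""
    delta = ratio * sum(abs(w) for w in weights) / len(weights)
    return [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]


def binarize(embedding: List[float]) -> List[int]:
    """Keep only the sign of each embedding dimension as one bit."""
    return [1 if v > 0 else 0 for v in embedding]


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions where two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


ternary = ternarize([0.9, -0.05, -0.8, 0.1])
query_bits = binarize([0.3, -0.2, 0.0, 1.5])
distance = hamming(query_bits, [1, 1, 0, 0])
```

Retrieval over binary codes then reduces to counting differing bits, which is far cheaper than floating-point similarity on embedded hardware.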

Abstract:
In cross-domain visual understanding tasks, models often achieve strong performance on the source domain but suffer severe degradation when applied to target domains with substantial distribution shifts. This challenge is particularly prominent under the zero-shot domain adaptation setting, where adaptation must be achieved without access to target-domain samples and instead relies on language guidance to bridge the gap. However, existing approaches typically depend on fixed class names or handcrafted prompt templates, which fail to capture fine-grained semantic attributes present in the target domain. Moreover, the insufficient alignment between visual and linguistic modalities further constrains the transferability of semantic knowledge. To address these issues, we propose an attribute-driven cross-modal feature modulation framework, termed Language-guided Attribute alignment and Semantic Consistency (LASC). On the semantic side, we introduce an attribute-driven prompt generation module that dynamically combines category information with domain-relevant attributes to construct adaptive text prompts, which are aligned with visual features through cross-modal attention for enhanced semantic stability. Furthermore, we incorporate a semantic consistency constraint, where a memory bank enforces intra-class compactness and inter-class separation, ensuring robust discriminability across domains. Extensive experiments demonstrate that our approach achieves significant improvements over state-of-the-art baselines on multiple cross-domain benchmarks, and maintains strong adaptation ability without requiring any target-domain data.

Abstract:
To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.
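
The abstract does not specify how conformal prediction is applied to messages, so the sketch below shows only standard split conformal prediction as a point of reference. The nonconformity score (one minus the model's confidence), the candidate labels, and the function names are assumptions, not details of CommCP.

```python
import math
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores.

    With n exchangeable calibration scores, thresholding at the
    ceil((n + 1) * (1 - alpha))-th smallest score gives prediction sets
    that contain the true answer with probability at least 1 - alpha.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(np.sort(cal_scores)[k - 1])

def prediction_set(candidate_probs, qhat):
    """Keep every candidate whose score (1 - confidence) is within qhat."""
    return [label for label, p in candidate_probs.items() if 1.0 - p <= qhat]

# Hypothetical calibration scores and candidate answers.
qhat = conformal_threshold([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], alpha=0.25)
kept = prediction_set({"kitchen": 0.5, "bedroom": 0.45, "garage": 0.05}, qhat)
```

A sender could, for example, share a message only when the calibrated set is small, which is one plausible way to limit distracting communication.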

Abstract:
Object-Goal Navigation (ObjectNav) requires an embodied agent to search for and reach a target object category in previously unseen environments using only onboard egocentric observations, which is a fundamental capability for long-horizon autonomous robots. Current Object-Goal Navigation methods typically discard environmental knowledge after each episode, limiting their ability to operate autonomously over long horizons. To overcome this limitation, we introduce DIPP, a diffusion-based potential planner that unifies navigation and mapping. DIPP generates two complementary potential fields: a navigation potential that directs the agent toward the target and a topological potential that captures the environment's structural skeleton. The topological potential serves a dual purpose: it acts as an implicit structural prior for waypoint selection when fused directly with the navigation potential and, more importantly, enables the incremental construction of a persistent, explicit topological graph. This graph enables a hierarchical policy to select strategic, long-horizon waypoints, elevating planning from a tactical search to a strategic decision. We evaluate DIPP in the Habitat simulator on the Gibson dataset. Results show that DIPP achieves strong performance on standard ObjectNav metrics (SR, SPL) while constructing structurally accurate maps, evidenced by a high Node Recall score. Furthermore, leveraging the explicit persistent graph for hierarchical planning significantly boosts navigation performance. These findings demonstrate the effectiveness of DIPP in enabling embodied agents to build and exploit persistent spatial knowledge for long-term operation in unseen environments.

Abstract:
Surface defects are a primary source of yield loss in manufacturing, yet existing anomaly detection methods often fail in real-world deployment due to limited and unrepresentative datasets. To overcome this, we introduce 3D-ADAM, a 3D Anomaly Detection in Additive Manufacturing dataset, that is the first large-scale, industry-relevant dataset for RGB+3D surface defect detection in additive manufacturing. 3D-ADAM comprises 14,120 high-resolution scans of 217 unique parts, captured with four industrial depth sensors, and includes 27,346 annotated defects across 12 categories along with 27,346 annotations of machine element features in 16 classes. 3D-ADAM is captured in a real industrial environment and as such reflects real production conditions, including variations in part placement, sensor positioning, lighting, and partial occlusion. Benchmarking state-of-the-art models demonstrates that 3D-ADAM presents substantial challenges beyond existing datasets. Validation through expert labelling surveys with industry partners further confirms its industrial relevance. By providing this benchmark, 3D-ADAM establishes a foundation for advancing robust 3D anomaly detection capable of meeting manufacturing demands. We provide our dataset for accessibility at: https://huggingface.co/datasets/pmchard/3D-ADAM

Abstract:
Shared control combines human intention with autonomous decision-making. At the low level, the primary goal is to maintain safety regardless of the user's input to the system. However, existing shared control methods (based on, e.g., Model Predictive Control, Control Barrier Functions, or learning-based control) often face challenges with feasibility, scalability, and mixed constraints. To address these challenges, we propose a Constraint-Aware Assistive Controller that computes control actions online while ensuring recursive feasibility, strict constraint satisfaction, and minimal deviation from the user's intent. It also accommodates a structured class of non-convex constraints common in real-world settings. We leverage Robust Controlled Invariant Sets for recursive feasibility and a Mixed-Integer Quadratic Programming formulation to handle non-convex constraints. We validate the approach through a large-scale user study with 66 participants, one of the most extensive in shared control research, using a simulated environment to assess task load, trust, and perceived control, in addition to performance. The results show consistent improvements across all these aspects without compromising safety and user intent. Additionally, a real-world experiment on a robotic manipulator demonstrates the framework's applicability under bounded disturbances, ensuring safety and collision-free operation.

Abstract:
General robot manipulation requires the handling of previously unseen objects. Learning a physically accurate model at test time can provide significant benefits in data efficiency, predictability, and reuse between tasks. Tactile sensing can complement vision with its robustness to occlusion, but its temporal sparsity necessitates careful online exploration to maintain data efficiency. Direct contact can also cause an unrestrained object to move, requiring both shape and location estimation. In this work, we propose a learning and exploration framework that uses only tactile data to simultaneously determine the shape and location of rigid objects with minimal robot motion. We build on recent advances in contact-rich system identification to formulate a loss function that penalizes physical constraint violation without introducing the numerical stiffness inherent in rigid-body contact. Optimizing this loss, we can learn cuboid and convex polyhedral geometries with less than 10s of randomly collected data after first contact. Our exploration scheme seeks to maximize Expected Information Gain and results in significantly faster learning in both simulated and real-robot experiments. More information can be found at: https://dairlab.github.io/activetactile

Abstract:
Origami-inspired robots offer rapid, accessible design and manufacture with diverse functionalities. In particular, origami robots without conventional electronics have the unique advantage of functioning in extreme environments such as ones with high radiation or large magnetic fields. However, the absence of sophisticated control systems limits these robots to simple autonomous behaviors. In our previous studies, we developed a printable, electronics-free, and self-sustained oscillator that generates simple complementary square-wave signals. Our study presents a quadrature oscillation system capable of generating four square-wave signals a quarter-cycle out of phase, enabling four distinct states. Such control signals are important in various engineering and robotics applications, such as orchestrating limb movements in bio-inspired robots. We demonstrate the practicality and value of this oscillation system by designing and constructing an origami crawling robot that utilizes the quadrature oscillator to achieve coordinated locomotion. Together, the oscillator and robot illustrate the potential for more complex control and functions in origami robotics, paving the way for more electronics-free, rapid-design origami robots with advanced autonomous behaviors.
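
The timing relationship described above, four square waves offset by a quarter cycle that together define four distinct states, can be checked with an idealized software model. This sketch says nothing about the printable oscillator hardware itself: the 50% duty cycle, the normalized period, and the function name are assumptions.

```python
def quadrature_signals(t, period=1.0):
    """Return four idealized square-wave levels (0 or 1) at time t.

    Signal k is signal 0 delayed by k quarter-periods; each signal is
    assumed to be high for half of the period (50% duty cycle).
    """
    phase = (t / period) % 1.0
    return tuple(int(((phase - 0.25 * k) % 1.0) < 0.5) for k in range(4))

# Sample once in the middle of each quarter of the cycle.
states = [quadrature_signals(t) for t in (0.125, 0.375, 0.625, 0.875)]
```

Under these assumptions the four sampled states are all different, and signals 0 and 2 (likewise 1 and 3) are always complementary, which matches the complementary pairs produced by the earlier two-signal oscillator.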

Abstract:
In multi-agent collaborative sensing systems, substantial communication overhead from information exchange significantly limits scalability and real-time performance, especially in bandwidth-constrained environments. This often results in degraded performance and reduced reliability. To address this challenge, we propose WaveComm, a wavelet-based communication framework that drastically reduces transmission loads while preserving sensing performance in low-bandwidth scenarios. The core innovation of WaveComm lies in decomposing feature maps using Discrete Wavelet Transform (DWT), transmitting only compact low-frequency components to minimize communication overhead. High-frequency details are omitted, and their effects are reconstructed at the receiver side using a lightweight generator. A Multi-Scale Distillation (MSD) Loss is employed to optimize the reconstruction quality across pixel, structural, semantic, and distributional levels. Experiments on the OPV2V and DAIR-V2X datasets for LiDAR-based and camera-based perception tasks demonstrate that WaveComm maintains state-of-the-art performance even when the communication volume is reduced to 86.3% and 87.0% of the original, respectively. Compared to existing approaches, WaveComm achieves competitive improvements in both communication efficiency and perception accuracy. Ablation studies further validate the effectiveness of its key components.
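
The idea of transmitting only the low-frequency wavelet band can be illustrated with a single-level 2D Haar transform. This is a sketch under stated assumptions rather than WaveComm itself: the Haar wavelet, the single decomposition level, and the zero-filled reconstruction are stand-ins (the abstract describes a learned lightweight generator at the receiver), and the function names are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an even-sized array.

    Returns the low-frequency band and a tuple of three detail bands.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    low = (a + b + c + d) / 2.0
    details = ((a - b + c - d) / 2.0,
               (a + b - c - d) / 2.0,
               (a - b - c + d) / 2.0)
    return low, details

def haar_idwt2(low, details):
    """Inverse of haar_dwt2."""
    d1, d2, d3 = details
    out = np.empty((2 * low.shape[0], 2 * low.shape[1]))
    out[0::2, 0::2] = (low + d1 + d2 + d3) / 2.0
    out[0::2, 1::2] = (low - d1 + d2 - d3) / 2.0
    out[1::2, 0::2] = (low + d1 - d2 - d3) / 2.0
    out[1::2, 1::2] = (low - d1 - d2 + d3) / 2.0
    return out

def receive_low_only(low):
    """Reconstruct from the low band alone, zero-filling the detail bands.

    A learned generator would replace this zero-fill in a real system.
    """
    zeros = np.zeros_like(low)
    return haar_idwt2(low, (zeros, zeros, zeros))

# Hypothetical feature map standing in for one channel of a sensing feature.
feature = np.random.default_rng(0).normal(size=(4, 6))
low, details = haar_dwt2(feature)
approx = receive_low_only(low)
```

With one decomposition level the sender transmits a quarter of the original values; further levels would shrink the payload again at the cost of more detail for the receiver to restore.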

Abstract:
Flexible tendon-driven multi-segment robotic bronchoscopes can reach peripheral lung regions for minimally invasive diagnosis and therapy. However, long tendon transmissions introduce friction, elasticity, and backlash, which couple the motion of adjacent segments and reduce operational accuracy and safety. This paper proposes a hybrid model-learning decoupled control framework for a two-segment bronchoscope that explicitly cancels distal-to-proximal coupling while compensating transmission disturbances. The method learns online a pose-dependent coupling map from synchronized encoder and electromagnetic measurements and uses it for feedforward cancellation in the proximal channel. In addition, an adaptive disturbance compensation module estimates per-tendon compliance and backlash to correct stretch and dead-zone effects. Experiments on a two-segment tendon-driven robotic bronchoscope platform demonstrated a substantial reduction in proximal drift during distal actuation. At a 90° distal bend, the mean proximal coupling angle was 5.84°. Compared with the most commonly used piecewise constant curvature model baseline, the proposed controller achieved stronger motion decoupling, reducing the coupling rate by 86.47%, thereby enabling more precise bronchoscopic manipulation in anatomically constrained environments.

Abstract:
Successfully solving long-horizon manipulation tasks remains a fundamental challenge. These tasks involve extended action sequences and complex object interactions, presenting a critical gap between high-level symbolic planning and low-level continuous control. To bridge this gap, two essential capabilities are required: robust long-horizon task planning and effective goal-conditioned manipulation. Existing task planning methods, including traditional and LLM-based approaches, often exhibit limited generalization or sparse semantic reasoning. Meanwhile, image-conditioned control methods struggle to adapt to unseen tasks. To tackle these problems, we propose SAGE, a novel framework for Scene Graph-Aware Guidance and Execution in Long-Horizon Manipulation Tasks. SAGE utilizes semantic scene graphs as a structural representation for scene states. A structural scene graph enables bridging task-level semantic reasoning and pixel-level visuo-motor control. This also facilitates the controllable synthesis of accurate, novel sub-goal images. SAGE consists of two key components: (1) a scene graph-based task planner that uses VLMs and LLMs to parse the environment and reason about physically-grounded scene state transition sequences, and (2) a decoupled structural image editing pipeline that controllably converts each target sub-goal graph into a corresponding image through image inpainting and composition. Extensive experiments have demonstrated that SAGE achieves state-of-the-art performance on distinct long-horizon tasks.

Abstract:
Recently, biped robot walking technology has developed significantly, but mainly under blind walking schemes. To emulate human walking, robots need to step accurately on the positions they see in unknown spaces. In this paper, we present PolyMap, a perception-based locomotion planning framework for humanoid robots to climb stairs. Our core idea is to build a real-time polygonal staircase plane semantic map, followed by a footstep planner that uses these polygonal plane segments. Plane segmentation and visual odometry are performed through multi-sensor fusion (LiDAR, RGB-D camera, and IMUs). The proposed framework is deployed on an NVIDIA Orin and produces whole-body motion planning output at 20-30 Hz. Both indoor and outdoor real-scene experiments indicate that our method is efficient and robust for humanoid robot stair climbing.

Abstract:
SLAM methods based on 3D Gaussian Splatting (3DGS) have demonstrated impressive tracking and mapping performance, but typically require additional geometric information from external depth sensors. Meanwhile, recent SLAM systems that leverage geometric priors from pre-trained feed-forward models enable real-time dense reconstruction, yet often discard original RGB information during optimization, thus degrading overall reconstruction quality. We present GeoGS-SLAM, an online monocular dense reconstruction system that combines the 3DGS-based map representation with learned geometric priors. Given uncalibrated RGB input, we first employ a feed-forward visual geometry model to predict camera and scene priors. The Gaussian scene map is then expanded by directly sampling Gaussian primitives from both RGB input and geometric priors. Camera poses and the scene map are jointly optimized through a coarse-to-fine strategy that minimizes both photometric and geometric losses. To ensure global consistency, we further incorporate online loop closure detection and pose graph optimization. Extensive experiments across indoor and outdoor benchmarks demonstrate that GeoGS-SLAM achieves superior rendering quality and tracking accuracy compared to state-of-the-art methods while maintaining online real-time performance. Project page: https://rlgao.github.io/geogs_slam.

Abstract:
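In recent years, Vision-Language-Action (VLA) models have revolutionized robot manipulation by seamlessly integrating visual perception, language understanding, and action generation within an end-to-end learning framework. However, because these models are designed to interact directly with the physical world and with humans, their safety is paramount: even small vulnerabilities can lead to catastrophic failures. In this work, we propose the Universal Adversarial Object, a sphere with an optimized surface texture that significantly reduces task success rates when placed within the robot's field of view. Specifically, our method introduces a multi-level attack framework that jointly disrupts trajectory planning, task execution, and action control. We validate our method in both simulated and real-world robotic environments. Experimental results show that the adversarial object, on two representative VLA models (Pi0 and RDT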

Abstract:
Detecting functional grasp poses for tool operation is critical for robots in complex real-world tasks, yet existing methods lack this capability. Key challenges are: 1) Scarce real-world datasets with fine-grained functional labels and task-valid grasp annotations, as their construction requires domain knowledge (making annotation labor-intensive/subjective) and linking poses to tool usage (beyond stability checks); 2) Difficulty in fine-grained functional segmentation, where minimal sub-region differences are overwhelmed by global cues/noise, with 3D model-dependent methods impractical in unstructured settings; 3) Poor 6-DoF grasp alignment with functional regions due to high morphological heterogeneity, as existing methods either fail to balance stability and functional constraints (high-score grasps outside regions) or are limited to low degrees of freedom. To address these, we build the Tool-Grasp Dataset (20 tool categories, 50 scenes, 12,600 RGB-D images, 250M+ 6-DoF annotations) with fine-grained functional labels. We propose ToolGrasp, a two-stage 6-DoF framework: Stage 1's Mask-Guided Grasp Region Segmentation Network (MG-GRSN) leverages tool-specific semantics to output precise functional masks, mitigating intra-tool variability; Stage 2's Quality-Aware MultiModal Grasp Pose Detection Network (QAM-GPDN) uses these masks to constrain predictions, fusing RGB-D features with a quality module to select aligned poses. Experiments show MG-GRSN outperforms baselines by 3.5% (seen) and 5.2% (unseen) in mIoU; QAM-GPDN boosts functional pose AP by 2.89% (seen) and 3.76% (unseen). Real-robot experiments validate real-world effectiveness.

Abstract:
Multi-robot systems can be extremely efficient for accomplishing team-wise tasks by acting concurrently and collaboratively. However, most existing methods either assume static task features or simply replan when environmental changes occur. This paper addresses the challenging problem of coordinating multi-robot systems for collaborative tasks involving dynamic and moving targets. We explicitly model the uncertainty in target motion prediction via Conformal Prediction (CP), while respecting the spatial-temporal constraints specified by Linear Temporal Logics (LTL). The proposed framework (UMBRELLA) combines the Monte Carlo Tree Search (MCTS) over partial plans with uncertainty-aware rollouts, and introduces a CP-based metric to guide and accelerate the search. The objective is to minimize the Conditional Value at Risk (CVaR) of the average makespan. For tasks released online, a receding-horizon planning scheme dynamically adjusts the assignments based on updated task specifications and motion predictions. Spatial and temporal constraints among the tasks are always ensured, and only partial synchronization is required for the collaborative tasks during online execution. Extensive large-scale simulations and hardware experiments demonstrate significant reductions of 23% and 71% in the average makespan and its variance, respectively, compared with static baselines.
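
Because the objective above is a Conditional Value at Risk, a short empirical estimator may help readers unfamiliar with the measure. This is a generic sample-based sketch, not the UMBRELLA planner: the confidence level, the makespan samples, and the function name are hypothetical.

```python
import math
import numpy as np

def empirical_cvar(samples, alpha):
    """Average of the worst (1 - alpha) fraction of samples.

    Larger values are treated as worse, as is natural for a makespan.
    At least one sample is always included in the tail.
    """
    ordered = np.sort(np.asarray(samples, dtype=float))
    k = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    return float(ordered[-k:].mean())

# Hypothetical makespans (seconds) from eight simulated rollouts of one plan.
makespans = [10, 12, 14, 16, 18, 20, 30, 50]
risk = empirical_cvar(makespans, alpha=0.75)
mean = float(np.mean(makespans))
```

Minimizing this tail average rather than the plain mean favors plans whose worst outcomes are milder, which is consistent with the reported reduction in makespan variance.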

Abstract:
In Social Robot Navigation (SRN), the availability of meaningful metrics is crucial for evaluating trajectories from human-robot interactions. In the SRN context, such interactions often relate to resolving conflicts between two or more agents. Correspondingly, the shares to which agents contribute to the resolution of such conflicts are important. This paper builds on recent work, which proposed a Responsibility metric capturing such shares. We extend this framework in two directions: First, we model the conflict buildup phase by introducing a time normalization. Second, we propose the related Engagement metric, which captures how the agents' actions intensify a conflict. In a comprehensive series of simulated scenarios with dyadic, group and crowd interactions, we show that the metrics carry meaningful information about the cooperative resolution of conflicts in interactions. They can be used to assess behavior quality and foresightedness. We extensively discuss applicability, design choices and limitations of the proposed metrics.

Abstract:
Continuum robots enable dexterous manipulation in constrained environments, but require accurate and efficient models for real-time manipulation and control. Traditional physics-based models can be computationally expensive and may suffer from inaccuracies due to unmodeled effects, while current learning-based methods often generalize poorly beyond the specific robot on which they are trained. We present a formulation of surrogate modeling for tendon-driven continuum robots as an operator learning problem that maps robot design parameters and tendon actuation inputs to resulting configurations. This formulation enables a single trained model to generalize across a large class of robot designs. We develop four novel neural operator architectures--two based on Deep Operator Networks (DeepONets) and two based on Fourier Neural Operators (FNOs)--and train them on simulation data to predict robot configurations. All architectures achieve good accuracy while allowing for fast and accurate generalization across designs. Our results demonstrate that operator learning provides an effective and generalizable surrogate for continuum robot mechanics in the design space, enabling fast modeling for control, planning, and design optimization in surgical and industrial applications.

Abstract:
Magnetic particle swarms are governed by rich nonlinear collective dynamics that complicate predictive, feedback-based control in biomedical microrobotics. We develop a physics-based reduced-order ellipse model that describes the swarm morphology by its principal radii (r1, r2). At steady state, these radii depend explicitly on magnetic-field curvature, axial gradients, and actuation angular velocity through anisotropic stiffness terms. Model parameters are identified experimentally, yielding low validation errors (RMSE: 0.25 mm for r1 and 0.42 mm for r2) and revealing pronounced stiffness anisotropy (Sx/Sy ≈ 0.16). The resulting formulation provides compact, interpretable equations that enable tractable control design and feedback regulation of magnetic particle swarms.

Abstract:
Tactile sensing is essential for enabling dexterous robotic manipulation, yet estimating contact states such as location and force from high-dimensional sensor measurements remains challenging due to noise and complex nonlinear mappings between raw signals and physical interaction states. In this work, we propose a physics-informed contact modeling framework that combines the flexibility of deep models with inductive biases from physical modeling. Focusing on electrical impedance tomography (EIT) tactile skins, our approach incorporates knowledge of the EIT forward model by regularizing neural estimators with a latent-space consistency constraint, stabilizing the ill-posed inverse mapping from voltages to contact states. To support robust training and evaluation, we also develop a high-fidelity simulation pipeline that incorporates key hardware imperfections to better bridge the sim-to-real gap. We benchmark multiple architectures (including multilayer perceptrons, convolutional networks, Transformer-based models, and autoencoder regressors) on both real and synthetic datasets. Results show that the proposed hybrid approach consistently improves estimation accuracy, particularly for force prediction, and generalizes across domains. These findings highlight the value of embedding physical priors into learning pipelines for reliable tactile state estimation in robotic manipulation.

Abstract:
A key challenge in modern robotics is to adapt to changing environments, a challenge that is exacerbated when simulations cannot encompass every possible real-world configuration, and therefore Reinforcement Learning (RL) in the physical world becomes necessary. Continual RL provides the tools to address this challenge; however, both the frameworks and the methods remain underexplored. Autonomous Racing (AR) and in particular the RoboRacer competition provide a testing ground for such methods, as learning to drive on a new track-floor combination with the least amount of new experience naturally frames a continual learning problem. This work aims to address this gap by proposing a continual RL framework based on Continual Backpropagation (CBP) that is able, with only real-world data, to train a generalist policy on a set of tracks and then fine-tune it within 15 minutes to outperform classical controllers. Furthermore, a comparison method based on offline RL is proposed, and a simulation analysis of the plasticity properties of the methods is conducted.

Abstract:
The optimal design of robotic actuators is a critical area of research, yet limited attention has been given to optimizing gearbox parameters and automating actuator CAD. This paper introduces COMPAct: Computational Optimization and Automated Modular Design of Planetary Actuators, a framework that systematically identifies optimal gearbox parameters for a given motor across four gearbox types, single-stage planetary gearbox (SSPG), compound planetary gearbox (CPG), Wolfrom planetary gearbox (WPG), and double-stage planetary gearbox (DSPG). The framework minimizes mass and actuator width while maximizing efficiency, and further automates actuator CAD generation to enable direct 3D printing without manual redesign. Using this framework, optimal gearbox designs are explored across a wide range of gear ratios, providing insights into the suitability of different gearbox types while automatically generating CAD models for all four gearbox types with varying gear ratios and motors. Two actuator types are fabricated and experimentally evaluated through power efficiency, no-load backlash, and transmission stiffness tests. Experimental results indicate that the SSPG actuator achieves a mechanical efficiency of 60--80%, a no-load backlash of 0.59 deg, and a transmission stiffness of 242.7 Nm/rad, while the CPG actuator demonstrates 60% efficiency, 2.6 deg backlash, and a stiffness of 201.6 Nm/rad. CODE: https://github.com/singhaman1750/COMPAct.git VIDEO: https://youtu.be/etK6anjXag8

Abstract:
The motion planning problem requires finding a collision-free path between start and goal configurations in high-dimensional, cluttered spaces. Recent learning-based methods offer promising solutions, with self-supervised physics-informed approaches such as Neural Time Fields (NTFields) solving the Eikonal equation to learn value functions without expert demonstrations. However, existing physics-informed methods struggle to scale in complex, multi-room environments, where simply increasing the number of samples cannot resolve local minima or guarantee global consistency. We propose Hierarchical Neural Time Fields (H-NTFields), a weakly-supervised framework that combines weak supervision from sparse roadmaps with physics-informed PDE regularization. The roadmap provides global topological anchors through upper and lower bounds on travel times, while PDE losses enforce local geometric fidelity and obstacle-aware propagation. Experiments on 18 Gibson environments and real robotic platforms show that H-NTFields substantially improves robustness over prior physics-informed methods, while enabling fast amortized inference through a continuous value representation.

Abstract:
New developments in robotics have allowed robots to become very small and capable of completing tasks humans cannot. Current robots capable of achieving this are physically limited in how small they can be without compromising on other aspects such as sensing, strength, or complexity. Thus, we strive to understand how we can more compactly map complex mechanical outputs to a low number of mechanical inputs. This paper presents a novel design for a hyper-redundant robot capable of passive multiplexing. This is achieved using bistable joints at each link, with each link having a different bistable moment in order to establish priority when multiplexing. In doing so, this simple mechanism is able to achieve individual joint control and reach a variety of complex configurations. To demonstrate the proposed robot, we construct an eleven-link mechanism and a four-link mechanism, with which we demonstrate multiplexing as well as high positional accuracy. By simulating the mechanism, we also quantify a geometric relationship between individual links and the robot's overall workspace.

Abstract:
In teleoperation, the human operator typically controls only the end-effector pose, which often leads to self-collisions of the manipulator and collisions with environmental obstacles, since joints and links are not controlled individually. A common strategy to mitigate this issue is to enhance the operator's input using optimal-control-based trajectory planning. As derivative-based solvers require differentiable constraints, existing approaches either approximate robots and obstacles with spheres, reducing geometric accuracy, or approximate derivatives, degrading convergence and increasing computation times. We address these limitations by adapting a recent formulation of differentiable collision-avoidance constraints, based on duality in convex optimization, to the teleoperation setting. The robot is approximated with capsules and the environment with polytopes. We compare the resulting trajectory planning method against state-of-the-art techniques in simulation with varying numbers of obstacles and evaluate it on a UR5e manipulator in a real-world teleoperation test. Results show that our approach achieves lower computation times while enabling more accurate obstacle modeling, leading to smoother and collision-free end-effector teleoperation.

Abstract:
This paper presents a novel method for distribution-free robust trajectory optimization and control of discrete-time, nonlinear, and non-Gaussian stochastic systems, with closed-loop guarantees on chance constraint satisfaction. Our framework employs conformal inference to generate coverage-based confidence sets for the closed-loop dynamics around arbitrary reference trajectories, by constructing a joint nonconformity score to quantify both the validity of contraction (i.e., incremental stability) conditions and the impact of external stochastic disturbance on the closed-loop dynamics, without any distributional assumptions. Via appropriate constraint tightening, chance constraints can be reformulated into tractable, statistically valid deterministic constraints on the reference trajectories. This enables a formal pathway to leverage and validate learning-based motion planners and controllers, such as those with neural contraction metrics, in safety-critical real-world applications. Notably, our statistical guarantees are non-diverging and can be computed with finite samples of the underlying uncertainty, without overly conservative structural priors. We demonstrate our approach in motion planning problems for designing safe, dynamically feasible trajectories in both numerical simulation and hardware experiments.

Abstract:
Vision-based robot manipulation systems often suffer from performance degradation under domain shifts in visual inputs. While data augmentation is commonly employed in reinforcement learning, its application in imitation learning remains relatively underexplored. Our preliminary experiments indicate that simply incorporating augmentation techniques does not yield effective improvements in imitation learning. To address this challenge, we propose a two-stage learning process. First, we develop an adversarial feature learning framework that leverages data augmentation to enhance robustness against domain shifts. Second, we introduce an unsupervised domain adaptation method that adapts models to target environments using only easily collected image data. In robotic tasks, visual domain shifts can often be detected from initial observations alone. Since collecting complete action-labeled episodes in new domains is expensive, adapting with only initial images greatly reduces data collection costs. To this end, we develop an adaptation strategy that relies solely on initial target-domain observations, eliminating the need for labeled demonstrations. Experimental results across both simulation and physical robot implementations demonstrate that our method preserves source domain performance while exhibiting enhanced resilience to visual perturbations, including varying lighting conditions, background modifications, and environmental distractors.

Abstract:
Ensuring accurate and stable state estimation is a challenging task crucial to safety-critical domains such as high-speed autonomous racing, where measurement uncertainty must be both adaptive to the environment and temporally smooth for control. In this work, we develop a learning-based framework, LACE, capable of directly modeling the temporal dynamics of GNSS measurement covariance. We model the covariance evolution as an exponentially stable dynamical system where a deep neural network (DNN) learns to predict the system's process noise from environmental features through an attention mechanism. By using contraction-based stability and systematically imposing spectral constraints, we formally provide guarantees of exponential stability and smoothness for the resulting covariance dynamics. We validate our approach on an AV-24 autonomous racecar, demonstrating improved localization performance and smoother covariance estimates in challenging, GNSS-degraded environments. Our results highlight the promise of dynamically modeling the perceived uncertainty in state estimation problems that are tightly coupled with control sensitivity.

Abstract:
Simulation environments are useful for both robot policy learning and planning verification and validation. Traditionally, the process of creating a simulation was onerous. Creating a bespoke simulation environment for each individual environment that a robot would operate in was simply infeasible. In this work, we introduce PerceptTwin, a fully automatic pipeline that constructs interactive simulations directly from semantic scene representations produced by a robot's perception stack. PerceptTwin combines open-vocabulary object maps with 3D asset generation, affordance prediction, and commonsense condition checking. These interactive simulations can be used to validate and refine plans before they are executed on the robot hardware. Borrowing from the AI alignment literature, we also introduce an LLM judge that verifies plan correctness and alignment with human preferences. Experiments show that PerceptTwin feedback allows LLM planners to refine plans, enhance safety, and resist harmful black-box prompting attacks. In our suite of tasks, PerceptTwin improves plan success by an average of ~39% for GPT5, GPT5Mini, and GPT5Nano planners. Additionally, PerceptTwin improves human plan verification by up to ~18% on average for plans that fail due to unfilled skill preconditions. Our results demonstrate the potential of open-vocabulary scene simulation from robot perception as a foundation for safer, more reliable robot planning.

Abstract:
Manipulation involving rigid-deformable interactions, such as hanging clothes or dressing humans, is essential for household robots. Compared to single-object manipulation or interactions between rigid bodies, these tasks are particularly challenging due to the rich multi-point contacts and the complex dynamics of the deformable bodies during interaction. Therefore, object-centric representations such as 6D poses or structural points without task-specific information become insufficient for these interactions. In this work, we propose a hybrid correspondence-based representation tailored for deformable-rigid interactions. First, we introduce structure-, task-, and interaction-aware sparse keypoints. The keypoints are generated based on the global structures of both rigid and deformable objects, and filtered by their local interaction contacts. However, tracking these sparse keypoints through the interaction remains difficult due to the high-dimensional dynamics of deformable objects. Therefore, we further construct dense correspondences on the deformable objects for accurate keypoint tracking throughout manipulation. This hybrid design combines the advantages of both representations: sparse keypoints encode rich, task-specific information for fine-grained manipulation, while dense correspondences ensure efficient tracking and generalization to novel deformations, shapes, and scenarios. Extensive experiments demonstrate the effectiveness and broad applicability of our method.

Abstract:
Robot-assisted surgery is integral to modern minimally invasive procedures, with automation emerging as the next frontier to enhance precision and reduce surgeon fatigue. This evolution is largely impeded by the inherent kinematic inaccuracies of surgical robots, where unreliable internal sensors lead to significant control errors. While previous methods attempted to mitigate these issues through complex model-based calibration, they often suffer from high cost and limited effectiveness. This work uses a learned policy, trained within a teacher-student framework, to actively compensate for hardware inaccuracies using closed-loop visual feedback. The policy can fuse unreliable internal readings with precise external visual data, allowing it to correct for kinematic errors in real time without needing a perfect physical model. The learned policy was successfully deployed on the da Vinci Research Kit, where experiments validated the fundamental feasibility of using external vision to overcome internal sensor deficits. This research provides a foundational and reliable control methodology, paving the way for more advanced and robust surgical automation.

Abstract:
Outdoor intelligent autonomous robotic operation relies on a sufficiently expressive map of the environment. Classical geometric mapping methods retain essential structural environment information, but lack a semantic understanding and organization to allow high-level robotic reasoning. 3D scene graphs (3DSGs) address this limitation by integrating geometric, topological, and semantic relationships into a multi-level graph-based map. Outdoor autonomous operations commonly rely on terrain information either due to task-dependence or the traversability of the robotic platform. We propose a novel approach that combines indoor 3DSG techniques with standard outdoor geometric mapping and terrain-aware reasoning, producing terrain-aware place nodes and hierarchically organized regions for outdoor environments. Our method generates a task-agnostic metric-semantic sparse map and constructs a 3D Scene Graph from this map for downstream planning tasks, all while remaining lightweight for autonomous robotic operation. Our thorough evaluation demonstrates our 3DSG method performs on par with state-of-the-art camera-based 3DSG methods while remaining memory efficient. We also demonstrate its effectiveness in diverse robotic tasks of object retrieval and region monitoring in both simulation and real-world environments.

Abstract:
Traditional deep reinforcement learning-based visual navigation techniques face challenges in dynamic and unstructured outdoor environments, particularly in the absence of high-resolution maps and GPS signals. This paper presents a deep reinforcement learning-based approach for target-driven visual navigation without explicit localization and mapping in outdoor settings, using the successor feature (SF) framework to enhance the model's transfer learning. This design enables effective knowledge transfer across tasks, allowing the model to adapt to novel environments with zero-shot or few-shot fine-tuning. To facilitate training and evaluation, we design grid-world environments constructed from real-world outdoor images, providing realistic yet controlled conditions for developing and testing deep reinforcement learning-based navigation. Experimental results demonstrate that our method can adapt effectively in outdoor environments, both within the same domain and across different domains. Moreover, despite being trained in a discrete grid-world setting, the model is successfully deployed in real time within the same area, maintaining robust performance and highlighting its strong transferability to continuous, real-world conditions.

Abstract:
We contribute Bi3, a dataset of social robot navigation among groups of people in a constrained lab space. Compared to prior data collection efforts for social robot navigation, our dataset is unique in that it features: an original experiment design giving rise to close navigation encounters between two humans and a robot; five different navigation algorithms; two different robot platforms; a diverse participant pool of 74 people recruited from two sites in the USA and France; multimodal data streams including 10.5 hours of human and robot ground-truth motion tracks, RGB video, and user impressions over robot performance. Our analysis of the collected dataset through metrics like interaction density and human velocity suggests that Bi3 represents a benchmark of unique diversity and modeling complexity. Bi3 contributes towards understanding how humans and robots can productively mesh their activities in constrained environments, and can be a resource for training models of human motion prediction and robot control policies for navigation in densely crowded spaces.

Abstract:
Agile mapless navigation in cluttered 3D environments poses significant challenges for autonomous drones. Conventional mapping-planning-control pipelines incur high computational cost and propagate estimation errors. We present AERO-MPPI, a fully GPU-accelerated framework that unifies perception and planning through an anchor-guided ensemble of Model Predictive Path Integral (MPPI) optimizers. Specifically, we design a multi-resolution LiDAR point-cloud representation that rapidly extracts spatially distributed anchors as look-ahead intermediate endpoints, from which we construct polynomial trajectory guides to explore distinct homotopy path classes. At each planning step, we run multiple MPPI instances in parallel and evaluate them with a two-stage multi-objective cost that balances collision avoidance and goal reaching. Implemented entirely with NVIDIA Warp GPU kernels, AERO-MPPI achieves real-time onboard operation and mitigates the local-minima failures of single-MPPI approaches. Extensive simulations in forests, verticals, and inclines demonstrate sustained reliable flight above 7 m/s, with success rates above 80% and smoother trajectories compared to state-of-the-art baselines. Real-world experiments on a LiDAR-equipped quadrotor with NVIDIA Jetson Orin NX 16G confirm that AERO-MPPI runs in real time onboard and consistently achieves safe, agile, and robust flight in complex cluttered environments. Code is available at https://github.com/XinChen-stars/AERO_MPPI.
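For readers unfamiliar with MPPI, the sketch below shows the standard importance-weighted control update that a single MPPI instance performs. It is a plain-Python illustration, not the AERO-MPPI implementation: the anchor guidance, the two-stage cost, and the Warp GPU kernels are omitted, and the function name and scalar controls are assumptions.

```python
import math

def mppi_update(nominal, perturbations, costs, temperature):
    """One MPPI step: reweight sampled control perturbations by rollout cost.

    nominal:       list of T nominal controls (scalars here for brevity)
    perturbations: K lists of T noise values added to the nominal controls
    costs:         K rollout costs, one per perturbed control sequence
    temperature:   positive scalar; smaller values favor the cheapest rollouts
    """
    min_cost = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - min_cost) / temperature) for c in costs]
    total = sum(weights)
    weights = [w / total for w in weights]
    updated = [
        u + sum(w * eps[t] for w, eps in zip(weights, perturbations))
        for t, u in enumerate(nominal)
    ]
    return updated, weights

# Two hypothetical rollouts: the first is much cheaper, so it dominates.
controls, w = mppi_update(
    nominal=[0.0, 0.0],
    perturbations=[[1.0, 1.0], [-1.0, -1.0]],
    costs=[0.0, 5.0],
    temperature=1.0,
)
print(controls, w)
```

An ensemble as described in the abstract would run several such updates in parallel, each started from a different trajectory guide, and then compare the resulting candidates.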

Abstract:
One of the goals of active information acquisition using multi-robot teams is to keep the relative uncertainty in each region at the same level to maintain identical acquisition quality (e.g., consistent target detection) in all the regions. To achieve this goal, ergodic coverage can be used to assign the number of samples according to the quality of observation, i.e., sampling noise levels. However, the noise levels are unknown to the robots. Although this noise can be estimated from samples, the estimates are unreliable at first and can generate fluctuating values. The main contribution of this paper is to use simulated annealing to generate the target sampling distribution, starting from uniform and gradually shifting to an estimated optimal distribution, by varying the coldness parameter of a Boltzmann distribution with the estimated sampling entropy as energy. Simulation results show a substantial improvement of both transient and asymptotic entropy compared to both uniform and direct-ergodic searches. Finally, a demonstration is performed with a TurtleBot swarm system to validate the physical applicability of the algorithm.
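The annealing idea above can be sketched as follows. This is a minimal illustration under stated assumptions: the sign convention (higher estimated entropy receives more samples), the linear coldness ramp, and all names are hypothetical choices for this example rather than details confirmed by the abstract.

```python
import math

def boltzmann_target(entropies, coldness):
    """Target sampling distribution p_i proportional to exp(coldness * H_i).

    coldness = 0 gives the uniform distribution; larger (non-negative)
    coldness concentrates samples on regions with higher estimated
    entropy (assumed sign convention).
    """
    m = max(entropies)  # shift by the maximum to avoid overflow
    weights = [math.exp(coldness * (h - m)) for h in entropies]
    total = sum(weights)
    return [w / total for w in weights]

def coldness_schedule(step, total_steps, final_coldness):
    """Linear ramp from 0 to final_coldness (hypothetical schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return final_coldness * frac

# Hypothetical per-region entropy estimates.
estimated_entropy = [0.2, 1.0, 0.5]
for step in (0, 5, 10):
    beta = coldness_schedule(step, 10, 4.0)
    dist = boltzmann_target(estimated_entropy, beta)
    print(step, [round(p, 3) for p in dist])
```

Starting at zero coldness keeps the early, unreliable entropy estimates from steering the robots, and their influence grows only as the ramp proceeds.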

Abstract:
Accurate environment perception is fundamental for robust robot navigation, mapping, and interaction. Traditional perception pipelines rely on multiple sensors, including stereo cameras and LiDAR, which impose constraints on cost, payload, and system integration. In this paper, we propose a novel single-image perception framework that unifies novel view synthesis and RGB/segmented LiDAR emulation into a single pipeline. Leveraging monocular depth estimation and camera intrinsics recovery, our approach projects image pixels into 3D space and performs mesh reconstruction to generate dense geometric representations. This enables high-fidelity sensor emulation, including transparent surface reconstruction such as glass - an element often missed by conventional LiDAR. By enriching synthetic LiDAR scans with otherwise unavailable geometry, our method enhances downstream tasks such as robot path planning and obstacle avoidance. This work opens up new possibilities for resource-efficient robotic perception by reducing sensor dependency while improving geometric reasoning.

Abstract:
Modeling concentric tube robots (CTRs) involves complex nonlinear continuum mechanics, and despite recent progress, physics-based models often lack an accurate representation of the experimental setups. To overcome these limitations, deep neural network-based models have been explored as alternatives with superior accuracy; however, they often overlook known mechanics, require large training datasets, and typically discard shape estimation of the robot. We present a physics-informed neural network (PINN) for kinematic modeling of a 6-DoF CTR with three pre-curved tubes that embeds the Cosserat rod differential equations and learns from few-shot observational data, balancing physics priors with data-driven fitting. The PINN enables full-state estimation of shape, twist angle, torsional strain, bending moment, and orientation. Benchmark tests show that the model achieves a mean shape error below 1% of the robot length and accurately recovers the other kinematic states, outperforming a purely physics-based Cosserat rod model baseline while using a minimal training set. The resulting model is also computationally efficient and robust, making it well-suited for real-time control applications.

Abstract:
Continuum robots exhibit high-dimensional, nonlinear dynamics which are often coupled with their actuation mechanism. Spectral submanifold (SSM) reduction has emerged as a leading method for reducing high-dimensional nonlinear dynamical systems to low-dimensional invariant manifolds. Our proposed control-augmented SSMs (caSSMs) extend this methodology by explicitly incorporating control inputs into the state representation, enabling these models to capture nonlinear state-input couplings. Training these models relies solely on controlled decay trajectories of the actuator-augmented state, thereby removing the additional actuation-calibration step commonly needed by prior SSM-for-control methods. We learn a compact caSSM model for a tendon-driven trunk robot, enabling real-time control and reducing open-loop prediction error by 40% compared to existing methods. In closed-loop experiments with model predictive control (MPC), caSSM reduces tracking error by 52%, demonstrating improved performance against Koopman- and SSM-based MPC and practical deployability on hardware continuum robots.

Abstract:
High-quality LiDAR point cloud (LPC) compression is essential for the storage and transmission of 3D data. The octree-structured entropy codec has emerged as the predominant method; however, previous methods do not fully utilize spatial contextual information, due to the loss of local features caused by uneven scanning density. To address this problem, we propose OctHilNet, a novel Hilbert-guided hierarchical framework for LPC compression that introduces the polarized octree for efficient node organization and the serialize-driven entropy model to strengthen the continuity of node contexts. Specifically, to counteract the inherent density imbalance, OctHilNet first transforms points into polar coordinates and applies a non-linear rebalancing to the radial distance. Then, we introduce the Hilbert space-filling curve to mitigate the impact of the decoupling between sequential adjacency and geometric proximity in octree node sequences. Finally, to better capture fine-grained spatial correlations, we propose LocAtten and NeighbConv modules in a hierarchical Transformer, which jointly strengthen local dependencies overlooked by standard self-attention. Compared to the previous state-of-the-art works, our method achieves 45.1%-50.1% and 51.9%-53.9% BD-Rate gains on the LPC benchmark SemanticKITTI and MPEG-specified Ford datasets, respectively. In particular, our OctHilNet allows for extension to downstream tasks (i.e., vehicle detection and semantic segmentation), further demonstrating the practicality of the method.
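To make the role of the Hilbert curve concrete, the sketch below orders grid cells by Hilbert index so that cells adjacent in the sequence are also adjacent in space. It uses the classic 2D construction purely for illustration; OctHilNet operates on 3D octree nodes, and the function names here are assumptions rather than the paper's code.

```python
def _rotate(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the curve keeps a consistent orientation."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def hilbert_index(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its
    position along the 2D Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d

# Serialize all cells of a 4x4 grid along the curve.
cells = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(cells, key=lambda c: hilbert_index(4, c[0], c[1]))
print(ordered)
```

Because consecutive entries in `ordered` always differ by one grid step, a sequence model reading nodes in this order sees spatial neighbors as sequence neighbors, which is the coupling between adjacency and proximity that the abstract refers to.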

Abstract:
As a robot's operational environment and tasks to perform within it grow in complexity, the explicit specification and balancing of optimization objectives to achieve a preferred behavior profile moves increasingly out of reach. These systems benefit strongly from being able to align their behavior to reflect human preferences and respond to corrections, but manually encoding this feedback is infeasible. Active preference learning (APL) learns human reward functions by presenting trajectories for ranking. However, existing methods sample from fixed trajectory sets or replay buffers that limit query diversity and often fail to identify informative comparisons. We propose CRED, a novel trajectory generation method for APL that improves reward inference by jointly optimizing environment design and trajectory selection to efficiently query and extract preferences from users. CRED "imagines" new scenarios through environment design and leverages counterfactual reasoning (sampling possible rewards from its current belief and asking "What if this were the true preference?") to generate trajectory pairs that expose differences between competing reward functions. Comprehensive experiments and a user study show that CRED significantly outperforms state-of-the-art methods in reward accuracy and sample efficiency and receives higher user ratings.

Abstract:
Autonomous navigation that relies on precise metric maps is inherently fragile to environmental changes and mapping inaccuracies. These discrepancies often lead to failures in localization and path planning, as the robot's internal representation of the world no longer matches reality. We propose an alternative navigation approach that instead focuses on how a robot interacts with its surroundings rather than its precise metric position. Our core contribution is a learned behavioral vocabulary conditioned on raw sensor data that can be used to compose plans for navigation. Our system transforms LiDAR data into low-dimensional learned embeddings which are clustered to create a set of abstract, human-interpretable behaviors (e.g., along wall, exiting intersection, on bridge). This representation allows the robot to control its behavior with respect to the embedding rather than controlling its state with respect to a specific metric cost function or waypoint, thereby minimizing the impact of map and position inaccuracies. We define the mission as a topological sequence of behavioral clusters on the overhead map, enabling high-level navigation. This approach provides a robust way to decompose the environment into recognizable and actionable states that can reliably compose a plan, even on stale maps with environmental deformations and world changes. Our method achieves higher navigation success under intentional map distortions, with average mission success rates 53 and 55 percentage points higher for short- and long-term plans, respectively, when compared to baselines which rely on accurate metric maps.

Abstract:
Evaluating the safety of autonomous vehicles requires simulation of safety-critical scenarios such as potential collisions, which are difficult to reproduce in real-world environments. Prior methods rely on future trajectory predictions and heuristically select adversarial agents based on spatial proximity to the ego vehicle, often producing unrealistic scenarios that misalign with real-world temporal dynamics and contextual risk. To address these issues, we propose CRASH, the first learning-based adversarial agent selection approach that operates solely on past and present observations. It comprises two key components: (1) a Motion-Aware Masking (MAM) module that filters out static agents unlikely to collide with the ego vehicle due to negligible movement, and (2) an Adversarial agent Selection Module (ASM) that models contextual interactions to probabilistically estimate each agent's likelihood of inducing a collision with the ego vehicle. Experiments on the nuScenes and Waymo datasets demonstrate that CRASH significantly improves the success rate of generating realistic collision scenarios under both replay and rule-based planners, validating the effectiveness of context-aware agent modeling without access to future information.

Abstract:
Autonomous vehicles must generate long-horizon and dynamically feasible trajectories in real time, even when operating at the limits of vehicle handling, to ensure safe operation in adverse conditions. However, existing work rarely quantifies the computational demands of generating such trajectories without prior references or warm starts, and often defaults to low-fidelity models, compromising accuracy and control authority. We investigate the modeling and solver design choices that enable real-time solution of long-horizon, reference-free optimal control problems (OCPs) using full vehicle dynamics. To this end, we analyze vehicle stiffness properties to justify the OCP's integration scheme and show that lower-order A-stable methods consistently outperform alternatives, with solve time differences reaching two orders of magnitude. We show that robust nonlinear solver performance hinges on understanding barrier parameter update strategies and safeguarding techniques for Hessian indefiniteness, inherent in some interior point methods. Lastly, we propose a computationally efficient method for generating initial guesses using dynamic equilibrium, unlocking real-time performance and reducing initial infeasibility by up to four orders of magnitude. Extensive benchmarking and high-fidelity BeamNG simulation demonstrate compute times as low as 55 ms over a 260 m horizon, including high-speed obstacle avoidance scenarios where drifting emerges as a necessary component of feasible trajectory generation.

Abstract:
Grasping target objects is a fundamental skill for robotic manipulation, but in cluttered environments with stacked or occluded objects, a single-step grasp is often insufficient. To address this, previous work has introduced pushing as an auxiliary action to create graspable space. However, these methods often struggle with both stability and efficiency because they neglect the scene's geometric information, which is essential for evaluating grasp robustness and ensuring that pushing actions are safe and effective. To this end, we propose a geometry-aware push-grasp synergy framework that leverages point cloud data to integrate grasp and push evaluation. Specifically, the grasp evaluation module analyzes the geometric relationship between the gripper's point cloud and the points enclosed within its closing region to determine grasp feasibility and stability. Guided by this, the push evaluation module predicts how pushing actions influence future graspable space, enabling the robot to select actions that reliably transform non-graspable states into graspable ones. By jointly reasoning about geometry in both grasping and pushing, our framework achieves safer, more efficient, and more reliable manipulation in cluttered settings. Our method is extensively tested in simulation and real-world environments in various scenarios. Experimental results demonstrate that our model generalizes well to real-world scenes and unseen objects. The code and video are available at https://github.com/xiaolijz/GAPG.

Abstract:
The absence of sensory feedback has been a critical challenge for myoelectric prostheses in recent years. While electrotactile feedback has emerged as an effective non-invasive solution, significant challenges remain in simultaneously ensuring real-time performance, processing EMG signals under electrical stimulation interference, and transmitting richer sensory information. This study proposes a multidimensional bio-inspired electrical stimulation feedback paradigm, implemented on a self-developed closed-loop myoelectric prosthetic hand system with real-time interference avoidance capability. Utilizing the human cutaneous nervous system as the feedback pathway, our paradigm establishes diverse electrotactile patterns through real-time modulation of four-channel stimulation parameters (frequency and current intensity). Experimental results with both able-bodied participants and amputees demonstrate that the proposed paradigm can accurately convey prosthetic state information, enabling users to perceive object size, length, shape, and stiffness through the prosthetic hand. This feedback framework provides a viable sensory restoration solution for prosthetic applications.

Abstract:
We address the over-constrained planning problem in semi-static environments. The planning objective is to find a best-effort solution that avoids all hard constraint regions while minimally traversing the least risky areas. Conventional methods often rely on pre-defined area costs, limiting generalization. Further, the spatial continuity of navigation spaces makes it difficult to identify regions that are passable without overestimation. To overcome these challenges, we propose SuReNav, a superpixel graph-based constraint relaxation and navigation method that imitates human-like safe and efficient navigation. Our framework consists of three components: 1) superpixel graph map generation with regional constraints, 2) regional-constraint relaxation using a graph neural network trained on human demonstrations for safe and efficient navigation, and 3) interleaving relaxation, planning, and execution for complete navigation. We evaluate our method against state-of-the-art baselines on 2D semantic maps and 3D maps from OpenStreetMap, achieving the highest human-likeness score of complete navigation while maintaining a balanced trade-off between efficiency and safety. We finally demonstrate its scalability and generalization performance in real-world urban navigation with a quadruped robot, Spot. Code and videos are available at https://sure-nav.github.io/.

Abstract:
Understanding an autonomous agent's decision-making prowess is of paramount importance, as it increases trust and guarantees safety. Although agent policies learned through reinforcement learning (RL) and machine learning (ML) paradigms have demonstrated their dominance in various domains, they struggle with deployment in high-stakes environments due to their algorithmic opacity. A structured and transparent representation of a policy helps us understand, evaluate, and modify it if necessary. Due to its inherent reactivity, modularity, and transparent hierarchical representation, the Behavior Tree (BT) is an ideal solution to represent control policies. In this paper, we focus on building a knowledge representation transfer framework in which knowledge of trained RL agents is captured through imitation learning and then utilized to form a compact BT. Our primary focus is to retain maximum performance while improving the interpretability of the BTs. In combination with planning and learning, we automate the formation of a BT and offer an alternative, transparent architecture for policy representation. In an extensive analysis with a variety of Gymnasium environments and the Robotics Package Delivery domain simulations, we demonstrate the significant performance retention capability and superior interpretability of the proposed Imitation-BT.

Abstract:
In this paper, the concurrent-allocation task execution (CATE) algorithm is presented to address this problem (i.e., MPCM navigation in obstacle environments). First, the path-crossing-related elements in terms of (i) robot allocation, (ii) desired-point convergence, and (iii) collision and obstacle avoidance are encoded into integer and control barrier function (CBF) constraints. Then, the proposed constraints are used in an online constrained optimization framework, which implicitly yet effectively minimizes the possible path crossings and trajectory length in obstacle environments by minimizing the desired point allocation cost and slack variables in CBF constraints simultaneously. In this way, the MPCM navigation in obstacle environments can be achieved with flexible spatial orderings. Note that the feasibility of solutions and the asymptotic convergence property of the proposed CATE algorithm in obstacle environments are both guaranteed, and the calculation burden is also reduced by concurrently calculating the optimal allocation and the control input directly without the path planning process. Finally, extensive simulations and experiments are conducted to validate that the CATE algorithm (i) outperforms the existing state-of-the-art baselines in terms of feasibility and efficiency in obstacle environments, (ii) is effective in environments with dynamic obstacles and is adaptable for performing various navigation tasks in 2D and 3D, (iii) demonstrates its efficacy and practicality by 2D experiments with a multi-AMR onboard navigation system, and (iv) provides a possible solution to evade deadlocks and pass through a narrow gap.

Abstract:
This letter presents a robust multi-sensor fusion framework for state estimation in legged robots (LLIO) based on an iterated extended Kalman filter. To address the limitations of IMU priori estimation, which often leads to legged robot localization errors or failures, our method integrates the contact constraints of the robot's leg kinematics with the ground. By introducing a sliding window-based ground contact constraint module, we effectively combine the contact state of the legged robot's foot with ground features, enhancing the constraints in complex environments and reducing localization drift. Additionally, factor graph optimization minimizes global cumulative drift. The proposed method has been extensively evaluated through numerous experiments and relevant public datasets. The results demonstrate that our approach significantly reduces local drift and achieves better computational efficiency.

Abstract:
This paper presents Decoupled STAR (DSTAR), a novel reconfigurable robot fitted with a sprawling mechanism that allows the wheel rotation axes to vary relative to the body, and two independently activated four-bar extension mechanisms (FBEM). These mechanisms enable the robot to move its center of mass (COM) in any direction, and increase its maneuvering capabilities by selecting a variety of locomotion gaits. A kinematic model of the robot and a quasi-static force analysis are used to optimize the design and evaluate its motor requirements. Experiments demonstrate that combining the sprawling mechanism with FBEM enables the DSTAR to both crawl and drive, overcome a wide range of challenging obstacles, and improve its climbing capability by 66% compared to symmetric FBEM designs (such as RSTAR). The robot can crawl and maneuver over rough terrain using its unique turtle-gait method, roll sideways to surmount wall obstacles up to 20 cm high, travel horizontally across uneven ground, and switch between wheels and whegs to adapt to different terrain types, including dirt, stones, and grass.

Abstract:
In this paper, we present a low-cost, easy-to-implement sim-to-real framework for biped locomotion that narrows the reality gap using only simulation data, without motion-capture or additional real-world measurements. First, a walking policy for the BRUCE robot is trained in Isaac Gym via reinforcement learning. Next, we develop a compact, physics-informed neural network (PINN) grounded in Euler-Lagrange structure and augmented with an LSTM to predict simulator forward dynamics. Trained solely on simulation trajectories, the PINN forecasts next-step joint angles and velocities of the simulated robot given the physical robot's current state and control inputs. During hardware deployment, and consistent with a whole-body control architecture, these predicted states serve as reference joint states while the policy outputs provide feedforward torque commands; a feedforward-plus-feedback torque controller then computes the executed joint torques, thereby reducing the sim-to-real gap. Experiments on BRUCE demonstrate that our method better reproduces simulated behavior and attains higher tracking accuracy than direct policy transfer. Furthermore, the dynamics predictor runs at 1 kHz on embedded hardware, showing superior real-time performance relative to existing learning-based models.
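
The feedforward-plus-feedback torque law mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the controller's exact form, so the PD feedback structure, the gain values `kp` and `kd`, and the function name `joint_torques` are all assumptions made for illustration only.

```python
from typing import List, Sequence


def joint_torques(
    tau_ff: Sequence[float],   # feedforward torques (e.g., policy outputs)
    q_ref: Sequence[float],    # reference joint angles (e.g., predicted states)
    qd_ref: Sequence[float],   # reference joint velocities
    q: Sequence[float],        # measured joint angles
    qd: Sequence[float],       # measured joint velocities
    kp: float = 20.0,          # hypothetical proportional gain
    kd: float = 0.5,           # hypothetical derivative gain
) -> List[float]:
    """Feedforward torque plus PD feedback on the reference joint state."""
    n = len(tau_ff)
    if not all(len(v) == n for v in (q_ref, qd_ref, q, qd)):
        raise ValueError("all joint vectors must have the same length")
    return [
        tau_ff[i] + kp * (q_ref[i] - q[i]) + kd * (qd_ref[i] - qd[i])
        for i in range(n)
    ]


if __name__ == "__main__":
    # One joint: feedforward 1.0 N·m, position error 0.25 rad, no velocity error.
    print(joint_torques([1.0], [0.5], [0.0], [0.25], [0.0], kp=4.0, kd=0.0))
```

With zero tracking error the command reduces to the feedforward term alone; the feedback terms only act when the measured state departs from the reference.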

Abstract:
This paper presents a unified framework that jointly predicts behavioral intentions and vectorized occupancy, leveraging them as priors to dynamically prune context information during trajectory decoding, thereby enhancing prediction accuracy, interpretability, and efficiency. While most prior work has focused on boosting the precision of multimodal trajectory prediction, explicit modeling of behavioral intentions (e.g., yielding, overtaking) remains underexplored. To this end, we employ a shared context encoder for both intention and trajectory predictions, thereby reducing structural redundancy and information loss. Moreover, we address the lack of ground-truth behavioral intention labels in mainstream datasets (Waymo, Argoverse) by auto-labeling these datasets, thus advancing the community's efforts in this direction. We further introduce a vectorized occupancy prediction module that infers the probability of each map polyline being occupied by the target vehicle's future trajectory. By leveraging these intention and occupancy prediction priors, our method conducts dynamic, modality-dependent pruning of irrelevant agents and map polylines in the decoding stage, effectively reducing computational overhead and mitigating noise from non-critical elements. Our approach ranks first among LiDAR-free methods on the Waymo Motion Dataset and achieves SOTA performance on the Waymo Interactive Prediction Dataset. Remarkably, even without model ensembling, our single-model framework improves the softmAP by 10% compared to the previous SOTA method, BETOP, on the Waymo Interactive Prediction Leaderboard. Furthermore, the proposed framework has been successfully deployed on real vehicles, demonstrating its practical effectiveness in real-world applications.

Abstract:
Maintaining the visibility of the target is one of the major objectives of aerial tracking missions. This paper proposes a target-visible trajectory planning pipeline using quadratic programming (QP). Our approach can handle various tracking settings, including 1) single- and dual-target following and 2) both static and dynamic environments, unlike other works that focus on a single specific setup. In contrast to other studies that fully trust the predicted trajectory of the target and consider only the visibility of the target's center, our pipeline considers error in target path prediction and the entire body of the target to robustly maintain target visibility. First, a prediction module uses a sample-check strategy to quickly calculate the reachable areas of moving objects, which represent the areas their bodies can reach, considering obstacles. Subsequently, the planning module formulates a single QP problem, considering path homotopy, to generate a tracking trajectory that maximizes the visibility of the target's reachable area among obstacles. The performance of the planner is validated in multiple scenarios, through high-fidelity simulations and real-world experiments.

Abstract:
This paper aims to bridge perception and planning in navigation systems by learning optimal trajectories from depth information in an end-to-end fashion. However, using neural networks as black-box replacements for traditional modules risks scalability and adaptability. Moreover, such methods often fall short in sufficiently incorporating the robot's dynamic constraints, resulting in trajectories that are either inadequately executable or unexpectedly aggressive, diverging from user expectations. In this paper, we fuse the benefits of conventional methods and neural networks by introducing an optimization-embedded network based on a compact trajectory library. The network distills spatial constraints, which are then applied to a model-based spatial-temporal trajectory optimization problem, yielding feasible and optimal solutions. By making the optimization problem differentiable, our model seamlessly approximates the optimal trajectory. Additionally, the introduced regularized trajectory library permits efficient capture of the spatial distribution of optimal trajectories with minimal storage cost, safeguarding multimodal planning features. Benchmarking demonstrates the outstanding performance of our method in trajectory smoothness, success rate, and constraint satisfaction. Real-world flight experiments with an onboard computer showcase the autonomous quadrotor's ability to navigate swiftly through dense forests. Our project page with videos is at https://zju-fast-lab.github.io/e2e_opt/.

Abstract:
Learning from demonstration has proved itself useful for teaching robots complex skills with high sample efficiency. However, teaching long-horizon tasks with multiple skills is challenging as deviations tend to accumulate, the distributional shift becomes more evident, and human teachers become fatigued over time, thereby increasing the likelihood of failure. To address these challenges, we introduce (ST)², a sequential method for learning long-horizon manipulation tasks that allows users to control the teaching flow by specifying key points, enabling structured and incremental demonstrations. Using this framework, we study how users respond to two teaching paradigms: (i) a traditional monolithic approach, in which users demonstrate the entire task trajectory at once, and (ii) a sequential approach, in which the task is segmented and demonstrated step by step. We conducted an extensive user study on the restocking task with 16 participants in a realistic retail store environment, evaluating the user preferences and effectiveness of the methods. User-level analysis showed superior performance for the sequential approach in most cases (10 users), compared with the monolithic approach (5 users), with one tie. Our subjective results indicate that some teachers prefer sequential teaching, as it allows them to teach complicated tasks iteratively, while others prefer teaching in one go due to its simplicity.

Abstract:
This paper presents the design, development, and testing of a soft 3D-printed endoskeleton for arbitrary cable routing in tendon-driven soft actuators. The endoskeleton is embedded in a silicone body, and it is fixed to the mold prior to the casting process. It enables tendons to be placed through predefined eyelets, ensuring accurate positioning within the soft body. To minimize its impact on the overall stiffness of the soft body, the endoskeleton was designed with a slim profile, flexible connections, and fabricated using a 3D-printable elastic material (Shore A hardness 50), selected to roughly match the mechanical properties of the surrounding silicone matrix (typically with Shore 00 hardness 20-30). Although the reference geometry in this study is a cylindrical body, the design can be extended to a wide range of soft body shapes and sizes. Key features of the proposed solution include a 3D-printable guide for tendon routing that is (1) fully soft, (2) easy to place, (3) rapidly reconfigurable for arbitrary tendon paths, (4) adaptable to variable soft body geometries, and (5) easy to fabricate with single-step casting. The current work describes the design, manufacturing, simulation, and testing of a case study in which the endoskeleton is employed to reproduce a target pose predicted by FE analysis. The matching is satisfactory and demonstrates the effectiveness of the approach.

Abstract:
Mechanical compliance is a key design parameter for dynamic contact-rich manipulation, affecting task success and safety robustness over contact geometry variation. Design of soft robotic structures, such as compliant fingers, requires choosing design parameters which affect geometry and stiffness, and therefore manipulation performance and robustness. Today, these parameters are chosen through either hardware iteration, which takes significant development time, or simplified models (e.g., planar), which cannot address complex manipulation task objectives. Improvements in dynamic simulation, especially with contact and friction modeling, present a potential design tool for mechanical compliance. We propose a simulation-based design tool for compliant mechanisms which allows design with respect to task-level objectives, such as success rate. This is applied to optimize design parameters of a structured compliant finger to reduce failure cases inside a tolerance window in insertion tasks. The improvement in robustness is then validated on a real robot using tasks from the benchmark NIST task board. The finger stiffness affects the tolerance window: optimized parameters can increase tolerable ranges by a factor of 2.29, with workpiece variation up to 8.6 mm being compensated. However, the trends remain task-specific. In some tasks, the highest stiffness yields the widest tolerable range, whereas in others the opposite is observed, motivating the need for design tools which can consider application-specific geometry and dynamics.

Abstract:
We propose a control framework that integrates model-based bipedal locomotion with residual reinforcement learning (RL) to achieve robust and adaptive walking in the presence of real-world uncertainties. Our approach leverages a model-based controller, comprising a Divergent Component of Motion (DCM) trajectory planner and a whole-body controller, as a reliable base policy. To address the uncertainties of inaccurate dynamics modeling and sensor noise, we introduce a residual policy trained through RL with domain randomization. Crucially, we employ a model-based oracle policy, which has privileged access to ground-truth dynamics during training, to supervise the residual policy via a novel supervised loss. This supervision enables the policy to efficiently learn corrective behaviors that compensate for unmodeled effects without extensive reward shaping. Our method demonstrates improved robustness and generalization across a range of randomized conditions, offering a scalable solution for sim-to-real transfer in bipedal locomotion.

Abstract:
In visual simultaneous localization and mapping (SLAM), the quality of the visual vocabulary is fundamental to the system's ability to represent environments and recognize locations. While ORB-SLAM is a widely used framework, its binary vocabulary, trained through the k-majority-based bag-of-words (BoW) approach, suffers from inherent precision loss. The inability of conventional binary clustering to represent subtle feature distributions leads to the degradation of visual words, a problem that is compounded as errors accumulate and propagate through the hierarchical tree structure. To address these structural deficiencies, this paper proposes hierarchical binary-to-real-and-back (HBRB)-BoW, a refined hierarchical binary vocabulary training algorithm. By integrating a global real-valued flow within the hierarchical clustering process, our method preserves high-fidelity descriptor information until the final binarization at the leaf nodes. Experimental results demonstrate that the proposed approach yields a more discriminative and well-structured vocabulary than traditional methods, significantly enhancing the representational integrity of the visual dictionary in complex environments. Furthermore, replacing the default ORB-SLAM vocabulary file with our HBRB-BoW file is expected to improve performance in loop closing and relocalization tasks.
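
The precision loss attributed to k-majority clustering in the abstract above can be seen in a minimal sketch of the centroid update. This illustrates only the conventional k-majority step, not the HBRB-BoW algorithm, whose details the abstract does not give. The helper names and the tie-breaking rule (ties go to 0) are assumptions.

```python
from typing import List, Sequence

Bits = Sequence[int]  # a binary descriptor stored as a sequence of 0/1 values


def hamming(a: Bits, b: Bits) -> int:
    """Number of bit positions where two descriptors differ."""
    return sum(x != y for x, y in zip(a, b))


def bit_means(cluster: Sequence[Bits]) -> List[float]:
    """Real-valued centroid: fraction of descriptors with each bit set."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def majority_centroid(cluster: Sequence[Bits]) -> List[int]:
    """k-majority centroid: each bit is set if more than half the members set it."""
    return [1 if m > 0.5 else 0 for m in bit_means(cluster)]


if __name__ == "__main__":
    cluster = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]]
    print(bit_means(cluster))          # real-valued bit frequencies
    print(majority_centroid(cluster))  # binarized centroid
```

The real-valued means carry how strongly each bit is supported, while the binarized centroid keeps only which side of one half each mean falls on. Repeating that rounding at every level of a vocabulary tree is the kind of accumulated loss the abstract describes.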

Abstract:
This work presents a tightly coupled LiDAR-inertial odometry (LIO) framework tailored for quadruped robots operating under vibration and fluctuating conditions. By integrating time delay estimation (TDE) into an error-state Kalman filter (ESKF), external disturbances affecting the IMU are explicitly estimated and compensated during IMU pre-integration, significantly reducing vibration-induced errors. The resulting refined IMU poses are further used to correct LiDAR motion distortion, enabling a unified refinement process. This leads to smoother trajectories, improved localization accuracy, and enhanced robustness against both environmental and sensor uncertainties. The proposed framework is validated through real-time deployment on a quadruped robot.

Abstract:
Soft robots offer significant advantages over rigid-bodied counterparts due to their inherent flexibility and deformability. However, these same characteristics often introduce challenges in control and simulation-to-reality (Sim-to-Real) synchronization. In this study, we quantified the locomotion performance of a soft quadruped robot through experimental trials. To alleviate the burden of exhaustive physical testing over an expanded parameter space, we employed a Gaussian process-based optimization framework.

Abstract:
Accurate Wireless Magnetic Capsule (WCE) pose estimation remains a challenge for advancing minimally invasive medical procedures, because the relationship between magnetic sensor measurements and capsule pose is highly nonlinear and sensitive to noise and modeling errors, making large-scale training data essential for data-driven estimation. However, data acquisition itself remains a limiting factor, restricting both the volume of training data and the effective workspace of the system. To address this limitation, we propose a region-selective synthetic data injection strategy that generates additional data points using a calibrated physics-based model. In this strategy, regions with high model fidelity are replaced with physics-based data at arbitrary points, while regions with lower fidelity rely on sensor data, which provides a more accurate representation of the real system. Experimental results show that the proposed strategy achieves performance comparable to that of a purely data-driven model while significantly reducing the data acquisition burden.

Abstract:
Magnetic microagents hold great promise for improved medical treatments, including targeted drug delivery. Yet, tracking position and performance in vivo during actuation remains challenging. We introduce a sensing approach leveraging inductive signals derived from microagents actuated by a rotating magnetic field (RMF). This offers the unique possibility of controlling microagents while simultaneously tracking position and phase lag in real-time.

Abstract:
Continuum soft robots (cSR) represent a particular class of the most compliant deformable robots made of elastomers, typically driven by embedded pneumatic chambers. While their structural compliance is appealing for many tasks, most existing research on variable stiffness at a constant generalized position relies on empirical evidence rather than a formal method that enhances the properties of cSR. Consequently, few sound advances have been reported on closed-loop (control-based) variable stiffness and, furthermore, on variable viscoelasticity for tracking in cSR; nor have existing approaches fully exploited the ease with which soft robots incorporate additional actuation inputs. In this extended abstract, the control of the fundamental behavioral structural viscoelasticity is addressed through commanding redundancy of actuation during motion. To this end, m pneumatic chambers are introduced into the cSR such that n chambers are used for motion tracking, while the remaining r = m - n are available to allocate variations of stiffness in regulation tasks, and of viscoelasticity in tracking tasks.

Abstract:
Controlling friction at the fingertip is fundamental to dexterous manipulation, yet remains difficult to realize in robotic hands. We present the design and analysis of a robotic fingertip equipped with passive rollers that can be selectively braked or pivoted to modulate contact friction and constraint directions. When unbraked, the rollers permit unconstrained sliding of the contact point along the rolling direction; when braked, they resist motion like a conventional fingertip. The rollers are mounted on a pivoting mechanism, allowing reorientation of the constraint frame to accommodate different manipulation tasks. We develop a constraint-based model of the fingertip integrated into a parallel-jaw gripper and analyze its ability to support diverse manipulation strategies. Experiments show that the proposed design enables a wide range of dexterous actions that are conventionally challenging for robotic grippers, including sliding and pivoting within the grasp, robust adaptation to uncertain contacts, multi-object or multi-part manipulation, and interactions requiring asymmetric friction across fingers. These results demonstrate the versatility of passive roller fingertips as a low-complexity, mechanically efficient approach to friction modulation, advancing the development of more adaptable and robust robotic manipulation.

Abstract:
Using unmanned aerial vehicle (UAV) formations to guide unmanned ground vehicles (UGVs) through unstructured obstacle-laden areas leads to highly efficient execution of tasks such as the transportation of supplies. However, existing methods fail to efficiently plan obstacle-avoidance strategies for the entire UAV-UGV swarm. Additionally, the formation controller and planner are isolated, resulting in the degradation of formation tracking accuracy, which presents potential security risks. This paper proposes a novel UAV formation scheme that integrates safe corridor (SC) generation, trajectory fitting, and formation tracking to ensure operational safety. The scheme employs a novel line-of-sight (LOS) mechanism to optimize A*-planned waypoints, generating the SC as an obstacle-avoidance strategy. A minimum snap trajectory is fitted to the optimized waypoints with SC constraints. Bridged by the trajectory, the scheme develops a rigid-graph-based controller (RGC) to track the planning result, enabling dynamic formation maneuvering within the SC. Consequently, the proposed UAV formation scheme achieves obstacle-avoidance guidance by restricting the UGVs to the formation projection. The validation results demonstrate that the proposed scheme exhibits enhanced robustness and superior planning capabilities compared to traditional methods.

Abstract:
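Soft grippers are highly valued for their adaptability and safety, but their inherent softness often leads to grasp failure under heavy loads. Most adhesion-enhanced grippers rely on a single adhesion strategy tailored to either smooth or rough surfaces. Lizards, however, navigate effectively in unstructured environments by adapting to surface conditions. Inspired by the hybrid adhesion strategies of geckos and chameleons, this study presents a bioinspired soft gripper that integrates micro-wedge dry adhesives and SMA-driven microspines. The micro-wedge adhesives provide controllable adhesion on smooth surfaces, while the SMA-driven microspines extend for adhesion on rough surfaces and retract to avoid interference. An optimization model is developed to determine the optimal link dimensions, improving grasping performance in terms of force and radius. Experimental results on various surfaces validate its effectiveness.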

Abstract:
Learning-based monocular visual odometry (VO) poses robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging the DINOv2 visual foundation model for its sparse feature matching. To address the integration challenge, we propose a salient keypoints detector tailored to DINOv2's coarse features. Furthermore, we complement DINOv2's robust-semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on the EuRoC dataset, while running efficiently at 72 FPS with less than 1 GB of memory usage on a single GPU. Moreover, it performs competitively against Visual SLAM systems on outdoor driving scenarios, showcasing its generalization capabilities.

Abstract:
Open-World Object Detection (OWOD) presents a critical challenge for modern computer vision systems: detecting known classes, identifying unknown objects, and incrementally learning to recognize them over time. However, current approaches have two fundamental limitations: (1) the fixed-dimensional classification head inherently restricts incremental learning capabilities, and (2) heavy reliance on extensive annotated data hinders adaptability in few-shot settings. To address these limitations, we propose OWOD-FSL, which integrates a dynamic prototype classification head with few-shot learning. At the core of our approach are two major contributions: a dynamic prototype classification head that supplants traditional fixed classifiers with an expandable prototype classifier for unlimited class expansion, and a biologically-inspired bi-phase learning strategy that integrates offline prototype generation with incremental learning refinement. Comprehensive experiments on the M-OWODB benchmark show that OWOD-FSL achieves state-of-the-art performance in both unknown class recall (U-Recall) and known class mAP, significantly outperforming existing methods.

Abstract:
Accurate and reliable positioning is essential for perception, decision-making, and other high-level applications in autonomous driving, autonomous aerial vehicles, and intelligent robotics. Due to the inherent limitations of standalone sensors, integrating heterogeneous sensors with complementary capabilities is an effective approach to achieving this goal. The visual-inertial navigation system (VINS) fuses visual cameras and inertial measurement units (IMUs) to explore unknown environments. It requires a priori knowledge of 3D features and jointly estimates camera poses and feature positions, which inevitably introduces feature linearization errors. Meanwhile, the dimensionality of the system state increases with abundant textures, degrading real-time performance. To eliminate accumulated errors from VINS, frameworks further fuse measurements from the Global Navigation Satellite System (GNSS), but still suffer from similar limitations. To address the aforementioned issues, we propose a filtering-based, tightly coupled GNSS-visual-inertial positioning framework with a pose-only formulation applied to VINS, termed PO-GVINS. We first apply the PO formulation to our VINS (PO-VINS). GNSS raw measurements are subsequently incorporated, with integer ambiguities resolved, to achieve accurate and drift-free state estimation. Extensive experiments demonstrate that the proposed PO-VINS significantly outperforms the multi-state constraint Kalman filter (MSCKF) and achieves accuracy comparable to that of optimization-based VINS. By further incorporating GNSS measurements, PO-GVINS achieves accurate, drift-free state estimation, making it a robust solution for positioning in challenging environments.

Abstract:
Soft robots, with their highly compliant bodies, exhibit numerous unforeseen configurations that often defy engineering intuition and complicate control design. This work introduces a simulation-based co-optimization framework that jointly optimizes both morphology and control. Unlike existing approaches that rely on oversimplified soft robot models or feed-forward controllers for simple tasks, our framework targets complex tasks that benefit from closed-loop feedback. The controller is trained over a hybrid design space combining discrete parameters, which define the nominal structure, and continuous parameters, which shift the morphology adaptively. The design distribution is iteratively manipulated to emphasize high-performing candidates until the optimal design-control pair emerges. Proprioceptive feedback in the form of mechanical strain is integrated to provide the controller with awareness of morphological state and interaction dynamics. Demonstrations show that the framework converges reliably to optimal design-control solutions, validating the effectiveness of the proposed joint optimization strategy.

Abstract:
The Vision-and-Language Navigation (VLN) task requires an agent to follow natural language instructions and navigate through complex environments. Existing MLLM-based VLN methods primarily rely on imitation learning (IL) and often use DAgger for post-training to mitigate covariate shift. While effective, these approaches incur substantial data collection and training costs. Reinforcement learning (RL) offers a promising alternative. However, prior VLN RL methods lack dynamic interaction with the environment and depend on expert trajectories for reward shaping, rather than engaging in open-ended active exploration. This restricts the agent's ability to discover diverse and plausible navigation routes. To address these limitations, we propose ActiveVLN, a VLN framework that explicitly enables active exploration through multi-turn RL. In the first stage, a small fraction of expert trajectories is used for IL to bootstrap the agent. In the second stage, the agent iteratively predicts and executes actions, automatically collects diverse trajectories, and optimizes multiple rollouts via the GRPO objective. To further improve RL efficiency, we introduce a dynamic early-stopping strategy to prune long-tail or likely failed trajectories, along with additional engineering optimizations. Experiments show that ActiveVLN achieves the largest performance gains over IL baselines compared to both DAgger-based and prior RL-based post-training methods, while reaching competitive performance with state-of-the-art approaches despite using a smaller model.
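
The group-relative advantage used by GRPO-style objectives, which the abstract above applies to multiple rollouts, can be sketched as follows. This is a generic illustration and not ActiveVLN's implementation. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against the other rollouts in its group.

    Rollouts that score above the group mean get a positive advantage and
    those below get a negative one, so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts of the same instruction: two succeed (1.0), two fail (0.0).
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.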

Abstract:
Multi-UAV coordination is critical for complex real-world applications, but these missions are often constrained by intricate causal dependencies between tasks, alongside strict UAV energy and return-to-base constraints. Existing methods, ranging from exact solvers to standard deep reinforcement learning approaches, struggle to scale with the combinatorial complexity of this problem and often fail to effectively represent the underlying logical task structures. To address this gap, we propose the Causal-Channel Transformer for Joint Allocation (C2T-JA), an end-to-end reinforcement learning framework. The core of C2T-JA is a dual-branch hybrid attention encoder that explicitly constructs and reasons over multi-hop, disentangled causal channels, effectively decoupling logical dependencies from spatial task features. Building on this rich representation, a context-aware decoder generates a globally coordinated joint action for the entire team. We evaluated C2T-JA against established baselines, including an exact solver (Gurobi), a conventional heuristic (OR-Tools), and a leading learning-based approach (AM-joint), on procedurally generated benchmarks of varying scales and dependency structures. The results demonstrate that our approach consistently produces higher-quality solutions, measured by task completion rates, while reducing decision times by several orders of magnitude, particularly in large-scale scenarios.

Abstract:
Thermal cameras offer strong potential for robot perception under challenging illumination and weather conditions. However, thermal Simultaneous Localization and Mapping (SLAM) remains difficult due to unreliable feature extraction, unstable motion tracking, and inconsistent global pose and map construction, particularly in dynamic large-scale outdoor environments. To address these challenges, we propose LST-SLAM, a novel large-scale stereo thermal SLAM system that achieves robust performance in complex, dynamic scenes. Our approach combines self-supervised thermal feature learning, stereo dual-level motion tracking, and geometric pose optimization. We also introduce a semantic-geometric hybrid constraint that suppresses potentially dynamic features lacking strong inter-frame geometric consistency. Furthermore, we develop an online incremental bag-of-words model for loop closure detection, coupled with global pose optimization to mitigate accumulated drift. Extensive experiments on kilometer-scale dynamic thermal datasets show that LST-SLAM significantly outperforms recent representative SLAM systems, including AirSLAM and DROID-SLAM, in both robustness and accuracy.

Abstract:
Unmanned aerial vehicle (UAV) perception relies on onboard sensors like cameras and LiDAR, which are limited by a narrow field of view (FoV). We present Self-Perception INertial Navigation Enabled Rotorcraft (SPINNER), a self-rotating tri-rotor UAV for FoV expansion and autonomous flight. Without adding extra sensors or energy consumption, SPINNER significantly expands the FoV of onboard camera and LiDAR sensors through continuous spin motion, thereby enhancing environmental perception efficiency. SPINNER achieves full 3-dimensional position and roll-pitch attitude control using only three brushless motors, while adjusting the rotation speed via an anti-torque plate design. To address the strong coupling, severe nonlinearity, and complex disturbances induced by spinning flight, we develop a disturbance compensation control framework that combines nonlinear model predictive control (MPC) with incremental nonlinear dynamic inversion. Experimental results demonstrate that SPINNER maintains robust flight under wind disturbances up to 4.8 m/s and achieves high-precision trajectory tracking at a maximum speed of 2.0 m/s. Moreover, tests in parking garages and forests show that the rotational perception mechanism substantially improves FoV coverage and enhances the perception capability of SPINNER.

Abstract:
Depth completion is the challenge of recovering a dense depth map from an RGB image and corresponding sparse depth measurements. Many modern depth completion strategies rely on deep neural networks, using a monocular depth estimation (MDE) backbone to generate an initial dense depth map from the RGB image. This estimate is then further refined with the help of auxiliary network components that utilise the sparse depth measurements to improve accuracy and restore fine-grained depth details. However, such approaches introduce additional model parameters and require domain-specific fine-tuning, making them impractical for resource-constrained robotics applications. In this paper, we propose an alternative refinement strategy based on compressed sensing. Using the Discrete Cosine Transform (DCT) as our basis, we construct a ratio matrix that rescales the estimated depth map to align with measured ground truth data. Our experiments demonstrate that this method can significantly reduce the RMSE and MAE of the initial MDE estimate by more than a factor of 15. Furthermore, the proposed approach can outperform state-of-the-art depth completion models at sampling ratios above 50 percent, while also substantially reducing the overall GPU VRAM requirements. This pipeline is modular and compatible with any existing MDE model with no additional training, making it particularly suitable for deployment on GPU-constrained robotic platforms in previously unseen environments.

Abstract:
End-to-end robotic grasping increasingly relies on reinforcement learning to enable safe and precise execution, yet defining a reward that consistently drives such behavior remains a central challenge. Human-engineered rewards have been widely explored, but they are prone to reward hacking, depend heavily on artificial design choices, and often fail to capture human intuition. Preference-based reward models offer a promising alternative by aligning policies with human feedback, but their application to robotic grasping has remained limited, and preference-aligned actions do not always translate into successful execution. We propose Human Preference and Success-based Grasping (HPSG), a three-stage framework that combines pre-training, reward modeling, and fine-tuning. At its core is the Weighted Success Reward (WSR), which integrates a preference-trained reward model with binary success feedback so that policies learn behaviors that are effective in practice and aligned with human judgment. This design resolves the mismatch between subjective preferences and execution outcomes, thereby improving reliability. Through extensive simulation and real-world experiments, we show that HPSG produces reliable grasping policies, achieving higher success and completion rates, reducing collisions, and transferring to physical settings with smaller performance degradation than baseline methods. Our code is publicly available at: https://github.com/qkrwnduf1997/HPSG
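
One way to read the Weighted Success Reward is as a blend of a learned preference score and a binary success signal. The sketch below is an illustrative assumption, not the authors' formulation: the function name `weighted_success_reward`, the mixing weight `alpha`, and the linear combination are all hypothetical.

```python
def weighted_success_reward(preference_score: float, success: bool, alpha: float = 0.5) -> float:
    """Blend a preference-model score with binary success feedback.

    Hypothetical form: alpha weights the preference score and
    (1 - alpha) weights the success indicator (1.0 or 0.0).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    success_term = 1.0 if success else 0.0
    return alpha * preference_score + (1.0 - alpha) * success_term
```

With this form, a grasp that the preference model rates highly but that fails in execution receives a lower reward than an equally rated grasp that succeeds, which is the mismatch the abstract describes.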

Abstract:
Control barrier functions (CBFs) have been demonstrated as an effective method for safety-critical control of autonomous systems. Although CBFs are simple to deploy, their design remains challenging, motivating the development of learning-based approaches. Yet, issues such as suboptimal safe sets, applicability in partially observable environments, and lack of rigorous safety guarantees persist. In this work, we propose observation-conditioned neural CBFs based on Hamilton-Jacobi (HJ) reachability analysis, which approximately recover the maximal safe sets. We exploit certain mathematical properties of the HJ value function, ensuring that the predicted safe set never intersects with the observed failure set. Moreover, we leverage a hypernetwork-based architecture that is particularly suitable for the design of observation-conditioned safety filters. The proposed method is examined both in simulation and hardware experiments for a ground robot and a quadcopter. The results show improved success rates and generalization to out-of-domain environments compared to the baselines.

Abstract:
Model Predictive Path Integral (MPPI) control is a widely used sampling-based approach for real-time control, valued for its flexibility in handling arbitrary dynamics and cost functions. However, it often suffers from high-frequency noise in the sampled control trajectories, which hinders the search for optimal controls and transfers to the applied controls, leading to actuator wear. In this work, we introduce Low-Pass Model Predictive Path Integral Control (LP-MPPI), which integrates low-pass filtering into the sampling process to eliminate detrimental high-frequency components and enhance the algorithm's efficiency. Unlike prior approaches, LP-MPPI provides direct and interpretable control over the frequency spectrum of sampled control trajectory perturbations, leading to more efficient sampling and smoother control. Through extensive evaluations in Gymnasium environments, simulated quadruped locomotion, and real-world F1TENTH autonomous racing, we demonstrate that LP-MPPI consistently outperforms state-of-the-art MPPI variants, achieving significant performance improvements while reducing control signal chattering.
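
The idea of low-pass filtering sampled control perturbations can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not specify the filter design, so a first-order exponential smoother stands in for it, and the names `low_pass`, `sample_perturbations`, `roughness`, and the `smoothing` parameter are hypothetical.

```python
import random

def low_pass(sequence, smoothing):
    """First-order IIR low-pass filter: y[t] = (1 - a) * y[t-1] + a * x[t]."""
    filtered = []
    state = sequence[0]  # initialise the filter state at the first sample
    for x in sequence:
        state = (1.0 - smoothing) * state + smoothing * x
        filtered.append(state)
    return filtered

def sample_perturbations(num_samples, horizon, sigma, smoothing, seed=0):
    """Draw Gaussian control perturbations and low-pass filter each rollout."""
    rng = random.Random(seed)
    return [
        low_pass([rng.gauss(0.0, sigma) for _ in range(horizon)], smoothing)
        for _ in range(num_samples)
    ]

def roughness(sequence):
    """Mean squared step-to-step change, a crude proxy for high-frequency content."""
    steps = zip(sequence, sequence[1:])
    return sum((b - a) ** 2 for a, b in steps) / (len(sequence) - 1)
```

A smaller `smoothing` value corresponds to a lower cutoff frequency; `smoothing = 1.0` leaves the white-noise samples unchanged, which recovers unfiltered sampling.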

Abstract:
Unsupervised skill discovery acquires a diverse repertoire of skills through intrinsic motivation, offering the potential to alleviate the labor-intensive reward engineering in reinforcement learning and the reliance on costly task-specific data in imitation learning. However, such methods typically measure diversity based on single-step states, neglecting the trajectory phase coherence, whose absence disrupts the smoothness of state transitions. In this work, we explore skills in Fourier latent space via a simple mutual-information-based reward function, aiming to train a single versatile policy capable of executing diverse state transition patterns. Specifically, we utilize a spatio-temporal representation learned through a Periodic Autoencoder, which effectively captures the periodic or quasi-periodic nature of motion. These features, rather than raw states, are used to measure skill diversity. We validate our method on the 12-DOF quadruped robot Unitree A1, achieving varied gaits. Simulation results show that our method reduces high-frequency power by 73%, while improving state space coverage by 133% compared to the baseline. To accomplish specific tasks, we trained a high-level controller to orchestrate the learned skills, which improves training efficiency. Real-world experiments demonstrate that the learned skills can reliably execute tasks.

Abstract:
In recent years, the field of Human-Drone Interaction (HDI) has attracted significant attention, particularly in navigation assistance using aerial robots. However, existing approaches rely on non-contact cues or indirect physical connections such as cables, which demand continuous flight and high energy consumption. These limitations shorten interaction time and make direct assistance challenging. To address this issue, we employ a deformable aerial robot with inflatable structures that enables adaptive perching on the human arm. On this platform, we propose a perching-based haptic guidance method in which the robot maintains close contact to deliver directional cues via thrust modulation and provide alerts and arrival feedback through vibration signals. The system further switches feedback modes dynamically according to context, enabling intuitive and flexible guidance beyond conventional methods limited to simple directional cues. Through experiments, we quantitatively evaluated the presented force characteristics and confirmed that perching-based haptic guidance requires less power than continuous flight in the same platform. User experiments further demonstrated that participants could reach target locations without major deviations even when vision and hearing were blocked. Moreover, the entire process of approach, perching, haptic guidance, and deperching was stably executed on the real platform, validating the feasibility of perching-based haptic guidance. To the best of our knowledge, this is the first study to realize close physical Human-Drone Interaction (pHDI) through perching-based haptic guidance.

Abstract:
Quadrotor endurance is ultimately limited by battery behavior, yet most energy-aware planning treats the battery as a simple energy reservoir and overlooks how flight motions induce dynamic current loads that accelerate battery degradation. This work presents an end-to-end framework for motion-aware battery health assessment in quadrotors. We first design a wide-range current sensing module to capture motion-specific current profiles during real flights, preserving transient features. In parallel, a high-fidelity battery model is calibrated using reference performance tests and a metaheuristic based on a degradation-coupled electrochemical model. By simulating measured flight loads in the calibrated model, we systematically resolve how different flight motions translate into degradation modes (loss of lithium inventory and loss of active material) as well as internal side reactions. The results demonstrate that even when two flight profiles consume the same average energy, their transient load structures can drive different degradation pathways, emphasizing the need for motion-aware battery management that balances efficiency with battery degradation.

Abstract:
Large foundation models enable powerful reasoning for autonomous systems, but mapping semantic intent to reliable real-time control remains challenging. Existing approaches either (i) let Large Language Models (LLMs) generate trajectories directly - brittle, hard to verify, and latency-prone - or (ii) adjust Model Predictive Control (MPC) objectives online - mixing slow deliberation with fast control and blurring interfaces. We propose Agentic Fast-Slow Planning, a hierarchical framework that decouples perception, reasoning, planning, and control across natural timescales. The framework contains two bridges. Perception2Decision compresses scenes into ego-centric topologies using an on-vehicle Vision-Language Model (VLM) detector, then maps them to symbolic driving directives in the cloud with an LLM decision maker - reducing bandwidth and delay while preserving interpretability. Decision2Trajectory converts directives into executable paths: Semantic-Guided A* embeds language-derived soft costs into classical search to bias solutions toward feasible trajectories, while an Agentic Refinement Module adapts planner hyperparameters using feedback and memory. Finally, MPC tracks the trajectories in real time, with optional cloud-guided references for difficult cases. Experiments in CARLA show that Agentic Fast-Slow Planning improves robustness under perturbations, reducing lateral deviation by up to 45% and completion time by over 12% compared to pure MPC and an A*-guided MPC baseline. Code is available at https://github.com/cjychenjiayi/icra2026_AFSP.
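
Embedding soft costs into a classical A*-style search can be sketched on a small grid. This is a generic illustration, not the released implementation: the function name `semantic_a_star`, the 4-connected grid, and the additive per-cell `soft_cost` map (standing in for language-derived costs) are all assumptions.

```python
import heapq

def semantic_a_star(grid, soft_cost, start, goal):
    """A* on a 4-connected grid; grid[r][c] == 1 marks an obstacle.

    Each step costs 1 plus the non-negative soft cost of the cell entered,
    so the Manhattan distance remains an admissible heuristic.
    Returns (path, cost), or (None, inf) if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0.0, start)]
    best = {start: 0.0}
    parent = {start: None}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1], g
        if g > best[cell]:
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1.0 + soft_cost[nr][nc]
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    parent[(nr, nc)] = cell
                    heapq.heappush(open_heap, (ng + heuristic((nr, nc)), ng, (nr, nc)))
    return None, float("inf")
```

Because the soft cost is added to the step cost rather than treated as a hard obstacle, a penalised cell is still traversable when no cheaper detour exists.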

Abstract:
Imitation learning (IL) policies in robotics deliver strong performance in controlled settings but remain brittle in real-world deployments: rare events such as hardware faults, defective parts, unexpected human actions, or any state that lies outside the training distribution can lead to failed executions. Vision-based Anomaly Detection (AD) methods have emerged as an appropriate solution to detect these anomalous failure states but do not distinguish failures from benign deviations. We introduce FIDeL (Failure Identification in Demonstration Learning), a policy-independent failure detection module. Leveraging recent AD methods, FIDeL builds a compact representation of nominal demonstrations and aligns incoming observations via optimal transport matching to produce anomaly scores and heatmaps. Spatio-temporal thresholds are derived with an extension of conformal prediction, and a Vision-Language Model (VLM) performs semantic filtering to discriminate benign anomalies from genuine failures. We also introduce BotFails, a multimodal dataset of real-world tasks for failure detection in robotics. FIDeL consistently outperforms state-of-the-art baselines, yielding +5.30% AUROC in anomaly detection and +17.38% failure-detection accuracy on BotFails compared to existing methods.
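
For readers unfamiliar with conformal thresholds, the standard split conformal rule is sketched below. It is the textbook scalar version, not the spatio-temporal extension the abstract refers to; the function name `conformal_threshold` and the assumption that calibration scores come from nominal (failure-free) demonstrations are illustrative.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split conformal threshold for anomaly scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    Under exchangeability, a new nominal score exceeds this value with
    probability at most alpha. If n is too small for the requested alpha,
    the threshold is infinite (nothing is flagged).
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")
    return sorted(calibration_scores)[rank - 1]
```

An observation would then be flagged as anomalous when its score exceeds the returned threshold.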

Abstract:
Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general purpose embodied intelligence. However, current SOTA VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios, and then finetuned to map explicit instructions to actions. Consequently, due to the lack of reasoning-intensive pretraining and reasoning-guided manipulation, these models are unable to perform implicit human intention reasoning required for complex, real-world interactions. To overcome these limitations, we propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning, endowing the model with both reasoning and perception capabilities. In the following finetuning stage, IntentionVLA employs the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experimental results show that IntentionVLA substantially outperforms pi_0, achieving 18% higher success rates with direct instructions and 28% higher than ECoT under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves over twice the success rate of all baselines, and further enables zero-shot human-robot interaction with 40% success rate. These results highlight IntentionVLA as a promising paradigm for next-generation human-robot interaction (HRI) systems.

Abstract:
Non-prehensile (NP) manipulation, in which robots alter object states without forming stable grasps (for example, pushing, poking, or sliding), significantly broadens robotic manipulation capabilities when grasping is infeasible or insufficient. However, enabling a unified framework that generalizes across different tasks, objects, and environments while seamlessly integrating non-prehensile and prehensile (P) actions remains challenging: robots must determine when to invoke NP skills, select the appropriate primitive for each context, and compose P and NP strategies into robust, multi-step plans. We introduce AdaptPNP, a vision-language model (VLM)-empowered task and motion planning framework that systematically selects and combines P and NP skills to accomplish diverse manipulation objectives. Our approach leverages a VLM to interpret visual scene observations and textual task descriptions, generating a high-level plan skeleton that prescribes the sequence and coordination of P and NP actions. A digital-twin based object-centric intermediate layer predicts desired object poses, enabling proactive mental rehearsal of manipulation sequences. Finally, a control module synthesizes low-level robot commands, with continuous execution feedback enabling online task plan refinement and adaptive replanning through the VLM. We evaluate AdaptPNP across representative P&NP hybrid manipulation tasks in both simulation and real-world environments. These results underscore the potential of hybrid P&NP manipulation as a crucial step toward general-purpose, human-level robotic manipulation capabilities. More detailed information can be found at https://adaptpnp.github.io/

Abstract:
Understanding spatial affordances---comprising the contact regions of object interaction and the corresponding contact poses---is essential for robots to effectively manipulate objects and accomplish diverse tasks. However, existing spatial affordance prediction methods mainly focus on locating the contact regions while delegating the pose to independent pose estimation approaches, which can lead to task failures due to inconsistencies between predicted contact regions and candidate poses. In this work, we propose RoboPCA, a pose-centered affordance prediction framework that jointly predicts task-appropriate contact regions and poses conditioned on instructions. To enable scalable data collection for pose-centered affordance learning, we devise Human2Afford, a data curation pipeline that automatically recovers scene-level 3D information and infers pose-centered affordance annotations from human demonstrations. With Human2Afford, scene depth and the interaction object's mask are extracted to provide 3D context and object localization, while pose-centered affordance annotations are obtained by tracking object points within the contact region and analyzing hand-object interaction patterns to establish a mapping from the 3D hand mesh to the robot end-effector orientation. By integrating geometry-appearance cues through an RGB-D encoder and incorporating mask-enhanced features to emphasize task-relevant object regions into the diffusion-based framework, RoboPCA outperforms baseline methods on image datasets, simulation, and real robots, and exhibits strong generalization across tasks and categories.

Abstract:
Tactile sensing is an important sensing modality for robot manipulation. Among different types of tactile sensors, magnet-based sensors, like u-skin, strike a good balance between tactile density, durability, and compactness. However, the large sim-to-real gap of tactile sensors prevents robots from acquiring useful tactile-based manipulation skills from simulation data, a recipe that has been successful for achieving complex and sophisticated control policies. Prior work has used binarization techniques to bridge the sim-to-real gap for dexterous in-hand manipulation with magnet-based sensors. However, binarization inherently loses much information that is useful in many other tasks, e.g., insertion. In our work, we propose GCS, a novel sim-to-real technique to learn contact-rich insertion skills with dense, distributed, 3-axis tactile readings from magnet-based tactile sensors. We evaluated our approach on blind insertion tasks and show successful zero-shot sim-to-real transfer of RL policies with raw tactile readings as input.

Abstract:
This paper presents a unified framework that combines symmetry-aware skill transfer with energy-tank passive control to achieve safe and adaptive ankle exoskeleton assistance. Subject-specific ankle references are first extracted from wearable IMU data: Dynamic Time Warping (DTW) aligns gait cycles onto a normalized phase axis, and Gaussian Mixture Regression (GMR) synthesizes smooth probabilistic templates suitable for online modulation. When only unilateral sensing is available, contralateral trajectories are reconstructed through either a half-period phase shift or a DTW-informed nonlinear mapping, enabling robust bilateral assistance. These references are then tracked by a joint-space PID controller wrapped with an energy tank, which bounds power exchange and prevents unintended energy injection. In simulation experiments, the proposed controller improved center-of-mass smoothness relative to plain PID. Benchtop validation confirms the efficacy of both GMR-generated and symmetric-generated trajectories. Furthermore, experimental results show a reduction of 40 N in peak interaction force (from 120 N to 80 N), resulting in less mechanical strain on the user. By unifying phase-consistent gait synthesis with passivity shaping, this work advances ankle exoskeleton assistance that is individualized, robust, and inherently safe.
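
The half-period phase shift mentioned above can be illustrated with a circular shift of one gait cycle. This is a minimal sketch assuming the reference is sampled uniformly over a normalized phase axis and that gait is left-right symmetric; the function name `contralateral_by_half_period` is hypothetical.

```python
def contralateral_by_half_period(reference):
    """Reconstruct the opposite leg's trajectory by a half-period circular shift.

    `reference` holds one gait cycle sampled uniformly over phase in [0, 1).
    The contralateral value at sample i is the reference value half a cycle earlier.
    """
    n = len(reference)
    half = n // 2
    return [reference[(i - half) % n] for i in range(n)]
```

For an even number of samples, applying the shift twice returns the original cycle, which is a quick sanity check on the symmetry assumption.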

Abstract:
Recent advances in large language models (LLMs) have demonstrated their promising capability to generate robot operation code to enable LLM-driven robots. To enhance the reliability of operation code generated by LLMs, corrective designs with feedback from the observation of executing code have been increasingly adopted in existing research. However, the code execution in these designs relies on either a physical experiment or a customized simulation environment, which limits their deployment due to the high configuration effort of the environment and the potentially long execution time. In this paper, we explore the possibility of directly leveraging an LLM to enable static simulation of robot operation code, and then leverage it to design a new reliable LLM-driven corrective robot operation code generation framework. Our framework configures the LLM as a static simulator with enhanced capabilities that reliably simulate robot code execution by interpreting actions, reasoning over state transitions, analyzing execution outcomes, and generating semantic observations that accurately capture trajectory dynamics. To validate the performance of our framework, we performed experiments on various operation tasks for different robots, including UAVs and small ground vehicles. The experimental results demonstrated not only the high accuracy of our static text-based simulation but also the reliable code generation of our LLM-driven corrective framework, which achieves performance comparable to state-of-the-art research without relying on dynamic code execution using physical experiments or simulators.

Abstract:
This work focuses on modeling dynamic urban environments for autonomous driving simulation. Contemporary data-driven methods using neural radiance fields have achieved photorealistic driving scene modeling, but they suffer from low rendering efficiency. Recently, some approaches have explored 3D Gaussian splatting for modeling dynamic urban scenes, enabling high-fidelity reconstruction and real-time rendering. However, these approaches often neglect to model fine-grained variations between frames and camera viewpoints, leading to suboptimal results. In this work, we propose a new approach named ArmGS that exploits composite driving Gaussian splatting with multi-granularity appearance refinement for autonomous driving scene modeling. The core idea of our approach is to devise a multi-level appearance modeling scheme that optimizes a set of transformation parameters for composite Gaussian refinement at multiple granularities, ranging from the local Gaussian level to the global image level and the dynamic actor level. This not only models global scene appearance variations between frames and camera viewpoints, but also models local fine-grained changes of background and objects. Extensive experiments on multiple challenging autonomous driving datasets, namely, Waymo, KITTI, NOTR and VKITTI2, demonstrate the superiority of our approach over the state-of-the-art methods.

Abstract:
Source localization in a complex flow poses a significant challenge for multi-robot teams tasked with localizing the source of chemical leaks or tracking the dispersion of an oil spill. The flow dynamics can be time-varying and chaotic, resulting in sporadic and intermittent sensor readings, and complex environmental geometries further complicate a team's ability to model and predict the dispersion. To accurately account for the physical processes that drive the dispersion dynamics, robots must have access to computationally intensive numerical models, which can be difficult when onboard computation is limited. We present a distributed mobile sensing framework for source localization in which each robot carries a machine-learned, finite element model of its environment to guide information-based sampling. The models are used to evaluate an approximate mutual information criterion to drive an infotaxis control strategy, which selects sensing regions that are expected to maximize informativeness for the source localization objective. Our approach achieves faster error reduction compared to baseline sensing strategies and results in more accurate source localization compared to baseline machine learning approaches.

Abstract:
Accurate registration between a prior model and the real scene is essential for high-precision robotic manipulation, yet optical methods suffer from long calibration chains, line-of-sight constraints, and fabrication errors. We propose a calibration-free alternative that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and leveraging both contact and non-contact evidence. Our solver integrates a global-to-local search via 3D FFT correlation over low-discrepancy SO(3) samples, followed by continuous SE(3) refinement using Lie-algebra updates and analytic contact sensitivities. This pipeline yields efficient exploration and metric-grade convergence without fragile point correspondences. Simulations across free-form meshes achieved sub-0.04 mm and sub-0.4° accuracy and robustness to pose noise and contact loss. On a tooth-preparation robot, our method attained 0.42 mm and 3.75°, outperforming an optical tracker registration while requiring no external sensors. These results demonstrate a practical and precise registration strategy for surgical and industrial robots.

Abstract:
Autonomous driving demands reinforcement learning (RL) agents that are not only performant, but also stable, sample-efficient, and robust to uncertainty. However, conventional policy optimization methods often suffer from unstable convergence, sensitivity to reward scaling, and limited generalization in safety-critical or out-of-distribution scenarios. We propose Self-Reflection Policy Optimization (SRPO), a principled, model-free framework that introduces policy-level self-evaluation by benchmarking each policy iteration against its own historical performance. This self-reflection yields a reward-shaping signal based on relative improvement, which is redistributed across trajectory steps using a rank-based credit assignment mechanism. This design emphasizes informative steps, eliminates dependence on absolute reward magnitudes, and improves stability in practice. We theoretically show that a bounds-based variant of SRPO preserves policy optimality and convergence. Empirically, we evaluate SRPO on both Highway-env and the high-fidelity CARLA simulator under adversarial perturbations and out-of-distribution driving conditions. SRPO consistently improves training efficiency, robustness, and policy performance compared to the baseline techniques. These results position SRPO as a promising and theoretically grounded approach to more reliable decision-making for autonomous driving. The source code is available at: https://github.com/dejin-wang/SRPO_anonymous_code.

Abstract:
LiDARs are widely used for 3D depth reconstruction, but their performance is often limited by inherent hardware constraints that impose trade-offs between range, spatial resolution, and frame rate. Many LiDAR systems operate at low frame rates (e.g., 5-10 Hz), prioritizing long-range sensing over responsiveness to rapid scene changes. We present NeuroLiDAR, an adaptive depth sensing framework that achieves effective frame rates of up to �?6 Hz by fusing temporally sparse LiDAR data with temporally dense inputs from neuromorphic event cameras. NeuroLiDAR integrates two components, event-based keyframe detection and event-guided depth extrapolation, to dynamically adjust the sensing rate in response to scene dynamics. To evaluate our approach, we introduce ELiDAR, a dataset spanning outdoor and indoor scenarios, and show that NeuroLiDAR reduces depth reconstruction error by �?9% in RMSE while achieving adaptive frame rates between 27.8 and 47.3 Hz. Our code and dataset are available at https://github.com/darshanakgr/neurolidar.

Abstract:
Path planning is a challenging problem in robotics, for which numerous algorithms have been developed. Sampling-based algorithms, such as the Rapidly Exploring Random Tree (RRT) and its variants, are renowned for their ability to explore the search space efficiently. However, these algorithms consume a significant amount of computation time and memory to derive the optimal path. Furthermore, in the presence of challenging geometry such as narrow passages, fixed-shape robots often fail to navigate because of their structural constraints. This limitation raises the need to use reconfigurable robots, which are capable of changing their shape to access these confined areas. This paper proposes a novel path planning approach for a reconfigurable robot, based on a machine learning model, to address the aforementioned limitations. The proposed method employs a Convolutional Neural Network (CNN) model that predicts the sample distribution, a flow field, and robot configurations, and combines it with RRT; the resulting planner is termed Learning Guided Adaptive Reconfiguration with Vector field Oriented RRT (LARVO-RRT). The model has been trained using the optimal sample distributions and flow fields generated with the help of optimal paths from a customized A* algorithm. Experimental results demonstrate that the proposed method surpasses the existing learning and non-learning-based path planning algorithms in terms of time cost, iteration count, and path quality. Furthermore, the algorithm has been able to outperform the existing path planners even without considering the reconfigurations.

Abstract:
Accurate localization is a fundamental challenge in robotic autonomy, with applications ranging from autonomous driving to space proximity operations. Visual localization is a viable choice in GPS-denied environments, such as subterranean, indoor, urban, or space environments; however, its performance degrades under commonly encountered conditions, such as low light or varying illumination. This paper introduces LIDIA, an illumination-aware model of localization quality for Perception-Aware Planning. LIDIA involves the efficient integration of light source direction into the planning framework, enabling the prediction of visually informative regions in the map under varying lighting. Unlike prior geometric approaches, LIDIA jointly exploits geometric and photometric information without requiring computationally expensive real-time rendering, thereby preserving online applicability. Our results demonstrate that LIDIA consistently outperforms existing geometric methods such as FIF in predicting the information gain of candidate camera poses and in planning trajectories that achieve higher localization accuracy. To the best of our knowledge, this is the first approach to unify geometric and photometric reasoning in an efficient, active localization system, paving the way for robust autonomy in illumination-constrained environments.

Abstract:
Mechanical stimulation has recently been shown to be a promising approach to induce targeted cancer cell death. With precise field control, magnetic microrobots have been navigated to the tumor site to deliver mechanical stimulation as a new treatment approach. However, most magnetic microrobots suffer from low force output when generating mechanical stimulation. Acoustic microrobots, with a microbubble as their simplest form, generate strong mechanical stimulation yet lack precise position control. In this paper, leveraging the force output of the acoustic field and the precision of magnetic field control, we present a magnetic acoustic microbubble microrobot (MAM) that integrates magnetic navigation and acoustic stimulation. MAMs were fabricated with DSPC/PEG lipid shells and iron-oxide nanoparticles (IONPs) using a flow-focusing microfluidic method. The size of the fabricated monodispersed MAMs is approximately 10 μm. The fabricated MAMs were navigated by a quadrupole magnetic tweezer system, with a maximum field gradient of 23 T/m, and controlled to oscillate to generate mechanical stimulation under an acoustic transducer at 1 MHz. As a proof-of-concept, we applied MAM acoustic treatment to breast cancer cells (MDA-MB-231) and showed that MAM acoustic treatment led to reduced cell viability compared to the control group and the acoustic-only group. Considering that all the components for MAM fabrication are FDA-approved materials, MAM holds promise for clinical translation in tumor mechanical stimulation.

Abstract:
We study object importance-based vision risk object identification (Vision-ROI), a key capability for intelligent driving systems. Existing approaches are deterministic and ignore uncertainty, potentially compromising safety. For example, fixed decision thresholds in ambiguous scenarios can cause premature or delayed risk detection and temporally unstable predictions. These issues worsen under diverse contexts with multiple interacting risks that perturb where and when risks occur. However, current vision methods lack a principled way to model uncertainty jointly across space and time, limiting adaptability to scene complexity. We propose Risk Tube Prediction, a unified formulation for modeling spatiotemporal risk uncertainty. We further introduce a conformal prediction framework to provide coverage guarantees for the true risks and yield calibrated risk scores and uncertainty estimates. Specifically, we employ risk-category-aware calibrators that consider distinct characteristics to reduce confused calibration. To evaluate, we present a new dataset and metrics probing diverse scenario configurations with multi-risk coupling effects. We systematically analyze factors affecting uncertainty estimation, including scenario variations, per-risk category behavior, and perception error propagation. Our method delivers substantial improvements over prior approaches, enhancing vision-ROI robustness and downstream performance, such as reducing nuisance braking alerts. For more qualitative results, please visit our project webpage: https://hcis-lab.github.io/CRTP/

Abstract:
Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-coverage field of view (FOV), yet their potential remains under-leveraged in 3D object detection. Severe radial distortion challenges most BEV detectors by violating the fundamental assumption of uniform sampling. To bridge this gap, we propose Distortion-Aware PETR (DAPETR), a projection-free detector tailored for mixed pinhole-fisheye camera setups. DAPETR incorporates two key learned-adaptive modules: a unified distortion-aware positional embedding that harmonizes positional encodings for image representations with fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually adapts image features and 3D positional embeddings. In our experiments on a converted KITTI-360 benchmark, we systematically compare our learned-adaptive approach against PETR in polar coordinates (PolarPETR). We find that while both methods improve over the baseline, our learned modules achieve superior performance. Crucially, we uncover a negative interaction when combining both strategies, revealing that learned adaptation and explicit geometric re-parameterization can conflict. Our final DAPETR model significantly advances the research and benchmark for fisheye BEV detection, providing critical insights into effective distortion-aware 3D perception design other than image rectification.

Abstract:
We present ActiveUMI, a framework for a data collection system that transfers in-the-wild human demonstrations to robots capable of complex bimanual manipulation. ActiveUMI couples a portable VR teleoperation kit with sensorized controllers that mirror the robot's end-effectors, bridging human-robot kinematics via precise pose alignment. To ensure mobility and data quality, we introduce several key techniques, including immersive 3D model rendering, a self-contained wearable computer, and efficient calibration methods. ActiveUMI's defining feature is its capture of active, egocentric perception. By recording an operator's deliberate head movements via a head-mounted display, our system learns the crucial link between visual attention and manipulation. We evaluate ActiveUMI on six challenging bimanual tasks. Policies trained exclusively on ActiveUMI data achieve an average success rate of 70% on in-distribution tasks and demonstrate strong generalization, retaining a 56% success rate when tested on novel objects and in new environments. Our results demonstrate that portable data collection systems, when coupled with learned active perception, provide an effective and scalable pathway toward creating generalizable and highly capable real-world robot policies.

Abstract:
Deploying multi-robot systems in environments shared with dynamic and uncontrollable agents presents significant challenges, especially for large robot fleets. In such environments, individual robot operations can be delayed due to unforeseen conflicts with uncontrollable agents. While existing research primarily focuses on preserving the completeness of Multi-Agent Path Finding (MAPF) solutions considering delays, there is limited emphasis on utilizing additional environmental information to enhance solution quality in the presence of other dynamic agents. To this end, we propose Flow-Aware Multi-Agent Path Finding (FA-MAPF), a novel framework that integrates learned motion patterns of uncontrollable agents into centralized MAPF algorithms. Our evaluation, conducted on a diverse set of benchmark maps with simulated uncontrollable agents and on a real-world map with recorded human trajectories, demonstrates the effectiveness of FA-MAPF compared to state-of-the-art baselines. The experimental results show that FA-MAPF can consistently reduce conflicts with uncontrollable agents, by up to 55%, without compromising task efficiency.

Abstract:
Controlling behavioral diversity is a pivotal challenge in multi-agent reinforcement learning (MARL), particularly in complex collaborative scenarios. While existing methods attempt to regulate behavioral diversity by directly differentiating across all agents, they lack deep characterization and learning of multi-agent composition structures. This limitation leads to suboptimal performance or coordination failures when facing more complex or challenging tasks. To bridge this gap, we introduce Structured Diversity Control (SDC), a framework that redefines the system-wide diversity metric as a weighted combination of intra-group diversity, which is minimized for cohesion, and inter-group diversity, which is maximized for specialization. The trade-off is governed by a pre-set Diversity Structure Factor (DSF), allowing for fine-grained, group-aware control over the collective strategy. Our method directly constrains the policy architecture without altering reward functions. This structural definition of diversity enables SDC to deliver substantial performance gains across various experiments, including increasing average rewards by up to 47.1% in multi-target pursuit and reducing episode lengths by 12.82% in complex neutralization scenarios. The proposed method offers a novel analytical perspective on the problem of cooperation in group-aware multi-agent systems.
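
One way to picture a group-structured diversity metric is the sketch below. The abstract does not give the formula, so everything here is an assumption for illustration: diversity is measured as mean pairwise Euclidean distance between per-agent behavior vectors, inter-group diversity is computed between group centroids, and the DSF enters as a weight in [0, 1] that rewards inter-group and penalizes intra-group diversity.

```python
import math
from itertools import combinations


def mean_pairwise_distance(vectors):
    """Average Euclidean distance over all unordered pairs (0.0 if fewer than two)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def structured_diversity(groups, dsf):
    """Return (score, intra, inter) for a list of groups of behavior vectors.

    Assumed form: score = dsf * inter - (1 - dsf) * intra, with dsf in [0, 1].
    """
    intra = sum(mean_pairwise_distance(g) for g in groups) / len(groups)
    centroids = [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
    inter = mean_pairwise_distance(centroids)
    return dsf * inter - (1.0 - dsf) * intra, intra, inter


# Two hypothetical groups of two agents each, with 2D behavior vectors.
groups = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
score, intra, inter = structured_diversity(groups, dsf=0.75)
```

In this toy setting the score rises when the two groups move apart and falls when agents within a group diverge, which is the qualitative behavior the abstract describes.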

Abstract:
Achieving expert-like robotic task execution in dynamic environments typically requires extensive, high-quality expert demonstrations, a significant bottleneck for real-world deployment. We present a novel learning framework that overcomes this data dependency, enabling robots to perform complex periodic tasks with expert-like proficiency, even when learning from naive demonstrations. Our two-stage pipeline first selects a representative demonstration based on user-defined information-aware task intention scores. This single best demo is then used to extract a canonical motion shape via Periodic Dynamic Movement Primitives (DMPs). Finally, a Long Short-Term Memory (LSTM) network refines the entire set of demonstrations, leveraging a multi-objective score that combines the canonical shape with mutual information and other task quality metrics. The proposed approach is demonstrated on a Franka Research 3 robot performing phasic tasks across three contrasting domains: wiping in human assistive services, weaving in the textile industry, and pick-and-place operations for warehouse automation. Visit project page at: https://focaslab.github.io/beyondtheteacher.

Abstract:
Deformable Linear Objects (DLOs), such as cables and ropes, pose significant challenges for robotic manipulation due to their high-dimensional state space, nonlinear deformation dynamics, and strong sensitivity to external forces. Cable routing tasks, in particular, are further complicated by geometric constraints, residual stresses in stiff cables, and the necessity of precise alignment with designated connectors. Existing approaches often rely on endpoint manipulation or external fixtures, which limits flexibility and scalability in real-world applications. While data-driven and graph-based models have shown promise for flexible ropes, they struggle to generalize across varying cable stiffness and suffer from high computational costs. To address these challenges, we propose Adaptive Curvature-Aware Routing (ACR), a dual manipulation framework capable of adaptively handling cables of high stiffness and arbitrary lengths. Specifically, our framework combines local curvature analysis with Radial Basis Function Networks (RBFNs) to predict cable deformations. By prioritizing regions with high curvature discrepancies, it adaptively selects manipulation segments and performs safe, precise corrective actions to shape the cable toward the target configuration without heavy reliance on fixtures. Furthermore, we develop a constraint-aware cooperative controller that integrates both kinematic feasibility and physical safety into the motion strategy. Experiments in both simulation and real-world setups demonstrate that ACR significantly outperforms baseline methods in terms of success rate and terminal accuracy, validating the effectiveness of combining curvature-based adaptivity with data-driven modeling for complex cable routing tasks.

Abstract:
Actor-critic models are a class of model-free deep reinforcement learning (RL) algorithms that have demonstrated effectiveness across various robot learning tasks. While considerable research has focused on improving training stability and data sampling efficiency, most deployment strategies have remained relatively simplistic, typically relying on direct actor policy rollouts. In contrast, we propose PACHS (Parallel Actor-Critic Heuristic Search), an efficient parallel best-first search algorithm for inference that leverages both components of the actor-critic architecture: the actor network generates actions, while the critic network provides cost-to-go estimates to guide the search. Two levels of parallelism are employed within the search: actions and cost-to-go estimates are generated in batches by the actor and critic networks respectively, and graph expansion is distributed across multiple threads. We demonstrate the effectiveness of our approach in robotic manipulation tasks, including collision-free motion planning and contact-rich interactions such as non-prehensile pushing. Visit p-achs.github.io for demonstrations and examples.
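
The division of labor between actor and critic can be sketched as a small, single-threaded best-first search. This is only an illustration under stated assumptions, not PACHS itself: the `actor` and `critic` functions are hand-written stand-ins for trained networks, the state is an integer on a line, each action costs 1, the priority is cost so far plus the critic's estimate, and both levels of parallelism described above are omitted.

```python
import heapq


def actor(state):
    """Stand-in for the actor network: proposes candidate actions for a state."""
    return [-1, 1, 3]


def critic(state, goal):
    """Stand-in for the critic network: estimated cost-to-go to the goal."""
    return abs(goal - state) / 3.0


def best_first_search(start, goal, max_expansions=1000):
    """Expand the most promising state first; return the action sequence or None."""
    # Frontier entries: (priority, cost so far, state, actions taken).
    frontier = [(critic(start, goal), 0, start, [])]
    best_cost = {start: 0}
    for _ in range(max_expansions):
        if not frontier:
            break
        _, cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return plan
        for action in actor(state):  # the actor generates successors
            nxt, new_cost = state + action, cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                priority = new_cost + critic(nxt, goal)  # the critic guides the search
                heapq.heappush(frontier, (priority, new_cost, nxt, plan + [action]))
    return None


plan = best_first_search(start=0, goal=7)
```

In a batched variant, the calls to `actor` and `critic` inside the loop would be grouped across several popped states, which is where the first level of parallelism described in the abstract would enter.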

Abstract:
With the growing deployment of surveillance systems in factories, offices, and homes, integrating them with robots offers a promising direction for collaborative and efficient task execution. However, existing approaches largely focus on single-robot scenarios and struggle with multi-view collaboration in large-scale environments. In this paper, we present a novel indoor collaborative object navigation dataset built on Habitat-Sim, featuring 206 cameras across 74 floors. The dataset enables systematic evaluation of an agent's ability to exploit multi-view surveillance information. To address the limitations of single-robot perception, we propose SurveilNav, a collaborative navigation framework that integrates active camera scheduling, joint 2D/3D mapping, VLM-based value estimation, and collaborative target verification. By synergizing the robot's dynamic local perception with the static global view of surveillance, this architecture effectively overcomes both the limited perception range of single agents and the inherent blind spots of fixed cameras, resolving inefficient exploration. Experimental results on the HM3D dataset demonstrate that SurveilNav substantially outperforms existing methods in large-scale environments, achieving state-of-the-art performance in both exploration efficiency and navigation success rate. Moreover, the system shows strong potential for applications in large-scale search, home environments, and rescue missions.

Abstract:
Pre-trained robot policies serve as the foundation of many validated robotic systems, which encapsulate extensive embodied knowledge. However, they often lack the semantic awareness characteristic of foundation models, and replacing them entirely is impractical in many situations due to high costs and the loss of accumulated knowledge. To address this gap, we introduce GUIDES, a lightweight framework that augments pre-trained policies with semantic guidance from foundation models without requiring architectural redesign. GUIDES employs a fine-tuned vision-language model (Instructor) to generate contextual instructions, which are encoded by an auxiliary module into guidance embeddings. These embeddings are injected into the policy's latent space, allowing the legacy model to adapt to this new semantic input through brief, targeted fine-tuning. For inference-time robustness, a large language model-based Reflector monitors the Instructor's confidence and, when confidence is low, initiates a reasoning loop that analyzes execution history, retrieves relevant examples, and augments the VLM's context to refine subsequent actions. Extensive validation in the RoboCasa simulation environment across diverse policy architectures shows consistent and substantial improvements in task success rates. Real-world deployment on a UR5 robot further demonstrates that GUIDES enhances motion precision for critical sub-tasks such as grasping. Overall, GUIDES offers a practical and resource-efficient pathway to upgrade, rather than replace, validated robot policies.

Abstract:
Robotic systems demand accurate and comprehensive 3D environment perception, requiring simultaneous capture of photo-realistic appearance (optical), precise layout shape (geometric), and open-vocabulary scene understanding (semantic). Existing methods typically achieve only partial fulfillment of these requirements while exhibiting optical blurring, geometric irregularities, and semantic ambiguities. To address these challenges, we propose OmniMap. Overall, OmniMap represents the first online mapping framework that simultaneously captures optical, geometric, and semantic scene attributes while maintaining real-time performance and model compactness. At the architectural level, OmniMap employs a tightly coupled 3DGS-Voxel hybrid representation that combines fine-grained modeling with structural stability. At the implementation level, OmniMap identifies key challenges across different modalities and introduces several innovations: adaptive camera modeling for motion blur and exposure compensation, hybrid incremental representation with normal constraints, and probabilistic fusion for robust instance-level understanding. Extensive experiments show OmniMap's superior performance in rendering fidelity, geometric accuracy, and zero-shot semantic segmentation compared to state-of-the-art methods across diverse scenes. The framework's versatility is further evidenced through a variety of downstream applications, including multi-domain scene Q&A, interactive editing, perception-guided manipulation, and map-assisted navigation.

Abstract:
We design and experimentally evaluate two anti-backlash mechanisms for cycloidal reducers. The two mechanisms are integrated into variations of a proposed design of quasi-direct drive actuator. Three prototypes are realised to compare the two mechanisms against the baseline design. We evaluate the effectiveness of the anti-backlash mechanisms under varying preload with measurements of friction, backlash, and stiffness. The results demonstrate that the anti-backlash mechanisms are effective at reducing backlash by approx. 2-3x, at the expected expense of increased friction (<2x).

Abstract:
Manipulating elasto-plastic objects remains a significant challenge due to severe self-occlusion, difficulties of representation, and complicated dynamics. This work proposes a novel framework for elasto-plastic object manipulation with a quasi-static assumption for motions, leveraging 3D occupancy to represent such objects, a learned dynamics model trained with 3D occupancy, and a learning-based predictive control algorithm to address these challenges effectively. We build a novel data collection platform to collect full spatial information and propose a pipeline for generating a 3D occupancy dataset. To infer the 3D occupancy during manipulation, an occupancy prediction network is trained with multiple RGB images supervised by the generated dataset. We design a deep neural network empowered by a 3D convolutional neural network (CNN) and a graph neural network (GNN) to predict the complex deformation with the inferred 3D occupancy results. A learning-based predictive control algorithm is introduced to plan the robot's actions, incorporating a novel shape-based action initialization module specifically designed to improve the planner's efficiency. The proposed framework can successfully shape elasto-plastic objects into a given goal shape and has been verified in various experiments both in simulation and the real world.

Abstract:
In this paper, we introduce a novel approach for modeling a memory-efficient spatial representation with 3D Gaussian splatting. Efficient vision-based spatial representation poses a significant challenge due to the memory demands of visual information. Recent advances in 3D rendering technologies, such as neural radiance fields (NeRF) and 3D Gaussian splatting, have prompted exploration of their applications in robotics. However, such 3D rendering methods often focus on rendering high-quality images, requiring numerous parameters and resulting in large data, which are unsuitable for robotics applications. To tackle this challenge, we introduce 3DSR, an efficient voxelized renderable neural 3D spatial representation that utilizes 3D Gaussian splatting. 3DSR leverages the strengths of both voxelization (memory efficiency) and 3D Gaussian splatting (high-quality image reconstruction). The proposed method achieves memory efficiency by reducing the number of 3D Gaussians in the 3D representation through voxelization, while preserving the image quality required for effective vision-based robotic applications. Experimental results demonstrate that 3DSR achieves over 90% of the best method's reconstruction quality while requiring only 54.54% of its memory. Additional experiments on visual localization and navigation further confirm that the proposed method is readily applicable to robotics.

Abstract:
The recent development of connected and automated vehicle (CAV) technologies has spurred investigations to optimize dense urban traffic, maximizing vehicle speed and throughput. This article explores advisory autonomy, in which real-time driving advisories are issued to human drivers, thus achieving near-term performance of automated vehicles. Due to the complexity of traffic systems, recent studies of coordinating CAVs have leveraged deep reinforcement learning (RL). Coarse-grained advisory is formalized as zero-order holds, and we consider a range of hold durations from 0.1 to 40 s. However, despite the similarity of the higher frequency tasks for CAVs, a direct application of deep RL fails to generalize to advisory autonomy tasks. To overcome this, we employ zero-shot transfer, training policies on a set of source tasks (specific traffic scenarios with designated hold durations) and then evaluating the efficacy of these policies on different target tasks. We introduce temporal transfer learning (TTL) algorithms to select source tasks for zero-shot transfer, systematically leveraging the temporal structure to solve the full range of tasks. TTL selects the most suitable source tasks to maximize the performance of the range of tasks. We validate our algorithms on diverse mixed-traffic scenarios, demonstrating that TTL more reliably solves the tasks than baselines. This article underscores the potential of coarse-grained advisory autonomy with TTL in traffic flow optimization.
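
The zero-order hold that formalizes coarse-grained advisories can be illustrated with a short sketch: the policy is queried once per hold period and its output is repeated until the next query. The 0.1 s simulation step and the toy policy are assumptions made for the example; only the 0.1 to 40 s range of hold durations comes from the abstract.

```python
def hold_steps_for(duration_s, sim_dt_s=0.1):
    """Number of simulation steps covered by one advisory (at least one)."""
    return max(1, round(duration_s / sim_dt_s))


def zero_order_hold(policy, observations, hold_steps):
    """Query `policy` every `hold_steps` steps and repeat its output in between."""
    actions, held = [], None
    for step, obs in enumerate(observations):
        if step % hold_steps == 0:
            held = policy(obs)  # a new advisory is issued here
        actions.append(held)    # the human driver keeps following the last advisory
    return actions


# Hypothetical policy: advise a speed equal to twice the observed value.
advised = zero_order_hold(lambda obs: 2 * obs, [1, 2, 3, 4, 5], hold_steps=2)
```

Under the assumed 0.1 s step, a 0.1 s hold duration reduces to ordinary per-step control, while a 40 s duration keeps each advisory fixed for 400 steps, which is what makes the longer-hold tasks harder to solve.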

Abstract:
Many real-world tasks, such as assembly, cooking, and object handovers, require bi-manual coordination. Learning such skills via imitation remains challenging due to dataset scarcity, mainly caused by the high cost of bi-manual robotic platforms and barriers to entry in robotics software. To address these challenges, we introduce (1) ROBOTNAME, a low-cost, bi-manual humanoid robot priced at approximately 14K that achieves 0.2 mm repeatability and supports a 5 kg payload per arm, and (2) a Python-first distributed control framework for seamless teleoperation, data collection, and policy deployment, designed for ease of use and installable via pip. We conducted imitation learning experiments in both simulation and the real world, integrating the robot with perception models, motion planning, and a large language model. The results demonstrate that ROBOTNAME is a stable, user-friendly, and high-precision dual-arm platform. We expect that the ROBOTNAME hardware, control system, and curated dataset of bi-manual manipulation episodes will advance affordable and scalable dual-arm robotics.

Abstract:
Optimization has been widely used to generate smooth trajectories for motion planning. However, existing trajectory optimization methods show weakness when dealing with large-scale long trajectories. Recent advances in parallel computing have accelerated optimization in some fields, but how to efficiently solve trajectory optimization via parallelism remains an open question. In this paper, we propose a novel trajectory optimization framework based on the Consensus Alternating Direction Method of Multipliers (CADMM) algorithm, which decomposes the trajectory into multiple segments and solves the subproblems in parallel. The proposed framework reduces the time complexity to O(1) per iteration with respect to the number of segments, compared to O(N) of the state-of-the-art (SOTA) approaches. Furthermore, we introduce a closed-form solution that integrates convex linear and quadratic constraints to speed up the optimization, and we also present a numerical solution for general convex inequality constraints. A series of simulations and experiments demonstrate that our approach outperforms the SOTA approach in terms of efficiency and smoothness. In particular, for a large-scale trajectory with one hundred segments, it achieves over a tenfold speedup. To fully explore the potential of our algorithm on modern parallel computing architectures, we deploy our framework on a GPU and show high performance with thousands of segments.
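
The structure of consensus ADMM can be seen on a toy scalar problem: minimize the sum of 0.5 * (x - a_i)^2 over a shared variable, whose solution is the mean of the a_i. This is only a sketch of the generic algorithm, not the paper's trajectory formulation; the quadratic local costs, the penalty `rho`, and the iteration count are assumptions chosen so that each local update has a closed form.

```python
def consensus_admm(targets, rho=1.0, iterations=60):
    """Scaled-form consensus ADMM for min_z sum_i 0.5 * (z - targets[i]) ** 2."""
    n = len(targets)
    u = [0.0] * n  # scaled dual variables, one per subproblem
    z = 0.0        # shared consensus variable
    for _ in range(iterations):
        # Local updates: each depends only on its own data, so they could run in parallel.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Consensus update: average of the local estimates plus their duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual updates: accumulate each subproblem's disagreement with the consensus.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z


result = consensus_admm([1.0, 2.0, 6.0])
```

Because the local updates do not depend on one another, the per-iteration wall-clock time stays flat as subproblems are added, provided there is one worker per subproblem; this is the property behind the O(1) per-iteration claim above.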

Abstract:
This work proposes a learning method to accelerate robotic pick-and-place planning by predicting shared grasps. Shared grasps are defined as grasp poses feasible to both the initial and goal object configurations in a pick-and-place task. Traditional analytical methods for solving shared grasps evaluate grasp candidates separately, leading to substantial computational overhead as the candidate set grows. To overcome the limitation, we introduce an Energy-Based Model (EBM) that predicts shared grasps by combining the energies of feasible grasps at both object poses. The formulation enables early identification of promising candidates and significantly reduces the search space. Experiments show that our method improves grasp selection performance, offers higher data efficiency, and generalizes well to varying grasps and table heights, given that variations fall within the learned distributions.
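
The idea of combining the energies at both object poses can be sketched as follows. The abstract does not state how the energies are combined, so summation is an assumption here, as are the two hand-written energy functions (stand-ins for a trained EBM) and the representation of a grasp as a single angle in degrees.

```python
def rank_shared_grasps(candidates, energy_initial, energy_goal, top_k=3):
    """Rank candidates by combined energy (lower is better) and keep the best top_k."""
    scored = [(energy_initial(g) + energy_goal(g), g) for g in candidates]
    scored.sort(key=lambda pair: pair[0])
    return [g for _, g in scored[:top_k]]


# Hypothetical energies: low when the grasp angle suits the respective object pose.
def energy_at_initial_pose(angle):
    return (angle - 30.0) ** 2


def energy_at_goal_pose(angle):
    return (angle - 50.0) ** 2


candidates = [0.0, 20.0, 40.0, 60.0, 80.0]
best = rank_shared_grasps(candidates, energy_at_initial_pose, energy_at_goal_pose)
```

Only the few lowest-energy candidates would then be passed to the more expensive feasibility checks, which is how the search space shrinks relative to evaluating every candidate at each pose separately.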

Abstract:
Precise situational awareness is vital for the safe deployment of artificial intelligence in real-world scenarios, especially in assisted and automated driving (AAD) systems. Panoptic segmentation, which unifies semantic and instance segmentation, plays a key role in identifying objects, hazards, and drivable areas at the pixel level. However, the relationship between segmentation robustness and camera image quality remains insufficiently explored. To address this, we propose a unified pipeline to evaluate the robustness of panoptic segmentation models for automotive cameras, correlating their performance with eight traditional image quality assessment (IQA) metrics. We introduce D-Cityscapes+, a degraded dataset featuring 19 realistic automotive degradations at multiple severity levels, including novel darkness and snowfall models with veiling effects. Evaluation across 14 state-of-the-art backbones reveals: (1) large-particle degradations (e.g., lens droplets, heavy snow) cause severe performance drops and edge-concentrated errors; (2) Transformer-based models are more robust but limited by <2 FPS and >500 GFLOPs; (3) frequency-domain IQA metrics such as CW-SSIM strongly correlate with segmentation performance; and (4) generic image restoration does not consistently improve perception. The findings and benchmark (https://github.com/Warwick-Jocelyn/BRPS) provide practitioners with diagnostic tools, datasets, and guidelines for developing robust, real-time panoptic

Abstract:
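Multi-UAV cooperative coverage missions across geographically separated regions face significant challenges from potential UAV failures during mission execution. To address these challenges efficiently, this paper introduces multi-region coverage with dynamic failure handling, a novel coverage path planning algorithm for multiple UAVs across geographically separated regions with real-time path reconfiguration. The proposed method, GRIT-M (greedy-repair-initialized multiple tabu search), extends the GRIT algorithm to efficiently handle both initial planning and online path repair when UAVs fail during the mission. Unlike existing methods, which either focus only on single-region coverage or lack failure-handling mechanisms, GRIT-M incorporates region-specific knowledge through three key innovations: (1) a composite transition cost function that optimizes inter-region movement, (2) ...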

Abstract:
In multi-automated guided vehicle (AGV) environments, inefficient service placement increases energy consumption and charging cycles, lowering battery lifespan. Consequently, minimizing energy consumption is key for maintaining operational efficiency and sustainability. Additionally, the unpredictable arrival of service requests in multi-AGV systems can lead to system saturation. However, previous research overlooked the energy costs of on-device computation, especially under dynamic service arrivals. To address these challenges, this work proposes an energy minimization service placement algorithm (EMSPA). The results demonstrate that EMSPA outperforms a baseline random selection (RS) algorithm for different numbers of AGVs, services, and tasks per service, reducing normalized energy consumption by up to 2.34% and improving mean service acceptance rates by up to 16.09% with linear execution time overhead. Further, EMSPA outperforms a queue-aware scheduling and deadlock mitigation strategy (QASDMS) in terms of processing power ratio by over 58.94%.

Abstract:
Multi-robot systems face the challenge of efficiently allocating teams of heterogeneous robots to tasks. The task allocation problem is complicated by collaborative interactions between robots where teams of robots develop emergent capabilities that enable them to complete tasks that would be inefficient or impossible for individual robots. To address these challenges, we present an iterative clustering algorithm for collaborative task allocation in heterogeneous multi-robot systems. This approach partitions the computationally intractable global optimization problem into smaller, tractable subproblems by iteratively forming clusters of robots and tasks, then optimizing assignments within each cluster. By ensuring robots remain clustered with their currently assigned tasks, we guarantee monotonic improvement in overall utility with each iteration. We analyze the convergence of the algorithm and characterize how cluster size constraints determine which suboptimal assignments could trap the algorithm. In simulation, iterative clustering consistently outperforms simulated annealing and a group-based auction in both computation time and solution quality, and outperforms a hedonic game approach in solution quality.

Abstract:
In this paper, we propose an autonomous robot packing system named RoboPacker designed to tightly store cluttered general objects into shipping boxes with high space utilization, which is a fundamental process in numerous industrial applications. However, achieving tight packaging for general objects often demands significant labor from human packers, particularly in high-throughput scenes. Compared to existing robot packing approaches, RoboPacker effectively overcomes challenges such as diverse object appearances, severe occlusion, and crowded packing spaces. Specifically, we propose an open-vocabulary shape estimation method to reconstruct complete point clouds for cluttered objects. We also design effective interactions with object clutter to gather informative visual clues for shape estimation under high uncertainty. Additionally, we introduce a hierarchical reinforcement learning framework to optimize packing order, location, and orientation for maximum space utilization. The robotic packing system integrates these techniques with feasible manipulation methods for real-world implementation. In this way, RoboPacker achieves efficient packing of novel and irregular objects, which is more suitable for real deployment environments. Real-world experiments demonstrate that RoboPacker can tightly pack 20 densely cluttered everyday objects from 8 seen and 4 novel classes into a 40 x 40 x 20 cm shipping box with a 73.3% success rate.

Abstract:
Robots are often required to localize in environments with unknown object classes and semantic ambiguity. However, when performing global localization using semantic objects, high semantic ambiguity intensifies object misclassification and increases the likelihood of incorrect associations, which in turn can cause significant errors in the estimated pose. Thus, in this letter, we propose a multi-label likelihood-based semantic graph matching framework for object-level global localization. The key idea is to exploit multi-label graph representations, rather than single-label alternatives, to capture and leverage the inherent semantic context of object observations. Based on these representations, our approach enhances semantic correspondence across graphs by combining the likelihood of each node with the maximum likelihood of its neighbors via context-aware likelihood propagation. For rigorous validation, data association and pose estimation performance are evaluated under both closed-set and open-set detection configurations. In addition, we demonstrate the scalability of our approach to large-vocabulary object categories in both real-world indoor scenes and synthetic environments.

Abstract:
Bipedal robots have advantages in maneuvering human-centered environments, but face greater failure risk compared to other stable mobile platforms such as wheeled or quadrupedal robots. While learning-based traversability has been widely studied for these platforms, bipedal traversability has instead relied on manually designed rules with limited consideration of locomotion stability on rough terrain. In this work, we present the first learning-based traversability estimation and risk-sensitive navigation framework for bipedal robots operating in diverse, uneven environments. TravFormer, a transformer-based neural network, is trained to predict bipedal instability with uncertainty, enabling risk-aware and adaptive planning. Based on the network, we define traversability as stability-aware command velocity: the fastest command velocity that keeps instability below a user-defined limit. This velocity-based traversability is integrated into a hierarchical planner that combines traversability-informed Rapidly-exploring Random Tree Star (TravRRT) for time-efficient planning and Model Predictive Control (MPC) for safe execution. We validate our method in MuJoCo simulation and the real world, demonstrating improved navigation performance, with enhanced robustness and time efficiency across varying terrains compared to existing methods.
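
The velocity-based definition of traversability can be sketched as a search over candidate command velocities. The predictor below is a hand-written stand-in for the learned network, and the way uncertainty enters (mean plus a weighted standard deviation compared against the limit) is an assumption for illustration rather than the paper's formulation.

```python
def stability_aware_velocity(predict_instability, velocities, limit, risk_weight=1.0):
    """Return the fastest candidate velocity whose risk-adjusted instability stays within the limit.

    `predict_instability(v)` must return a (mean, std) pair; 0.0 is returned if
    no candidate velocity is safe.
    """
    feasible = []
    for v in velocities:
        mean, std = predict_instability(v)
        if mean + risk_weight * std <= limit:
            feasible.append(v)
    return max(feasible) if feasible else 0.0


# Hypothetical terrain cell: instability grows with speed, with constant uncertainty.
def toy_predictor(velocity):
    return 0.2 * velocity, 0.05


candidates = [0.2, 0.4, 0.6, 0.8, 1.0]
v_safe = stability_aware_velocity(toy_predictor, candidates, limit=0.2)
```

Raising `risk_weight` makes the estimate more conservative on terrain where the predictor is uncertain, which is one simple way a planner could trade speed for robustness.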

Abstract:
Differential drive robots are widely used in various scenarios thanks to their straightforward principle, from household service robots to disaster response field robots. The nonholonomic dynamics and possible lateral slip of these robots lead to difficulty in getting feasible and high-quality trajectories. Although there are several types of driving mechanisms for real-world applications, they all share a similar driving principle, which involves controlling the relative motion of independently actuated tracks or wheels to achieve both linear and angular movement. Therefore, a comprehensive trajectory optimization to compute trajectories efficiently for various kinds of differential drive robots is highly desirable. In this paper, we propose a universal trajectory optimization framework, enabling the generation of high-quality trajectories within a restricted computational timeframe for these robots. We introduce a novel trajectory representation based on polynomial parameterization of motion states or their integrals, such as angular and linear velocities, which inherently matches the robots' motion to the control principle. The trajectory optimization problem is formulated to minimize computation complexity while prioritizing safety and operational efficiency. We conduct extensive simulations and real-world testing in crowded environments with three kinds of differential drive robots to validate the effectiveness of our approach.

Abstract:
We present VISTA (Viewpoint-based Image selection with Semantic Task Awareness), an active exploration method for robots to plan informative trajectories that improve 3D map quality in areas most relevant for task completion. Given an open-vocabulary search instruction (e.g., "find a person"), VISTA enables a robot to explore its environment to search for the object of interest, while simultaneously building a real-time semantic 3D Gaussian Splatting reconstruction of the scene. The robot navigates its environment by planning receding-horizon trajectories that prioritize semantic similarity to the query and exploration of unseen regions of the environment. To evaluate trajectories, VISTA introduces a novel, efficient viewpoint-semantic coverage metric that quantifies both the geometric view diversity and task relevance in the 3D scene. On static datasets, our coverage metric outperforms state-of-the-art baselines, FisherRF and Bayes' Rays, in computation speed and reconstruction quality. In quadrotor hardware experiments, VISTA achieves 6x higher success rates in challenging maps, compared to baseline methods, while matching baseline performance in less challenging maps. Lastly, we show that VISTA is platform-agnostic by deploying it on a quadrotor drone and a Spot quadruped robot. Code and videos can be found on our project page: https://stanfordmsl.github.io/VISTA/.

Abstract:
Visual target navigation is a critical capability for autonomous robots operating in unknown environments, particularly in human-robot interaction scenarios. While classical and learning-based methods have shown promise, most existing approaches lack common-sense reasoning and are typically designed for single-robot settings, leading to reduced efficiency and robustness in complex environments. To address these limitations, we introduce Co-NavGPT, a novel framework that integrates a Vision Language Model (VLM) as a global planner to enable common-sense multi-robot visual target navigation. Co-NavGPT aggregates sub-maps from multiple robots with diverse viewpoints into a unified global map, encoding robot states and frontier regions. The VLM uses this information to assign frontiers across the robots, facilitating coordinated and efficient exploration. Experiments on the Habitat-Matterport 3D (HM3D) dataset demonstrate that Co-NavGPT outperforms existing baselines in terms of success rate and navigation efficiency, without requiring task-specific training. Ablation studies further confirm the importance of semantic priors from the VLM. We also validate the framework in real-world scenarios using quadrupedal robots. Supplementary video and code are available at: https://sites.google.com/view/co-navgpt2-submission.

Abstract:
Cloud robotics allows low-power robots to perform computationally intensive inference tasks by offloading them to the cloud, raising privacy concerns when transmitting sensitive images. Although end-to-end encryption secures data in transit, it does not prevent misuse by inquisitive third-party services since data must be decrypted for processing. This paper tackles these privacy issues in cloud-based object detection tasks for service robots. We propose a co-trained encoder-decoder architecture that retains only task-specific features while obfuscating sensitive information, utilizing a novel weak loss mechanism with proposal selection for privacy preservation. A theoretical analysis of the problem is provided, along with an evaluation of the trade-off between detection accuracy and privacy preservation through extensive experiments on public datasets and a real robot.

Abstract:
In this paper, we present SonarSplat, a novel Gaussian splatting framework for imaging sonar that demonstrates realistic novel view synthesis and models acoustic streaking phenomena. Our method represents the scene as a set of 3D Gaussians with acoustic reflectance and saturation properties. We develop a novel method to efficiently rasterize Gaussians to produce a range/azimuth image that is faithful to the acoustic image formation model of imaging sonar. In particular, we develop a novel approach to model azimuth streaking in a Gaussian splatting framework. We evaluate SonarSplat using real-world datasets of sonar images collected from an underwater robotic platform in a controlled test tank and in a real-world river environment. Compared to the state-of-the-art, SonarSplat offers improved image synthesis capabilities (+3.2 dB PSNR) and more accurate 3D reconstruction (77% lower Chamfer Distance). We also demonstrate that SonarSplat can be leveraged for azimuth streak removal.

Abstract:
Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.

Abstract:
Autonomous navigation in snowy environments is essential for snow removal robots operating in regions with heavy snowfall. However, snow accumulation obscures terrain features and introduces sensor noise, making reliable perception and navigation difficult. Moreover, snow removal robots typically operate only during winter, while the environment may change during other seasons, requiring the robot to adapt to new situations. To address these challenges, this study proposes a self-adaptive navigation framework that learns directly in real snowy environments without relying on simulation. The framework integrates reservoir computing (RC), reinforcement learning (RL), and artificial bee colony (ABC) optimization. In addition, a snow-region detection method based on thermal and grayscale images is introduced to guide the robot toward areas requiring snow removal.

Abstract:
Modern robotic controllers are typically designed in simulation and subsequently deployed on real robots. However, discrepancies between simulated and actual actuator torque often lead to sim-to-real (sim2real) problems. Various actuator modeling approaches have been proposed to address this problem, but when external torque sensors are used, it is difficult to measure the intrinsic actuator output torque due to disturbances from external load systems. This paper proposes an actuator modeling method that minimizes the influence of external systems. The friction torque of the actuator is first identified under no-load conditions, and the measured torque under loaded conditions is compensated accordingly to estimate the pure output torque. Experimental results across various actuators and load conditions demonstrate that the proposed model closely matches the measured torque, even in actuators with large friction. The proposed approach overcomes the limitations of modeling with external sensors and provides an effective solution for reducing sim2real problems in diverse actuator systems.

Abstract:
Visual place recognition in large-scale, indoor environments often suffers from perceptual aliasing due to structural symmetries and dynamic changes. This work presents a robust hierarchical topological mapping framework designed for long-term robot autonomy. Our system integrates multi-modal data (including 2D LiDAR, odometry, and RGB imagery) into a two-layer architecture. First, a Layout Layer is designed to capture the geometric structure of the environment. Then, a Visual Layer is used to encode image sequences. A key contribution is the dynamic map maintenance mechanism, which monitors the attenuation of edge weights to detect environmental transitions, such as the opening or closing of doors. This allows for seamless lifelong updates without human intervention in large-scale environments. We evaluate our approach using various visual descriptors (e.g., SuperGlue, Patch-NetVLAD, and SeqVLAD) within a sequence-based matching pipeline. Experimental results in a 750 m^2 real-world facility demonstrate that the proposed method achieves high discrimination and scalability, even in challenging open areas and symmetric corridors. This framework provides a reliable solution for assistive robotics navigating complex, evolving public spaces.

Abstract:
The reliable deployment of deep reinforcement learning in real-world settings requires the ability to generalize across a variety of conditions, including both in-distribution scenarios seen during training as well as novel out-of-distribution scenarios. In this work, we present a framework for dynamics generalization in deep reinforcement learning that unifies these two distinct types of generalization within a single architecture. We introduce a robust adaptation module that provides a mechanism for identifying and reacting to both in-distribution and out-of-distribution environment dynamics, along with a joint training pipeline that combines the goals of in-distribution adaptation and out-of-distribution robustness. Our algorithm GRAM achieves strong generalization performance across in-distribution and out-of-distribution scenarios upon deployment, which we demonstrate through extensive simulation and hardware locomotion experiments on a quadruped robot.

Abstract:
Autonomous drone racing has risen as a challenging robotic benchmark for testing the limits of learning, perception, planning, and control. Expert human pilots are able to fly a drone through a race track by mapping pixels from a single camera directly to control commands. Recent works in autonomous drone racing attempting direct pixel-to-commands control policies have either relied on intermediate representations that simplify the observation space or performed extensive bootstrapping using Imitation Learning (IL). This paper leverages DreamerV3 to train visuomotor policies capable of agile flight through a racetrack using only pixels as observations. In contrast to model-free methods like PPO or SAC, which are sample-inefficient and struggle in this setting, our approach acquires drone racing skills from pixels. Notably, a perception-aware behaviour of actively steering the camera toward texture-rich gate regions emerges without the need for handcrafted reward terms for the viewing direction. Our experiments show, in both simulation and real-world flight using a hardware-in-the-loop setup with rendered image observations, how the proposed approach can be deployed on real quadrotors at speeds of up to 9 m/s. These results advance the state of pixel-based autonomous flight and demonstrate that model-based reinforcement learning (MBRL) offers a promising path for real-world robotics research.

Abstract:
Aerial transportation robots using suspended cables have emerged as versatile platforms for disaster response and rescue operations. To maximize the capabilities of these systems, robots need to aggressively fly through tightly constrained environments, such as dense forests and structurally unsafe buildings, while minimizing flight time and avoiding obstacles. Existing methods geometrically over-approximate the vehicle and obstacles, leading to conservative maneuvers and increased flight times. We eliminate these restrictions by proposing PolyFly, an optimal global planner which considers a non-conservative representation for aerial transportation by modeling each physical component of the environment, and the robot (quadrotor, cable and payload), as independent polytopes. We further increase the model accuracy by incorporating the attitude of the physical components by constructing orientation-aware polytopes. The resulting optimal control problem is efficiently solved by converting the polytope constraints into smooth differentiable constraints via duality theory. We compare our method against the existing state-of-the-art approach in eight maze-like environments and show that PolyFly produces faster trajectories in each scenario. We also experimentally validate our proposed approach on a real quadrotor with a suspended payload, demonstrating the practical reliability and accuracy of our method.

Abstract:
Object pose estimation using visual data is crucial for robotic interaction with the environment. Many existing instance-level methods are restricted by their requirements for 3D CAD models or multiple object views, which limits their flexibility and generalizability. Overcoming this limitation is critical to enhance the adaptability of pose estimation systems. In this work, a novel pipeline that leverages recent advances in reconstruction techniques is presented to address these challenges. Large Reconstruction Models (LRMs) are an advanced neural architecture capable of generating 3D object models from a limited set of views. Nevertheless, the resulting 3D models often lack relevant geometric and texture details due to insufficient input information. This research presents InstantPose, an innovative zero-shot instance-level pose estimation method that, building upon LRMs, can determine the pose of unseen objects using as little as a single RGB-D query image. Extensive experiments demonstrate that InstantPose achieves remarkable performance in object pose estimation on the YCB-V dataset, compared to methods conceived to rely on a geometrically perfect object model. Furthermore, the 6D pose provided through the presented approach facilitates successful object grasping, highlighting its practical utility in robotic manipulation tasks.

Abstract:
Knee pain is prevalent in over 20% of the population, limiting the mobility of those affected. In turn, isokinetic dynamometers and robots have been used to facilitate rehabilitation for those still capable of ambulation. However, there are at most only a few wearable robots capable of delivering isokinetic training for bedridden patients. Here, we developed a wearable robot that provides bedside isokinetic training by utilizing a variable stiffness actuator and dynamic energy regeneration. The efficacy of this device was validated in a study involving six subjects with debilitating knee injuries. During two courses of rehabilitation over a total of three weeks, the average peak torque, average torque, and average work produced by their affected knees increased significantly by 81.0%, 101.4%, and 117.6%, respectively. Furthermore, the device's energy regeneration features were found capable of extending its operating time to 198 days under normal usage, representing a 57.8% increase over the same device without regeneration. These results suggest potential methodologies for delivering isokinetic joint rehabilitation to bedridden patients in areas with limited infrastructure.

Abstract:
This work presents a motion planning framework for robotic manipulators that computes collision-free paths directly in image space. The generated paths can then be tracked using vision-based control, eliminating the need for an explicit robot model or proprioceptive sensing. At the core of our approach is the construction of a roadmap entirely in image space. To achieve this, we explicitly define sampling, nearest-neighbor selection, and collision checking based on visual features rather than geometric models. We first collect a set of image-space samples by moving the robot within its workspace, capturing keypoints along its body at different configurations. These samples serve as nodes in the roadmap, which we construct using either learned or predefined distance metrics. At runtime, the roadmap generates collision-free paths directly in image space, removing the need for a robot model or joint encoders. We validate our approach through an experimental study in which a robotic arm follows planned paths using an adaptive vision-based control scheme to avoid obstacles. The results show that paths generated with the learned-distance roadmap achieved 100% success in control convergence, whereas the predefined image-space distance roadmap enabled faster transient responses but had a lower success rate in convergence.

Abstract:
Robotic telemanipulation, the human-guided manipulation of remote objects, plays a pivotal role in several applications, from healthcare to operations in harsh environments. While visual feedback from cameras can provide valuable information to the human operator, haptic feedback is essential for accessing specific object properties that are difficult to perceive by vision, such as stiffness. For the first time, we present a participant study demonstrating that operators can perceive the stiffness of remote objects during real-world telemanipulation with a dexterous robotic hand, when haptic feedback is generated from tactile sensing fingertips. Participants were tasked with squeezing soft objects by teleoperating a robotic hand, using two methods of haptic feedback: one based solely on the measured contact force, while the second also includes the squeezing displacement between the leader and follower devices. Our results demonstrate that operators are indeed capable of discriminating objects of different stiffness, relying on haptic feedback alone and without any visual feedback. Additionally, our findings suggest that the displacement feedback component may enhance discrimination with objects of similar stiffness.

Abstract:
The heat transfer tubes of the steam generator are critical components of the nuclear power system and require regular inspection to ensure safety. The SG-Climbot, a quadruped heat transfer tube inspection robot, is equipped with a guiding device capable of simultaneously aligning with and inspecting two heat transfer tubes. Furthermore, the guiding device must execute hundreds of pose configuration transformations to complete a localized coverage inspection, thereby presenting challenges to the robot's efficient autonomous planning. This letter presents a planning framework for the SG-Climbot's localized coverage inspection task. The framework consists of four planning levels: pair planning, position and orientation planning for the guiding device, inspection sequence planning, and time-optimal trajectory planning. A maximum matching algorithm suitable for robotic arms equipped with dual execution devices is proposed, achieving the optimal pairing of heat transfer tubes and reducing inspection time by over 48 minutes (18.32% improvement). In addition, we analyze the impact of various Traveling Salesman Problem (TSP) solving algorithms on sequence planning problems that require reaching numerous nodes within short operation times, reducing the arm operating time by 33.20 s (6.99% improvement). Finally, the effectiveness of the proposed planning algorithm was validated through simulations and experiments.

Abstract:
Industry 5.0 focuses on human-centric collaboration between humans and robots, prioritizing safety, comfort, and trust. This study introduces a data-driven framework to assess trust using behavioral indicators. The framework employs a Preference-Based Optimization algorithm to generate trust-enhancing trajectories based on operator feedback. This feedback serves as ground truth for training machine learning models to predict trust levels from behavioral indicators. The framework was tested in a chemical industry scenario where a robot assisted a human operator in mixing chemicals. Machine learning models classified trust with over 80% accuracy, with the Voting Classifier achieving 84.07% accuracy and an AUC-ROC score of 0.90. These findings underscore the effectiveness of data-driven methods in assessing trust within human-robot collaboration, emphasizing the valuable role behavioral indicators play in predicting the dynamics of human trust.

Abstract:
This paper introduces a high-response artificial muscle actuator using dimethyl ether combustion (HADEC), which is a novel method to enhance the responsiveness and force output of pneumatic actuators. The HADEC system integrates a McKibben-type artificial muscle filled with a combustible mixture of dimethyl ether (DME) and air, which is ignited to generate rapid fluid pressure through combustion. This approach achieves force, displacement, and response speeds comparable to those of biological muscles while maintaining the simplicity and low-cost structure of McKibben-type actuators. The system provides instantaneous force generation without the need for complex mechanisms such as latches or brakes. DME, which is an environmentally friendly fuel, ensures minimal emissions. Experimental results validate the effectiveness of HADEC in improving responsiveness, and the findings suggest superior force generation, faster response times, and high-frequency operability compared to conventional pneumatic actuators. Further, the paper discusses the potential for repeated actuation and highlights the benefits of HADEC in various robotic applications that require rapid and significant force.

Abstract:
Generative control policies have recently unlocked major progress in robotics. These methods produce action sequences via diffusion or flow matching, with training data provided by demonstrations. But existing methods come with two key limitations: they require expert demonstrations, which can be difficult or costly to obtain, and they are limited to relatively slow, quasi-static tasks. In this paper, we leverage a tight connection between sampling-based predictive control and generative modeling to address these issues. In particular, we introduce generative predictive control, a supervised learning framework for tasks with fast dynamics that are easy to simulate but difficult to demonstrate. We show how trained flow-matching policies can be warm-started at inference time, maintaining temporal consistency and enabling high-frequency feedback. We believe that generative predictive control offers a complementary approach to existing behavior cloning methods, and hope that it will pave the way toward generalist policies that extend beyond quasi-static demonstration-oriented tasks.

Abstract:
Collaborative planning under operational constraints is an essential capability for heterogeneous robot teams tackling complex large-scale real-world tasks. Unmanned Aerial Vehicles (UAVs) offer rapid environmental coverage, but flight time is often limited by energy constraints, whereas Unmanned Ground Vehicles (UGVs) have greater energy capacity to support long-duration missions, but movement is constrained by traversable terrain. Individually, neither can complete tasks such as environmental monitoring. Effective UAV-UGV collaboration therefore requires energy-constrained multi-UAV task planning, traversability-constrained multi-UGV path planning, and crucially, synchronized concurrent co-planning to ensure timely in-mission recharging. To enable these capabilities, we propose Collaborative Planning with Concurrent Synchronization (CoPCS), a learning-based approach that integrates a heterogeneous graph transformer for operationally constrained task encoding with a transformer decoder for joint, synchronized co-planning that enables UAVs and UGVs to act concurrently in a coordinated manner. CoPCS is trained end-to-end under a unified imitation learning paradigm. We conducted extensive experiments to evaluate CoPCS in both robotic simulations and physical robot teams. Experimental results demonstrate that our method provides the novel multi-robot capability of synchronized concurrent co-planning and substantially improves team performance. More details of this work are available on the project website: https://hcrlab.gitlab.io/project/CoPCS.

Abstract:
Grasping is a fundamental robot skill, yet despite significant research advancements, learning-based 6-DOF grasping approaches are still not turnkey and struggle to generalize across different embodiments and in-the-wild settings. We build upon the recent success in modeling the object-centric grasp generation process as an iterative diffusion process. Our proposed framework, GraspGen, consists of a Diffusion-Transformer architecture that enhances grasp generation, paired with an efficient discriminator to score and filter sampled grasps. We introduce a novel and performant on-generator training recipe for the discriminator. To scale GraspGen to both objects and grippers, we release a new simulated dataset consisting of over 53 million grasps. We demonstrate that GraspGen outperforms prior methods in simulations with singulated objects across different grippers, achieves state-of-the-art performance on the FetchBench benchmark for grasping in clutter, and performs well on a real robot with noisy visual observations.

Abstract:
Single-sensor (visual or LiDAR) simultaneous localization and mapping (SLAM) is fragile in complex environments, which makes visual-LiDAR fusion a mainstream direction in SLAM research. However, most existing fusion methods omit explicit modeling of feature uncertainties and do not quantify each feature's constraint strength on each degree of freedom (DoF) of the 6-DoF pose, thereby hindering the full exploitation of the complementary information across different sensors. In this paper, a tightly coupled visual-LiDAR SLAM method termed UCA-SLAM is proposed, which integrates closed-form uncertainty propagation and DoF-wise constraint analysis. Specifically, UCA-SLAM maintains uncertainties for visual map points and LiDAR voxel planes, and computes DoF-wise constraint strength for each feature. In the front-end tracking, the DoF-wise constraints of features are comprehensively analyzed, which provides an adaptive fusion mechanism for pose estimation, and an explicit uncertainty propagation from feature measurements to the 6-DoF pose is derived. The resultant feature and pose uncertainties are then used to weight the cost function in local bundle adjustment (BA) optimization of UCA-SLAM to improve the accuracy of the system. Extensive experiments conducted on public datasets and in real-world environments demonstrate that UCA-SLAM outperforms state-of-the-art visual-LiDAR fusion SLAM methods. UCA-SLAM is open-sourced to benefit the community.

Abstract:
Box/cabinet scenarios with stacked objects pose significant challenges for robotic motion due to visual occlusions and constrained free space. Traditional collision-free trajectory planning methods often fail when no collision-free paths exist, and may even lead to catastrophic collisions caused by invisible objects. To overcome these challenges, we propose an operational aware interactive motion planner (PaiP), a real-time closed-loop planning framework utilizing multimodal tactile perception. This framework autonomously infers object interaction features by perceiving motion effects at interaction interfaces. These interaction features are incorporated into grid maps to generate operational cost maps. Building upon this representation, we extend sampling-based planning methods to interactive planning by optimizing both path cost and operational cost. Experimental results demonstrate that PaiP achieves robust motion in narrow spaces. Project page: https://travelers-lab.github.io/PaiP/

Abstract:
Active scene reconstruction aims to autonomously recover the fine-grained appearance and structural details of complex unknown scenes. Existing approaches based on 2D topological or voxel-based abstractions often scale poorly to large environments and rely heavily on handcrafted features and heuristic rules, limiting scalability and robustness. To address these challenges, using an RGB-D camera on a mobile robot, we present a graph-based planning framework that integrates skeleton-derived topology, Bird's-Eye-View (BEV)-augmented graph inference, and offline Reinforcement Learning (RL) for policy optimization. The 3D skeleton graph captures full spatial connectivity, overcoming the limitations of 2D representations. BEV-augmented graph inference enriches node embeddings with semantic context, avoiding handcrafted feature design. The offline RL approach replaces heuristic planning with data-driven decision-making, while an additional Maximum Mean Discrepancy (MMD) term mitigates distributional shift before and after feature injection, improving stability. Extensive simulation results validate the efficacy of the proposed method. Real-world experiments demonstrate the zero-shot transferability of the learned policy, highlighting its potential for scalable, fine-grained scene reconstruction.

Abstract:
Object pose tracking is a fundamental and essential capability for robots performing tasks in home and industrial settings. The most commonly used sensors for this purpose are RGB-D cameras, which can hit limitations in highly dynamic environments due to motion blur and frame-rate constraints. Event cameras have remarkable features such as high temporal resolution and low latency, which make them potentially ideal vision sensors for object pose tracking at high speed. Even so, there are still only a few works on 6D pose tracking with event cameras. In this work, we take advantage of the high temporal resolution and propose a method that combines a propagation step with a pose correction strategy. Specifically, we use 6D object velocity obtained from event-based optical flow for pose propagation, after which a template-based local pose correction module is used for pose correction. Our learning-free method has comparable performance to state-of-the-art algorithms, and in some cases outperforms them for fast-moving objects. The results indicate the potential for using event cameras in highly dynamic scenarios where the use of deep network approaches is limited by low update rates.

Abstract:
In this work, we investigate cooperative strategies for an evader drone team of various sizes using multi-agent reinforcement learning in a multi-agent pursuit-evasion scenario. The objective of the evader team is to reach a goal with minimal velocity while not colliding with the pursuer team. The objective of the pursuer team is to defend the goal by catching evaders before they reach it. In this environment, we allow the pursuer to have superior control authority compared to the evader such that reaching the goal is challenging for the evader in a one-on-one scenario. The proposed strategy for an evader is to team up with an ally to lead pursuers into a collision with each other instead of intercepting the evader. We design policies using multi-agent proximal policy optimization, an actor-critic reinforcement learning method, and investigate how the learned strategy changes when we vary the size of the pursuer and evader teams. Finally, we demonstrate the learned policy's sim-to-real capabilities through a hardware demonstration.

Abstract:
The objective of constrained motion planning is to connect start and goal configurations while satisfying task-specific constraints. Motion planning becomes inefficient or infeasible when the configurations lie in disconnected regions, known as essentially mutually disconnected components (EMDs). Constraints further restrict feasible space to a lower-dimensional submanifold, while redundancy introduces additional complexity because a single end-effector pose admits infinitely many inverse kinematic solutions that may form discrete self-motion manifolds. This paper addresses these challenges by learning a connectivity-aware representation for selecting start and goal configurations prior to planning. Joint configurations are embedded into a latent space through multi-scale manifold learning across neighborhood ranges from local to global, and clustering generates pseudo-labels that supervise a contrastive learning framework. The proposed framework provides a connectivity-aware measure that biases the selection of start and goal configurations in connected regions, avoiding EMDs and yielding higher success rates with reduced planning time. Experiments on various manipulation tasks showed that our method achieves 1.9 times higher success rates and reduces the planning time by a factor of 0.43 compared to baselines.

Abstract:
Hierarchical Reinforcement Learning (HRL) is a potent paradigm for addressing long-horizon sequential decision-making in swarm confrontation. However, its strategic capabilities are often bottlenecked by high-level policies that struggle to reason over the dynamic, variable-sized observations of other agents. To address this, we introduce a novel decentralized HRL framework featuring a Transformer-based strategic policy. The Transformer's self-attention mechanism is uniquely suited to capture complex spatio-temporal relationships among a varying number of entities, enabling robust long-horizon task allocation. This high-level strategy is then translated by a low-level policy into collision-free navigation. In complex swarm confrontation scenarios, our method significantly outperforms established baselines, achieving win rates of up to 81%. Beyond this performance, the learned policies exhibit strong zero-shot generalization to larger swarms, offer decision-making interpretability via the attention mechanism, and foster the autonomous emergence of complex cooperative tactics. This work provides a blueprint for scalable, strategically sophisticated, and interpretable multi-agent systems.
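
The claim that self-attention handles a varying number of entities can be illustrated with a minimal sketch. It assumes a single attention head with no learned projections, and the names `attend`, `query`, and `entities` are hypothetical. It is not the architecture used in the paper; it only shows that the pooled output keeps a fixed size however many entities are observed.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over any number of entities."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return pooled, weights


# Hypothetical 2-D features of observed agents; the list length may change per step.
query = [1.0, 0.0]
entities = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

pooled3, w3 = attend(query, entities, entities)          # three entities
pooled2, w2 = attend(query, entities[:2], entities[:2])  # two entities
```

Both calls return a pooled vector of the same length, which is what lets a downstream policy head stay fixed while the swarm size changes.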

Abstract:
As robots become more integrated in society, their ability to coordinate with other robots and humans on multi-modal tasks (those with multiple valid solutions) is crucial. Such behaviors can be learned from expert demonstrations via imitation learning (IL), but when expert demonstrations are multi-modal, standard IL approaches usually average across modes or collapse to a single mode, preventing effective coordination. Inspired by the ability of diffusion models to capture complex multi-modal trajectory distributions in single-agent settings, we develop a diffusion-based framework for coordinated multi-modal behavior in multi-agent systems. However, existing multi-agent diffusion approaches typically require a centralized planner or explicit communication among agents. This assumption can fail in real-world scenarios where robots must operate independently or alongside agents, such as humans, with whom they cannot directly communicate. Therefore, we propose MIMIC-D, a Centralized Training, Decentralized Execution paradigm for multi-modal multi-agent IL via diffusion. We jointly train all agents' policies with full information, but execute using only local information to achieve implicit coordination. In simulation and hardware experiments, our method exhibits robust multi-modal coordination behavior in various tasks and environments, improving upon state-of-the-art baselines.

Abstract:
Aerial imagery provides essential global context for autonomous navigation, enabling route planning at scales inaccessible to onboard sensing. We address the problem of generating global costmaps for long-range planning directly from satellite imagery when entities and mission-specific traversal rules are expressed in natural language at test time. This setting is challenging since mission requirements vary, terrain entities may be unknown at deployment, and user prompts often encode compositional traversal logic. Existing approaches relying on fixed ontologies and static cost mappings cannot accommodate such flexibility. While foundation models excel at language interpretation and open-vocabulary perception, no single model can simultaneously parse nuanced mission directives, locate arbitrary entities in large-scale imagery, and synthesize them into an executable cost function for planners. We therefore propose OVerSeeC, a zero-shot modular framework that decomposes the problem into Interpret-Locate-Synthesize: (i) an LLM extracts entities and ranked preferences, (ii) an open-vocabulary segmentation pipeline identifies these entities from high-resolution imagery, and (iii) the LLM uses the user's natural language preferences and the masks to synthesize executable costmap code. Empirically, OVerSeeC handles novel entities, respects ranked and compositional preferences, and produces routes consistent with human-drawn trajectories across diverse regions, demonstrating robustness to distribution shifts. This shows that modular composition of foundation models enables open-vocabulary, preference-aligned costmap generation for scalable, mission-adaptive global planning.

Abstract:
Legged robots rely on accurate ground interaction awareness to traverse variable terrains, such as slippery surfaces. Existing slip detection methods often rely on kinematics and proprioception, which lack the sensitivity to detect early-stage slips that occur prior to catastrophic instability. Thus, this paper presents SlipSense, a novel framework for online force-based slip detection using a custom lightweight sensorized foot for quadrupeds. The framework integrates a multimodal sensor design with an LSTM-based model to infer ground reaction forces and detect slip-indicative anomalies during locomotion. The proposed framework is deployed on a Unitree Go1 quadruped to demonstrate blind online slip detection over a slippery terrain. Our method detects early-stage slips down to an average displacement of 24.1 ± 6.4 mm with an overall accuracy of 85.9%. This represents a 3.3-fold finer detection resolution and a 24% relative accuracy improvement over a standard kinematic baseline that uses foot velocity inferred through state estimation. The work in this paper serves as a foundation for force-aware gait adaptation in legged robotic locomotion, allowing future controllers to estimate terrain friction and adjust constraints, thus improving the overall stability of the system.

Abstract:
This paper presents a novel Koopman operator formulation for Euler-Lagrangian dynamics that employs an implicit generalized momentum-based state space representation, which decouples a known linear actuation channel from state-dependent dynamics and makes the system more amenable to linear Koopman modeling. By leveraging this structural separation, the proposed formulation requires learning only the unactuated dynamics rather than the complete actuation-dependent system, thereby significantly reducing the number of learnable parameters, improving data efficiency, and lowering overall model complexity. In contrast, conventional explicit formulations inherently couple inputs with the state-dependent terms in a nonlinear manner, making them more suitable for bilinear Koopman models, which are more computationally expensive to train and deploy. Notably, the proposed scheme enables the formulation of linear models that achieve superior prediction performance compared to conventional bilinear models while remaining substantially more efficient. To realize this framework, we present two neural network architectures that construct Koopman embeddings from actuated or unactuated data, enabling flexible and efficient modeling across different tasks. Robustness is ensured through the integration of a linear Generalized Extended State Observer (GESO), which explicitly estimates disturbances and compensates for them in real time. The combined momentum-based Koopman and GESO framework is validated through comprehensive trajectory tracking simulations and experiments on robotic manipulators, demonstrating superior accuracy, robustness, and learning efficiency relative to state-of-the-art alternatives.
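
For context, the textbook relation behind a momentum-based state space can be written out as below. This is a sketch under stated assumptions, not the paper's formulation: it assumes a fully actuated system with Lagrangian $L(q,\dot q) = \tfrac{1}{2}\dot q^{\top} M(q)\,\dot q - V(q)$ and generalized forces $\tau$, and the symbols $q$, $p$, $M$, $C$, $g$, and $\tau$ are introduced here for illustration only.

```latex
% Momentum-based (q, p) coordinates: the input enters through a constant, known channel.
\begin{aligned}
p &= \frac{\partial L}{\partial \dot q} = M(q)\,\dot q, \\
\dot q &= M(q)^{-1} p, \\
\dot p &= \frac{\partial L}{\partial q}(q,\dot q) + \tau .
\end{aligned}

% Explicit (q, \dot q) coordinates: the input is multiplied by the state-dependent M(q)^{-1}.
\ddot q = M(q)^{-1}\bigl(\tau - C(q,\dot q)\,\dot q - g(q)\bigr)
```

In the first form the actuation term is $\tau$ itself with an identity gain, so only the unactuated part $\partial L / \partial q$ and $M(q)^{-1}p$ depend on the state. In the second form the input is scaled by $M(q)^{-1}$, which is the input-state coupling that motivates bilinear models.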

Abstract:
Magnetic microrobots hold great promise for biomedical applications. However, achieving flexible magnetic field adjustment with a magnetic actuation system (MAS) to actuate diverse microrobots remains a significant challenge. In this work, we propose an Electromagnetic-Permanent Magnet Actuation (EPMA) system that generates controllable magnetic field variations to enable microrobot actuation for diverse tasks, including microrobotic actuation, microswarm pattern transformation and targeted delivery. Automatic ellipsoid calibration of the Hall sensors enables real-time magnetic field orientation measurement with an error under 3°. Experimental results demonstrate the microrobot's actuation performance in four distinct scenarios, with a rotation frequency of 0.5 Hz. Furthermore, by adjusting the dynamic magnetic field, we achieve microswarm pattern reconfiguration under static conditions as well as targeted delivery in fluidic environments at a flow speed of 52 mm/s and a rotation frequency of 4 Hz. This study presents a hybrid MAS for microrobotic actuation in diverse environments by controllable dynamic magnetic fields.

Abstract:
Autonomous and mobile soft robots require internal oscillators, similar to a biological heart, to generate rhythmic motions. However, existing soft oscillators typically have fixed operational parameters and suffer from an inherent coupling between control input and power output, limiting their versatility and adaptability. This paper addresses this challenge by introducing a new design paradigm: a soft, multi-port, bistable oscillator whose core nonlinear energy landscape can be continuously and actively tuned on the fly. Our approach, based on mechanically reconfiguring the physical constraints of a bistable elastomeric structure, achieves a decoupling of kinematics (frequency) from dynamics (output pressure). We demonstrate this principle in two modes. First, we demonstrate active programming, where we continuously modulate the oscillator's coupled frequency-amplitude relationship in real time under a constant power input. Second, we demonstrate passive adaptation, where an autonomous walker powered by our oscillator exhibits physical intelligence. By physically interacting with a confined environment, the walker autonomously and instantaneously adapts its gait from a low-frequency, large-amplitude mode to a high-frequency, small-amplitude mode. This work provides a new pathway for creating adaptive, intelligent soft robots that can autonomously respond to their physical world without any electronic computation.

Abstract:
Task and Motion Planning (TAMP) integrates high-level task planning with low-level motion feasibility, but existing methods are costly in long-horizon problems due to excessive motion sampling. While LLMs provide commonsense priors, they lack 3D spatial reasoning and cannot ensure geometric or dynamic feasibility. We propose a kinodynamic TAMP planner based on a hybrid state tree that uniformly represents symbolic and numeric states during planning, enabling task and motion decisions to be made jointly. Kinodynamic constraints embedded in the TAMP problem are verified by an off-the-shelf motion planner and physics simulator, and a VLM guides the exploration of a TAMP solution and backtracks the search based on visual rendering of the states. Experiments on the simulated domains and in the real world show average success rates increased by 32.14% to 1166.67% compared to traditional and LLM-based TAMP planners and reduced planning time on complex problems, with ablations further highlighting the benefits of VLM backtracking. More details are available at https://graphics.ewha.ac.kr/kinodynamicTAMP/.

Abstract:
Aerial manipulation requires force-aware capabilities to enable safe and effective grasping and physical interaction. Previous works often rely on heavy, expensive force sensors unsuitable for typical quadrotor platforms, or perform grasping without force feedback, risking damage to fragile objects. To address these limitations, we propose a novel force-aware grasping framework incorporating six low-cost, sensitive skin-like tactile sensors. We introduce a magnetic-based tactile sensing module that provides high-precision three-dimensional force measurements. We eliminate geomagnetic interference through a reference Hall sensor and simplify the calibration process compared to previous work. The proposed framework enables precise force-aware grasping control, allowing safe manipulation of fragile objects and real-time weight measurement of grasped items. The system is validated through comprehensive real-world experiments, including balloon grasping, dynamic load variation tests, and ablation studies, demonstrating its effectiveness in various aerial manipulation scenarios. Our approach achieves fully onboard operation without external motion capture systems, significantly enhancing the practicality of force-sensitive aerial manipulation. The supplementary video is available at: https://www.youtube.com/watch?v=mbcZkrJEf1I.

Abstract:
Retinal endolaser photocoagulation (REPC) is a repetitive intraocular surgical procedure that could greatly benefit from automation and distance-based control, improving both efficiency and safety. This work presents a robotic system designed for automated REPC, utilizing instrument-integrated optical coherence tomography (iiOCT) to facilitate real-time distance measurements. The system employs intraoperative spherical and ellipsoidal retinal models to convert 2D laser patterns into 3D arrangements, which are further refined through a control loop that incorporates online feedback. Ex vivo experiments in porcine eyes demonstrated clinical-level accuracy, with lateral and axial errors of 44 micrometers and 29 micrometers, respectively. Additionally, the proposed mapping technique produced patterns with greater equidistance than baseline methods. This system showcases the potential to automate repetitive surgical tasks while maintaining the surgeon's control over critical decision-making processes in ophthalmic surgery.

Abstract:
Physical touch, such as handshakes, plays a critical role in human-robot interaction (HRI), influencing perceived naturalness and social presence. This study investigates how arm compliance, hand grip strength, and motion synchrony jointly affect the subjective quality of a human-robot handshake. We implemented a fully actuated, tactile-sensorized humanoid hand mounted on a manipulator arm and designed a compliant, oscillatory handshake controller with adaptive synchronization. Sixteen participants experienced handshakes across a 2x2x2 factorial design, varying arm compliance, grip strength, and synchrony. Objective kinematic analysis revealed significant main and interaction effects across all factors. At the same time, subjective ratings showed clear preferences for a weaker grip and greater arm compliance, with synchrony exerting minimal influence on perceived naturalness. These results highlight a perceptual hierarchy in HRI: foundational haptic properties exert the strongest influence on user experience, while advanced kinematic adjustments have limited impact when basic comfort is lacking. This insight provides concrete guidance for designing robotic handshakes that feel more human-like and pleasant.

Abstract:
End-to-end autonomous driving has advanced greatly in recent years. However, most existing work focuses on small vehicles (e.g., cars). Driving articulated trucks, such as tractor-trailers, remains less explored. The underactuated nature and extended wheelbase of tractor-trailers pose considerable driving challenges, especially when navigating narrow roads. For example, when a left-hand-drive tractor-trailer makes a right turn on a two-way two-lane narrow road, the tractor usually needs to encroach on the opposing lane. Otherwise, the trailer may have insufficient space to turn right and may strike curbside objects. To provide a solution to this problem, we employ deep reinforcement learning to train an end-to-end autonomous driving policy with a trailer-aware reward function. Through planar rigid-body kinematics analysis, we locate the reference points on the tractor and the trailer. We also build a tractor-trailer model for CARLA. Experimental results demonstrate the effectiveness and superiority of our method in CARLA.

Abstract:
Simulating optical tactile sensors presents significant challenges due to their high deformability and intricate optical properties. To address these issues and enable a physically accurate simulation, we propose DOT-Sim: Differentiable Optical Tactile Simulation. Unlike prior simulators that rely on simplified models of deformable sensors, DOT-Sim accurately captures the physical behavior of soft sensors by modeling them as elastic materials using the Material Point Method (MPM). DOT-Sim enables rapid calibration of optical tactile sensor simulation using a small number of demonstrations within minutes, which is substantially faster than existing methods. Compared to current baselines, our approach supports much larger and non-linear deformations. To handle the optical aspect, we propose a novel approach to simulating optical responses by learning a residual image relative to the real-world idle state. We validate the physical and visual realism of our method through a series of zero-shot sim-to-real tasks. Our experiments show that DOT-Sim (1) accurately replicates the physical dynamics of a DenseTact optical tactile sensor in reality, (2) generates realistic optical outputs in contact-rich scenarios, (3) enables direct deployment of simulation-trained classifiers in the real world, achieving 85% classification accuracy on challenging objects and 90% accuracy in embedded tumor-type detection, and (4) allows precise trajectory following with a policy trained from demonstrations in simulation, with an average error of less than 0.9 mm.

Abstract:
Global path planning provides high-level guidance for autonomous navigation, supplying reference paths for downstream navigation and control modules. Deep Reinforcement Learning (DRL) has shown strong potential in this domain, but existing methods struggle with multi-scale map inputs. This limitation arises from inconsistent representations across different map sizes and trajectory length variations, which hinder feature extraction and destabilize policy learning. To address these challenges, we propose the Progressive Multi-Scale Curriculum Reinforcement Learning (MS-CRL) framework. MS-CRL incorporates a progressive curriculum reinforcement learning algorithm (ProgCRL) that mitigates instability from trajectory length discrepancies, a unified multi-scale representation (UniMS) that normalizes spatial scales and resolves representation inconsistencies, and a Global-Local Fusion Network (GLFNet) that fully extracts both global and local features from the new representation for robust cross-scale policy learning. Extensive experiments on multi-scale map datasets demonstrate that MS-CRL enables effective global path planning, stabilizes policy learning, and achieves superior performance in path success rate, path quality, and planning efficiency, while significantly improving training efficiency and cross-scale adaptability compared with state-of-the-art baselines.

Abstract:
As microdrones are increasingly deployed in indoor environments for applications ranging from warehouse inspection to emergency response, the challenge of precise automated landing emerges as a crucial barrier to their practical operation and ubiquitous adoption. Existing landing approaches often require complex hardware and substantial computation or perform unreliably indoors, making them impractical for palm-sized microdrones. We propose Moth, a low-cost infrared light-based solution that targets precise and efficient landing of low-resource microdrones. Moth consists of an infrared light source at the landing station along with an energy-efficient photodiode (PD) sensing platform attached to the bottom of the drone. At a cost under 83 USD, Moth achieves performance comparable to vision-based methods at a fraction of the energy consumption and computation. Moth requires only three PDs, without any complex pattern recognition models, to land the drone accurately, with under 10 cm of error, from up to 11.1 meters away.

Abstract:
Reinforcement Learning (RL) enables robust and adaptive locomotion in legged and wheeled-legged robots. A common approach is the Teacher-Student (TS) paradigm, in which a teacher policy with privileged information supervises a proprioceptive student. While the TS paradigm has proven effective on legged robots, we encounter two critical issues when applying it to wheeled-legged robots. One issue is multimodal confusion, where teacher actions become multimodal under the student's proprioceptive observations, resulting in the student generating averaged action modes. The other is low imitability of teacher actions, as the teacher overlooks whether the student can reproduce them. To address these issues, we propose Teaching to Individual Needs (TIN), a bidirectional TS framework. To mitigate multimodal confusion within the student policy, we design a Highest-Weight Component Mixture Density Network (HWC-MDN). By utilizing HWC-MDN, the TIN student can explicitly model multimodal action distributions and output the highest-weight component. To improve imitability, we propose an Imitation-Aware Reward (IAR) that encourages the teacher to generate actions that the student can reproduce more readily. Simulation experiments show that TIN significantly improves both training efficiency and traversability. Real-world tests illustrate that TIN enables the wheeled-legged robot MagicDog-W to traverse 45 cm obstacles and ascend 45° slopes.
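
The difference between an averaged action and a highest-weight component can be shown with a minimal sketch. It assumes a one-dimensional action and a two-component mixture whose weights and means are made-up numbers; the function names are hypothetical and the code does not reproduce the HWC-MDN network itself, only the selection rule the abstract describes.

```python
def mixture_mean(weights, means):
    """Weighted average of component means: the action a unimodal regressor drifts toward."""
    return sum(w * m for w, m in zip(weights, means))


def highest_weight_component(weights, means):
    """Return the mean of the component with the largest mixing weight."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return means[best]


# Hypothetical mixture over one action dimension: two distinct teacher modes.
weights = [0.55, 0.45]
means = [-1.0, 1.0]

averaged = mixture_mean(weights, means)               # lies between the modes
selected = highest_weight_component(weights, means)   # commits to one mode
```

The averaged value falls between the two modes and matches neither teacher behavior, whereas the selected value coincides with one of them.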

Abstract:
Mimicking the graceful motion of swimming animals remains a core challenge in soft robotics due to the complexity of fluid-structure interaction and the difficulty of controlling soft, biomimetic bodies. Existing modeling approaches are often computationally expensive and impractical for complex control or reinforcement learning needed for realistic motions to emerge in robotic systems. In this work, we present a tendon-driven fish robot modeled in an efficient underwater swimmer environment using a simplified, stateless hydrodynamics formulation implemented in the widespread robotics framework MuJoCo. With just two real-world swimming trajectories, we identify five fluid parameters that allow a matching to experimental behavior and generalize across a range of actuation frequencies. We show that this stateless fluid model can generalize to unseen actuation and outperform classical analytical models such as the elongated body theory. This simulation environment runs faster than real-time and can easily enable downstream learning algorithms such as reinforcement learning for target tracking, reaching a 93% success rate. Due to the simplicity and ease of use of the model and our open-source simulation environment, our results show that even simple, stateless models, when carefully matched to physical data, can serve as effective digital twins for soft underwater robots, opening up new directions for scalable learning and control in aquatic environments.

Abstract:
Deformable substrates such as sand and mud present significant challenges for terrestrial robots due to complex robot-terrain interactions. Inspired by mudskippers, amphibious animals that naturally adjust their tail morphology and movement jointly to navigate such environments, we investigate how tail design and control can jointly enhance flipper-driven locomotion on granular media. Using a bio-inspired robot modeled after the mudskipper, we experimentally compared locomotion performance between idle and actively oscillating tail configurations and found that tail oscillation increased forward speed by 17% while reducing body drag by 46%. Shear force measurements revealed that this improvement arises from oscillation-induced fluidization of the substrate, which lowers resistive forces acting on the body. Additionally, tail morphology strongly influenced the oscillation strategy: designs with larger horizontal surface areas leveraged the oscillation-induced reduction in shear resistance more effectively by limiting insertion depth. Based on these findings, we present a design principle to inform tail action selection based on substrate strength and tail morphology. Our results offer new insights into tail design and control for improving robot locomotion on deformable substrates, with implications for agricultural robotics, search and rescue, and environmental exploration.

Abstract:
Accurate object pose estimation is essential for robotic manipulation, particularly in tasks involving small or geometrically intricate objects where high precision is required. Existing vision, tactile, and hybrid-based approaches struggle with occlusion, noise, and limited generalization, often requiring extensive retraining or large annotated datasets. In this work, we present M-VTOP, a modular framework for in-hand object pose estimation that integrates vision, tactile, and contact sensing in a flexible manner, allowing robustness against noisy or missing modalities. At the core of the framework is a belief-based particle filter that fuses heterogeneous sensor observations, maintains probabilistic estimates, and continuously refines them toward high-precision convergence in closed-loop robotic control with the pose estimation feedback. A mask-based observation representation unifies visual and tactile signals into geometry-centric inputs, enhancing robustness to texture and lighting variations while supporting zero-shot generalization. The framework requires only an object's CAD model and avoids task-specific retraining. Experiments show that M-VTOP achieves sub-millimeter accuracy under complex geometries, occlusions, and tight tolerances, demonstrating its promise for high-precision robotic manipulation.
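
How a particle-based belief can fuse several modalities while tolerating a missing one can be sketched as follows. This is only the measurement-update step of a generic particle filter on a one-dimensional pose offset; it assumes Gaussian observation noise, omits the motion and resampling steps, and uses made-up readings. The names `update_belief`, `observations`, and the modality labels are hypothetical, and the mask-based observation model of the paper is not reproduced.

```python
import math


def gaussian_likelihood(residual, sigma):
    """Unnormalized Gaussian likelihood of a measurement residual."""
    return math.exp(-0.5 * (residual / sigma) ** 2)


def update_belief(particles, weights, observations):
    """Reweight particles with every available observation.

    Each observation is (value, sigma); a value of None marks a missing
    modality, which is skipped and leaves the belief unchanged.
    """
    new_weights = []
    for x, w in zip(particles, weights):
        for z, sigma in observations:
            if z is None:
                continue
            w *= gaussian_likelihood(z - x, sigma)
        new_weights.append(w)
    total = sum(new_weights)
    return [w / total for w in new_weights]


def estimate(particles, weights):
    """Weighted mean of the particle set."""
    return sum(x * w for x, w in zip(particles, weights))


# Hypothetical 1-D pose offset hypotheses in millimeters, with a uniform prior.
particles = [i * 0.5 for i in range(-10, 11)]
weights = [1.0 / len(particles)] * len(particles)

# (reading, noise std): coarse vision, missing contact signal, precise tactile.
observations = [(1.2, 1.0), (None, 0.2), (1.0, 0.2)]

posterior = update_belief(particles, weights, observations)
pose = estimate(particles, posterior)
```

The low-noise tactile reading dominates the posterior, and the missing contact reading is simply ignored rather than breaking the update.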

Abstract:
As robots are increasingly deployed in diverse application domains, enabling robust mobility across different embodiments has become a critical challenge. Classical mobility stacks, though effective on specific platforms, require extensive per-robot tuning and do not scale easily to new embodiments. Learning-based approaches, such as imitation learning (IL), offer alternatives, but are limited by the need for high-quality demonstrations for each embodiment. To address these challenges, we introduce COMPASS, a unified framework that enables scalable cross-embodiment mobility using expert demonstrations from only a single embodiment. We first pre-train a mobility policy on a single robot using IL, combining a world model with a policy model. We then apply residual reinforcement learning (RL) to efficiently adapt this policy to diverse embodiments through corrective refinements. Finally, we distill specialist policies into a single generalist policy conditioned on an embodiment embedding vector. This design significantly reduces the burden of collecting data while enabling robust generalization across a wide range of robot designs. Our experiments demonstrate that COMPASS scales effectively across diverse robot platforms while maintaining adaptability to various environment configurations, achieving a generalist policy with a success rate approximately 5× higher than the pre-trained IL policy, and further demonstrate zero-shot sim-to-real transfer.

Abstract:
Robotic bin packing is widely deployed in warehouse automation, with current systems achieving robust performance through heuristic and learning-based strategies. These systems must balance compact placement with rapid execution, where selecting alternative items or reorienting them can improve space utilization but introduce additional time. We propose a selection-based formulation that explicitly reasons over this trade-off: at each step, the robot evaluates multiple candidate actions, weighing expected packing benefit against estimated operational time. This enables time-aware strategies that selectively accept increased operational time when it yields meaningful spatial improvements. Our method, STEP (Space-Time Efficient Packing), uses a preference-conditioned, Transformer-based reinforcement learning policy, and allows generalization across candidate set sizes and integration with standard placement modules. It achieves a 44% reduction in operational time without compromising packing density. Additional material is available at https://step-packing.github.io.
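
The trade-off between packing benefit and operational time can be made concrete with a small sketch. It replaces the learned, preference-conditioned Transformer policy with a hand-written linear score, so it is an illustration of the selection-based formulation rather than STEP itself. The candidate names, benefit values, times, and the `time_weight` parameter are all hypothetical.

```python
def select_action(candidates, time_weight):
    """Pick the candidate with the best packing benefit minus weighted operational time."""
    return max(candidates, key=lambda c: c["benefit"] - time_weight * c["time_s"])


# Hypothetical candidates for one packing step.
candidates = [
    {"name": "place_as_is", "benefit": 0.60, "time_s": 4.0},
    {"name": "reorient", "benefit": 0.75, "time_s": 9.0},
    {"name": "swap_item", "benefit": 0.80, "time_s": 12.0},
]

fast = select_action(candidates, time_weight=0.05)      # time matters a lot
balanced = select_action(candidates, time_weight=0.02)  # intermediate preference
dense = select_action(candidates, time_weight=0.0)      # only density matters
```

Sweeping `time_weight` moves the choice from the quickest placement to the densest one, which is the kind of preference conditioning the abstract refers to. The function accepts any number of candidates, so the candidate set size is not fixed.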

Abstract:
Autonomous harvesting in the open presents a complex manipulation problem. In most scenarios, an autonomous system has to deal with significant occlusion and requires interaction in the presence of large structural uncertainties (every plant is different). Perceptual and modeling uncertainty make the design of reliable manipulation controllers for harvesting challenging, resulting in poor performance during deployment. We present a sim2real reinforcement learning (RL) framework for occlusion-aware plant manipulation, where a policy is learned entirely in simulation to reposition stems and leaves to reveal target fruit(s). In our proposed approach, we decouple high-level kinematic planning from low-level compliant control, which simplifies the sim2real transfer. This decomposition allows the learned policy to generalize across multiple plants with different stiffness and morphology. In experiments with multiple real-world plant setups, our system achieves up to 86.7% success in exposing target fruits, demonstrating robustness to occlusion variation and structural uncertainty.

Abstract:
Singular perturbation (SP)-based synthesis yields two-time-scale control that allows compliant-joint robots to achieve high-quality tracking at low implementation cost. Composite learning enables exact online identification and control of robots without the stringent condition known as persistent excitation (PE). However, to achieve exact online identification for compliant-joint robots, the parameter update derived from SP-based synthesis and composite learning requires physically unavailable states. This paper presents a novel SP-based composite learning robot control (SP-CLRC) strategy for compliant-joint robots that achieves exact online identification and control without requiring access to physically unavailable states. In the proposed method, link-side and actuator-side parameters are estimated separately, enabling exact online identification using available robot states. A two-time-scale composite learning method is proposed to guarantee practical exponential stability of the closed-loop system with parameter convergence under interval excitation, a condition strictly weaker than PE. Experiments on a two-degree-of-freedom robot driven by series elastic actuators show that the proposed SP-CLRC significantly outperforms the baseline in online identification and tracking accuracy.

Abstract:
We propose S3LAM, a novel RGB-D SLAM system that leverages 2D surfel splatting to achieve geometrically accurate scene representations for simultaneous tracking and mapping. Unlike existing 3DGS-based SLAM approaches that rely on 3D Gaussian ellipsoids, we utilize 2D Gaussian surfels as primitives for more efficient scene representation. By focusing on the surfaces of objects in the scene, this design enables S3LAM to reconstruct high-quality geometry, benefiting both mapping and tracking. To address inherent SLAM challenges, including real-time optimization under limited viewpoints, we introduce a novel adaptive surface rendering strategy that improves mapping accuracy while maintaining computational efficiency. We further derive camera pose Jacobians directly from the 2D surfel splatting formulation, highlighting the importance of our geometrically accurate representation, which improves tracking convergence. Extensive experiments on both synthetic and real-world datasets demonstrate that S3LAM achieves state-of-the-art performance. Our code is available at https://github.com/FanryZ/S3LAM.

Abstract:
We present SLNet, a lightweight backbone for 3D point cloud recognition designed to achieve strong performance without the computational cost of many recent attention-, graph-, and deep-MLP-based models. The model is built on two simple ideas: NAPE (Nonparametric Adaptive Point Embedding), which captures spatial structure using a combination of Gaussian RBF and cosine bases with input-adaptive bandwidth and blending, and GMU (Geometric Modulation Unit), a per-channel affine modulator that adds only 2D learnable parameters. These components are used within a four-stage hierarchical encoder with FPS+kNN grouping, nonparametric normalization, and shared residual MLPs. In experiments, SLNet shows that a very small model can still remain highly competitive across several 3D recognition tasks. On ModelNet40, SLNet-S with 0.14M parameters and 0.31 GFLOPs achieves 93.64% overall accuracy, outperforming PointMLP-elite with 5× fewer parameters, while SLNet-M with 0.55M parameters and 1.22 GFLOPs reaches 93.92%, exceeding PointMLP with 24× fewer parameters. On ScanObjectNN, SLNet-M achieves 84.25% overall accuracy, within 1.2 percentage points of PointMLP, while using 28× fewer parameters. For large-scale scene segmentation, SLNet-T extends the backbone with local Point Transformer attention and reaches 58.2% mIoU on S3DIS Area 5 with only 2.5M parameters, more than 17× fewer than Point Transformer V3. We also introduce NetScore+, which extends NetScore by incorporating latency and peak memory so that efficiency can be evaluated in a more deployment-oriented way. Across multiple benchmarks and hardware settings, SLNet delivers a strong overall balance between accuracy and efficiency. Code is available at: https://github.com/m-saeid/SLNet.
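
The FPS+kNN grouping step mentioned above is a standard building block of hierarchical point cloud encoders, and a minimal pure-Python sketch of it follows. It is not taken from the SLNet code base: the function names and the toy point set are hypothetical, and practical implementations run batched on a GPU rather than with Python loops.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_sampling(points, num_samples, start=0):
    """Greedily pick indices so each new point is farthest from those already chosen."""
    chosen = [start]
    # Distance from every point to its nearest chosen center so far.
    min_d = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < num_samples:
        nxt = max(range(len(points)), key=lambda i: min_d[i])
        chosen.append(nxt)
        min_d = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return chosen


def knn_group(points, center_idx, k):
    """Indices of the k points nearest to the given center, the center itself included."""
    order = sorted(range(len(points)), key=lambda i: sq_dist(points[i], points[center_idx]))
    return order[:k]


# Toy cloud with three well-separated pairs of points.
points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0),
    (1.0, 0.0, 0.0), (1.0, 0.1, 0.0),
    (0.0, 1.0, 0.0), (0.1, 1.0, 0.0),
]

centers = farthest_point_sampling(points, 3)
groups = [knn_group(points, c, 2) for c in centers]
```

Sampling spreads the centers across the three clusters, and each kNN group then gathers the local neighborhood that a shared MLP would process at that stage.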

Abstract:
Learning from demonstration effectively transfers human manipulation skills to robots. It can be especially useful for imitating industrial manipulation tasks which are performed by humans and are difficult to model such as deformable object manipulation. Manipulation of deformable objects often requires not only accurate tracking of the demonstration trajectory using a robot end-effector, but also the accommodation of interaction forces. Precise tracking of such trajectories while ignoring these interaction forces leads to overly stiff, unsafe, or unsuccessful executions. We address this problem by proposing Dual Quaternion based Compliant Movement Primitives (DQ-CMP). DQ-CMP couples a dual-quaternion based Dynamic Movement Primitive for compact 6-DoF pose encoding with learnable wrench primitives. This combination reproduces synchronized motion and force behaviors directly at the end-effector. The method is robot-agnostic and singularity-free at the representation-level, as it operates in operational space using dual quaternions. From a few demonstrations, required wrenches for unseen initial configurations are predicted using Gaussian process regression defined on the pose manifold. This enables generalization of the learned wrenches across different starting poses. We validate the method on real-robot experiments including a shoe-sole detachment for recycling and bending of stiff foam inside a box. Results show compliant, safe task execution and successful generalization to new initial poses.

Abstract:
Visual navigation is a core capability for mobile robots, yet end-to-end learning-based methods often struggle with generalization and safety in unseen, cluttered, or narrow environments. These limitations are especially pronounced in dense indoor settings, where collisions are likely and end-to-end models frequently fail. To address this, we propose SaferPath, a hierarchical visual navigation framework that leverages learned guidance from existing end-to-end models and refines it through a safety-constrained optimization-control module. SaferPath transforms visual observations into a traversable-area map and refines guidance trajectories using Model Predictive Stein Variational Evolution Strategy (MP-SVES), efficiently generating safe trajectories in only a few iterations. The refined trajectories are tracked by an MPC controller, ensuring robust navigation in complex environments. Extensive experiments in scenarios with unseen obstacles, dense unstructured spaces, and narrow corridors demonstrate that SaferPath consistently improves success rates and reduces collisions, outperforming representative baselines such as ViNT and NoMaD, and enabling safe navigation in challenging real-world settings.

Abstract:
In conventional learning-based robotic dynamics modeling, physical information is mostly incorporated into the model or loss function, while the design of training data often relies on random sampling or uniform coverage, which can limit performance. To address this gap, this paper proposes the CATALYST framework, which generates optimal training data based on physics priors and the modeling structure of the chosen learning model. Stage 1 uses the CAD-derived inertia matrix M(q) to approximate the joint distribution of [q, M] with a PLM, thereby identifying the optimal locations for the local model centers (mu_k^opt). Stage 2 then optimizes an Operating-Point-Centered Excitation Trajectory (OPCET). This optimization simultaneously (i) aligns the trajectory with the target operating points (l_m), (ii) enforces range-of-motion (RoM) constraints (l_r), and (iii) achieves desirable velocity-acceleration statistics (large volume, isotropy, low correlation, captured by l_s). We validate the approach in simulation using a 3-DoF yaw-pitch-pitch manipulator, which allows visual demonstration of the process and outcomes. We then analyze the framework step by step. Results show that each stage meets its objective. A PLM trained on data generated by the proposed trajectories outperforms baselines (Spread/RoM, ill‑centered, Tukey‑windowed chirp, and cubic) in both torque regression and control. Thus, CATALYST yields more accurate regression and more reliable feedforward control than conventional designs.

Abstract:
Suckers are significant for robots in picking, transferring, manipulation and locomotion on diverse surfaces. However, conventional suckers lack high-fidelity tactile perception, which impedes them from resolving the fine-grained geometric features and interaction status of the target surface. This limits their robust performance with irregular objects and in complex, unstructured environments. Inspired by the adaptive structure and high-performance sensory capabilities of cephalopod suckers, we propose a novel, intelligent sucker, named SuckTac, that integrates a camera-based tactile sensor directly within its optimized structure to provide high-density perception and robust suction. Specifically, through joint structural optimization and a multi-material integrated casting technique, a camera and light source are embedded into the sucker, which enables in-situ, high-density perception of fine details such as surface shape, texture, and roughness. To further enhance robustness and adaptability, the sucker's mechanical design is also optimized by refining its profile, adding a compliant lip, and incorporating surface microstructure. Extensive experiments, including challenging tasks such as robotic cloth manipulation and soft mobile robot inspection, demonstrate the superior performance and broad applicability of the proposed system.

Abstract:
Imitation learning has proven to be highly effective in teaching robots dexterous manipulation skills. However, it typically relies on large amounts of robot data, which limits its scalability and applicability in dynamic, real-world environments. One key challenge in this context is object generalization, where a robot trained to perform a task with one object, such as "hand over the apple," struggles to transfer its skills to a semantically similar but visually different object, such as "hand over the peach." This gap in generalization to new objects beyond those in the same category has yet to be adequately addressed in previous work on end-to-end visuomotor policy learning. In this paper, we present a simple yet effective approach for achieving object generalization through Vision-Language-Action (VLA) models, referred to as ObjectVLA. We design a lightweight image-text-data-synthesis pipeline, Search2Scene, which enables robots to generalize learned skills to novel objects without requiring explicit human demonstrations for each new target object. By leveraging vision-language pair data, our method provides a lightweight and scalable way to inject knowledge about the target object, establishing an implicit link between the object and the desired action. We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across 100 novel objects with a 64% success rate in selecting objects not seen during training. These results highlight the effectiveness of our approach in enabling object-level generalization and reducing the need for extensive human demonstrations, paving the way for more flexible and scalable robotic learning systems.

Abstract:
Sidewalk micromobility is a promising solution for last-mile transportation, but current learning-based control methods struggle in complex urban environments. Imitation learning (IL) learns policies from human demonstrations, yet its reliance on fixed offline data often leads to compounding errors, limited robustness, and poor generalization. To address these challenges, we propose a framework that advances IL through corrective behavior expansion and multi-scale imitation learning. On the data side, we augment teleoperation datasets with diverse corrective behaviors and sensor augmentations to enable the policy to learn to recover from its own mistakes. On the model side, we introduce a multi-scale IL architecture that captures both short-horizon interactive behaviors and long-horizon goal-directed intentions via horizon-based trajectory clustering and hierarchical supervision. Real-world experiments show that our approach significantly improves robustness and generalization in diverse sidewalk scenarios. Demo video and additional information are available on the project page.

Abstract:
High-definition (HD) maps are essential for autonomous driving, yet multi-modal fusion often suffers from inconsistency between camera and LiDAR modalities, leading to performance degradation under low-light conditions, occlusions, or sparse point clouds. To address this, we propose SEF-MAP, a Subspace-Expert Fusion framework for robust multi-modal HD map prediction. The key idea is to explicitly disentangle BEV features into four semantic subspaces: LiDAR-private, Image-private, Shared, and Interaction. Each subspace is assigned a dedicated expert, thereby preserving modality-specific cues while capturing cross-modal consensus. To adaptively combine expert outputs, we introduce an uncertainty-aware gating mechanism at the BEV-cell level, where unreliable experts are down-weighted based on predictive variance, complemented by a usage balance regularizer to prevent expert collapse. To enhance robustness in degraded conditions and promote role specialization, we further propose distribution-aware masking: during training, modality-drop scenarios are simulated using EMA-statistical surrogate features, and a specialization loss enforces distinct behaviors of private, shared, and interaction experts across complete and masked inputs. Experiments on nuScenes and Argoverse2 benchmarks demonstrate that SEF-MAP achieves state-of-the-art performance, surpassing prior methods by +4.2% and +4.8% in mAP, respectively. SEF-MAP provides a robust and effective solution for multi-modal HD map prediction under diverse and degraded conditions.
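
The idea of down-weighting unreliable experts according to predictive variance can be illustrated, for a single BEV cell with scalar expert outputs, by plain inverse-variance weighting. This is a standard stand-in chosen for illustration and is not the gating mechanism of SEF-MAP; the function name, the `eps` term, and all numbers are hypothetical.

```python
def inverse_variance_gate(expert_outputs, expert_variances, eps=1e-6):
    """Fuse scalar expert outputs with weights proportional to 1 / variance.

    Returns (fused_value, weights); the weights are non-negative and sum to 1.
    """
    precisions = [1.0 / (v + eps) for v in expert_variances]
    total = sum(precisions)
    weights = [p / total for p in precisions]
    fused = sum(w * y for w, y in zip(weights, expert_outputs))
    return fused, weights


# Four experts for one cell (e.g. LiDAR-private, Image-private, Shared, Interaction).
outputs = [0.9, 0.8, 0.85, 0.2]
variances = [0.01, 0.02, 0.01, 1.0]   # the last expert is very uncertain

fused, weights = inverse_variance_gate(outputs, variances)
```

In this toy case the fourth expert receives the smallest weight, so its outlying prediction barely moves the fused value.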

Abstract:
This letter presents a technique that allows unmanned vehicles to escort a human to their destination. Current human-centered following methods depend solely on human movement, which presents significant limitations. The complexity of human movement during tactical maneuvers can lead to erratic vehicle motion. Additionally, the static relative positioning between the human and vehicle creates a rigid following pattern, thereby constraining the vehicle's ability to dynamically adjust its position for optimal coverage. To address these limitations, we propose a data-driven end-to-end escorting system (EES) that takes into account both environmental information and human movement to achieve adaptive escorting. We propose a soft-coding paradigm to replace the traditional hard-coding intent modeling to address the inconsistency of human intention and vehicle motion, and establish human-scene following through a cross-modal attention gating network. We conducted experiments in the CARLA simulation and the real world. The results demonstrate that the proposed EES reduces prediction errors by 41.2% during overall processes and by 54.5% during cornering. Additionally, EES can adapt to various positions and dynamically adjust the relative positions between humans and unmanned systems to adapt to complex scenarios.

Abstract:
Recent advances in skill learning have propelled robot manipulation to new heights by enabling it to learn complex manipulation tasks from a practical number of demonstrations. However, these skills are often limited to the particular action, object, and environment instances that are shown in the training data, and have trouble transferring to other instances of the same category. In this work we present an open-vocabulary Spatial-Semantic Diffusion policy (S^2-Diffusion) which enables generalization from instance-level training data to category-level, enabling skills to be transferable between instances of the same category. We show that functional aspects of skills can be captured via a promptable semantic module combined with a spatial representation. We further propose leveraging depth estimation networks to allow the use of only a single RGB camera. Our approach is evaluated and compared on a diverse set of robot manipulation tasks, both in simulation and in the real world. Our results show that S^2-Diffusion is invariant to changes in category-irrelevant factors and achieves satisfactory performance on other instances within the same category, even if it was not trained on those specific instances. Project website: https://s2-diffusion.github.io.

Abstract:
Large vision-language models have been shown to perform complex tasks. However, aligning language instructions with object visual information to enable general inference for robotic grasping poses a significant challenge. To tackle this issue, we introduce GraspControl, a method that leverages grasp language instructions and sketches of objects to control the generation of grasps. Initially, we construct a dataset that augments language instructions with position and orientation information of grasps, and visual information with sketches of the gripper and target objects. Subsequently, we develop a model capable of generating 2D grasp sketches given grasp language and 2D object sketches as input prompts, thereby bridging the gap between the linguistic and visual representations of the object to be grasped. These generated 2D grasp sketches serve as an innovative input modality for grasp synthesis, directing the creation of 3D object models and corresponding 3D grasp poses through a 3D reconstruction module. Furthermore, we incorporate a multi-modal attention loss to ensure the consistency between high-level semantic grasp features and intricate low-level visual features, with a particular emphasis on the grasping area of the object. We evaluate the capabilities of our grasp approach through extensive experiments in both simulated and real-world robotic scenarios. The experimental results confirm that our method can execute grasps in complex environments.

Abstract:
Mobile service robots can benefit from object-level understanding of their environments, including the ability to distinguish object instances and re-identify previously seen instances. Object re-identification is challenging across different viewpoints and in scenes with significant appearance variation arising from weather or lighting changes. Existing works on object re-identification either focus on specific classes or require foreground segmentation. Further, these methods, along with object re-identification datasets, have limited consideration of challenges such as occlusions, outdoor scenes, and illumination changes. To address this problem, we introduce CODa Re-ID: an in-the-wild object re-identification dataset containing 1,037,814 observations of 557 objects across 8 classes under diverse lighting conditions and viewpoints. Further, we propose CLOVER, a representation learning method for object observations that can distinguish between static object instances without requiring foreground segmentation. We also introduce MapCLOVER, a method for scalably summarizing CLOVER descriptors for use in object maps and matching new observations to summarized descriptors. Our results show that CLOVER achieves superior performance in static object re-identification under varying lighting conditions and viewpoint changes and can generalize to unseen instances and classes.

Abstract:
We present a method that reduces, by an order of magnitude, the time and memory needed to train multi-task vision-language robotic diffusion policies. This improvement arises from a previously underexplored distinction between action diffusion and the image diffusion techniques that inspired it: In image generation, the target is high-dimensional. By contrast, in action generation, the dimensionality of the target is comparatively small, and only the image condition is high-dimensional. Our approach, Mini-Diffuser, exploits this asymmetry by introducing two-level minibatching, which pairs multiple noised action samples with each vision-language condition, instead of the conventional one-to-one sampling strategy. To support this batching scheme, we introduce architectural adaptations to the diffusion transformer that prevent information leakage across samples while maintaining full conditioning access. In RLBench simulations, Mini-Diffuser achieves 95% of the performance of state-of-the-art multi-task diffusion policies, while using only 5% of the training time and 7% of the memory. Real-world experiments further validate that Mini-Diffuser preserves the key strengths of diffusion-based policies, including the ability to model multimodal action distributions and produce behavior conditioned on diverse perceptual inputs.
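
The two-level minibatching idea, in which several noised action samples share one vision-language condition, can be sketched at the data level as follows. This is a minimal sketch under stated assumptions: the function name, the toy noise schedule, and the sample layout are hypothetical, and the architectural changes that prevent leakage across samples are not modeled.

```python
import random


def make_two_level_batch(actions, k, num_steps=100, seed=0):
    """Return k independently noised copies of each action.

    Each copy is tagged with cond_idx, the index of the single shared
    condition it belongs to, so a condition only needs to be encoded once.
    """
    rng = random.Random(seed)
    samples = []
    for cond_idx, action in enumerate(actions):
        for _ in range(k):
            t = rng.randrange(1, num_steps + 1)       # diffusion step in [1, num_steps]
            alpha_bar = 1.0 - t / (num_steps + 1)     # toy schedule, stays inside (0, 1)
            noise = [rng.gauss(0.0, 1.0) for _ in action]
            noised = [alpha_bar ** 0.5 * a + (1.0 - alpha_bar) ** 0.5 * n
                      for a, n in zip(action, noise)]
            samples.append({"cond_idx": cond_idx, "t": t,
                            "noised_action": noised, "noise": noise})
    return samples


# Level 1: three conditions, each with one low-dimensional target action.
actions = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.5]]
# Level 2: four noised action samples per condition.
batch = make_two_level_batch(actions, k=4)
```

Because the action vectors are small, multiplying the number of noised samples per condition adds little data, while the high-dimensional condition is still processed only once per entry of the first level.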

Abstract:
Data-driven methods have shown great potential in solving challenging manipulation tasks; however, their application in the domain of deformable objects has been constrained, in part, by the lack of data. To address this lack, we propose PokeFlex, a dataset featuring real-world multimodal data that is paired and annotated. The modalities include 3D textured meshes, point clouds, RGB images, and depth maps. Such data can be leveraged for several downstream tasks, such as online 3D mesh reconstruction, and it can potentially enable underexplored applications such as the real-world deployment of traditional control methods based on mesh simulations. To deal with the challenges posed by real-world 3D mesh reconstruction, we leverage a professional volumetric capture system that allows complete 360° reconstruction. PokeFlex consists of 18 deformable objects with varying stiffness and shapes. Deformations are generated by dropping objects onto a flat surface or by poking the objects with a robot arm. Interaction wrenches and contact locations are also reported for the latter case. Using different data modalities, we demonstrate a use case for our dataset by training models that, given the novelty of the multimodal nature of PokeFlex, constitute, to the best of our knowledge, the state of the art in multi-object online template-based mesh reconstruction from multimodal data. We refer the reader to our website or the supplementary material for further demos and examples.

Abstract:
Robotic contact manipulation involves applying controlled forces at contact points to guide an object along a desired trajectory while respecting the underlying physical interactions. This paper presents a novel framework that integrates dynamic modeling and Reinforcement Learning (RL) to achieve robust object pushing with a redundant robotic manipulator. First, a comprehensive dynamic contact model is formulated, incorporating unilateral constraints and a box friction model to capture the nonlinearities present in real-world contact dynamics. Second, the model is extended to handle multiple simultaneous point contacts, enabling effective trajectory planning and tracking for redundant robotic manipulators in multi-contact pushing tasks. Third, an RL strategy is introduced as a residual module that augments a model-based controller to improve pushing performance. Simulation and real-world experiments with a Kinova Gen2 arm demonstrate that the proposed method achieves accurate trajectory following and stable contact interactions, significantly outperforming traditional PD control strategies in dynamic pushing scenarios.

Abstract:
With the rapid development of autonomous driving technologies, accurate scene perception has become essential for safe and efficient navigation. The key perception tasks such as lane detection, semantic segmentation of road markings and road area, and object detection directly impact vehicle decision-making and obstacle avoidance. However, most existing methods are trained on a single-task dataset, limiting data diversity and reducing performance in complex scenarios or under occlusion and illumination variation. In this work we propose a multi-task perception network based on image sequence input, integrating lane detection, road marking and road area segmentation, and object detection into a unified framework. The network model employs multi-task learning to share features and improve the computational efficiency, and adopts the cross-dataset training paradigm to enhance generalization across tasks. Furthermore, the temporal information from adjacent frames is leveraged to compensate for visual degradation in the current frame. Experimental results obtained on multiple datasets demonstrate that the proposed technique achieves competitive performance compared to state-of-the-art approaches. Code is available at https://github.com/(removed_for_review).

Abstract:
This letter presents a novel strategy for collision-free path planning in robotic manipulators. The method operates in two stages: first, a sampling-based exploration of the configuration space is performed to construct a safe corridor composed of axis-aligned bounding boxes. Within this corridor, an optimisation-based trajectory generation phase addresses the channel problem by computing smooth joint trajectories as non-rational splines. A multi-objective cost function is minimised to reduce geometric acceleration along the path, while also maximising the distance from obstacles to improve safety margins. The proposed algorithm is general and applicable to a wide range of kinematic structures, and supports user-defined path degree and geometric continuity. Simulation results demonstrate superior performance compared to existing methods, and experimental validation further confirms its practical effectiveness. Our implementation is open-sourced and available on GitHub.

Abstract:
Place recognition is the foundation for autonomous systems to achieve independent decision-making and secure operation. It is also crucial in tasks such as loop closure detection and global localization in Simultaneous Localization and Mapping (SLAM) technology. Existing LiDAR-based place recognition (LPR) methods use raw point cloud representations or multifarious point cloud representations as inputs, as well as employ convolutional neural networks or transformer architectures. However, the recently proposed Mamba deep learning model combined with State Space Models (SSMs) has enormous potential in long sequence modeling. Therefore, we have developed a novel place recognition network OverlapMamba, which represents input range views (RVs) as sequences. In a novel way, we use a stochastic reconstruction method to establish shifted state space models to compress the visual representation. Extensive experiments on three public datasets demonstrate that OverlapMamba achieves competitive performance with real-time inference speed, which effectively detects loop closure even when traversing previously visited locations from different directions, indicating its strong place recognition ability and real-time efficiency.

Abstract:
Continuum robots offer unique advantages for applications such as minimally invasive surgery, navigation through confined environments, and safe human-robot interaction. However, while most continuum robot segments are designed to exhibit constant curvature over their length, they passively deform into a non-constant curvature s-shape when holding payloads at the tip, and their dynamic movement is often subject to unwanted vibration of the passive non-constant curvature modes. In this paper, we propose a simple solution to dramatically improve these issues: a continuum robot segment design that utilizes a diagonal backbone and flexible push-pull actuation rods. This simple modification to common continuum-robot construction enables us to eliminate the passive s-shaped mode, creating a bending segment that can handle large loads without significant deformation or vibration while requiring no more actuation force than conventional designs. We show that a modified version of 1-DOF constant-curvature kinematics accurately describes the structure when actuator translations are equal and opposite. We also develop and validate a 2-DOF model that predicts tip position and orientation resulting from more general actuation inputs. The models and increased output stiffness were verified experimentally and the concept was demonstrated on a multi-segment robot following a 3D trajectory with minimal disturbance from added loads.

Abstract:
LiDAR-inertial odometry (LIO) has been widely applied in intelligent robotics and autonomous driving, providing high-precision and low-latency ego-motion estimation. However, the massive point clouds generated by LiDAR introduce intensive data processing demands, making k-nearest neighbor (KNN) search and map update a critical bottleneck that limits the real-time performance of the LIO system. This paper proposes a novel data structure, the hkd-Tree, which uses hashed voxel indices as keys and local k-d trees as values. It combines the localized search advantages of voxel-based methods with the efficient search capability of k-d trees, enabling fast KNN search and point cloud insertion. To further improve the performance of the hkd-Tree, we propose a voxel distribution mechanism and buffered update strategy, where each new point is assigned to neighboring voxels within the search radius and inserted into the local k-d trees via parallel batch updates. We develop a LiDAR-inertial odometry system, LIO-HKDT, based on the proposed hkd-Tree. Extensive experiments demonstrate that the hkd-Tree enables highly efficient point cloud search and insertion. LIO-HKDT achieves comparable accuracy to state-of-the-art LIO systems while significantly improving runtime efficiency.
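
The combination of hashed voxel keys with per-voxel local search, together with assigning each new point to neighboring voxels within the search radius, can be sketched as below. This is a simplified illustration and not the hkd-Tree implementation: a plain Python list with brute-force search stands in for each local k-d tree, buffered and parallel updates are omitted, and the class name and the restriction `search_radius <= voxel_size` are assumptions of the sketch.

```python
import math
from itertools import product


class HashedVoxelMap:
    """Hash map from integer voxel index to the points relevant to that voxel."""

    def __init__(self, voxel_size=1.0, search_radius=0.5):
        if search_radius > voxel_size:
            raise ValueError("this sketch assumes search_radius <= voxel_size")
        self.voxel_size = voxel_size
        self.search_radius = search_radius
        self.voxels = {}  # (ix, iy, iz) -> list of points (stand-in for a local k-d tree)

    def _key(self, p):
        return tuple(math.floor(c / self.voxel_size) for c in p)

    def _dist_to_voxel(self, p, key):
        # Euclidean distance from point p to the axis-aligned box of voxel `key`.
        d2 = 0.0
        for c, k in zip(p, key):
            lo = k * self.voxel_size
            hi = lo + self.voxel_size
            d = max(lo - c, 0.0, c - hi)
            d2 += d * d
        return math.sqrt(d2)

    def insert(self, p):
        # Store p in its own voxel and in every adjacent voxel within the search radius.
        base = self._key(p)
        for offset in product((-1, 0, 1), repeat=3):
            key = tuple(b + o for b, o in zip(base, offset))
            if self._dist_to_voxel(p, key) <= self.search_radius:
                self.voxels.setdefault(key, []).append(p)

    def knn(self, q, k):
        # Because of the insertion rule, one hash lookup yields all candidates.
        candidates = self.voxels.get(self._key(q), [])
        in_range = [p for p in candidates if math.dist(p, q) <= self.search_radius]
        return sorted(in_range, key=lambda p: math.dist(p, q))[:k]


vmap = HashedVoxelMap(voxel_size=1.0, search_radius=0.5)
for point in [(0.9, 0.5, 0.5), (1.1, 0.5, 0.5), (3.0, 3.0, 3.0)]:
    vmap.insert(point)

neighbors = vmap.knn((0.95, 0.5, 0.5), k=2)
```

The query point lies in voxel (0, 0, 0), yet the neighbor at x = 1.1 from the adjacent voxel is still found, because insertion already copied it into every voxel it could serve. The design trades extra memory at insertion time for a single-voxel lookup at query time.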

Abstract:
This article presents a novel vision-based artificial lateral line (ALL) sensor, FlowSight, enhancing the perception capabilities of underwater robots. Through an autonomous vision system, FlowSight allows for simultaneously sensing the speed and direction of local water flow without relying on external auxiliary equipment. Inspired by the lateral line neuromast of fish, a flexible bionic tentacle is designed to sense water flow. Deformation and motion characteristics of the tentacle are modeled and analyzed using bidirectional fluid-structure interaction (FSI) simulation. Upon contact with water flow, the tentacle converts water flow information into elastic deformation information, which is captured and processed into an image sequence by the autonomous vision system. Subsequently, a water flow perception method based on deep neural networks is proposed to estimate the flow speed and direction from the captured image sequence. The perception network is trained and tested using data collected from practical experiments conducted in a controllable swim tunnel. Finally, the FlowSight sensor is integrated into the bionic underwater robot RoboDact, and a closed-loop motion control experiment based on water flow perception is conducted. Experiments conducted in the swim tunnel and water pool demonstrate the feasibility and effectiveness of the FlowSight sensor and the water flow perception method.

Abstract:
Probabilistic collision detection (PCD) is essential in motion planning for robots operating in unstructured environments, where considering sensing uncertainty helps prevent damage. Existing PCD methods mainly used simplified geometric models and addressed only position estimation errors. This paper presents an enhanced PCD method with two key advancements: (a) using superquadrics for more accurate shape approximation and (b) accounting for both position and orientation estimation errors to improve robustness under sensing uncertainty. Our method first computes an enlarged surface for each object that encapsulates its observed rotated copies, thereby addressing the orientation estimation errors. Then, the collision probability is formulated as a chance constraint problem that is solved with a tight upper bound. Both steps leverage the recently developed normal parameterization of superquadric surfaces. Results show that our PCD method is twice as close to the Monte-Carlo sampled baseline as the best existing PCD method and reduces path length by 30% and planning time by 37%, respectively. A Real2Sim2Real pipeline further validates the importance of considering orientation estimation errors, showing that the collision probability of executing the planned path is only 2%, compared to 9% and 29% when considering only position estimation errors or none at all.
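
A Monte-Carlo sampled collision probability of the kind used as a reference above can be illustrated in a much simpler setting. The sketch below uses two spheres and isotropic Gaussian position error only; superquadric shapes and orientation errors, which are central to the abstract, are deliberately left out, and the function name and all numbers are hypothetical.

```python
import math
import random


def mc_collision_probability(robot_center, robot_radius,
                             obstacle_mean, obstacle_radius,
                             sigma, n_samples=10000, seed=0):
    """Estimate P(collision) by sampling the uncertain obstacle position.

    The obstacle center is drawn from N(obstacle_mean, sigma^2 * I); two
    spheres collide when their center distance is at most the sum of radii.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = [m + rng.gauss(0.0, sigma) for m in obstacle_mean]
        if math.dist(robot_center, sample) <= robot_radius + obstacle_radius:
            hits += 1
    return hits / n_samples


p_far = mc_collision_probability((0.0, 0.0, 0.0), 0.5, (10.0, 0.0, 0.0), 0.5, sigma=0.1)
p_touching = mc_collision_probability((0.0, 0.0, 0.0), 0.5, (1.0, 0.0, 0.0), 0.5, sigma=0.1)
p_overlap = mc_collision_probability((0.0, 0.0, 0.0), 0.5, (0.0, 0.0, 0.0), 0.5, sigma=0.01)
```

Sampling of this kind is simple and makes few assumptions, but it needs many samples per query, which is why closed-form bounds are attractive inside a planner.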

Abstract:
Hybrid aerial underwater vehicles (HAUVs) are developing rapidly with the urgent need for joint air-sea observation missions. This paper proposes a novel HAUV that combines a folding wing mechanism and an underwater thrust system with a centralized tail in an inverted triangle configuration. In addition to ensuring underwater and aerial maneuverability, the design's overall streamlined structure minimizes the drag of underwater movement and is more suitable for working in confined spaces. The hydrodynamic performance of the system was evaluated using computational fluid dynamics (CFD) simulation. The results indicate that the folding wing design effectively reduces underwater motion drag by 41.9%. Additionally, the centralized underwater thrust system located at the tail generates sufficient torque to ensure the underwater maneuverability of the HAUV. Field experiments further validate the vehicle's capability to operate in confined environments, execute complex underwater missions, and maintain stable aerial flight. This study provides valuable insights into the drag reduction of HAUV folding wings and the optimization of thruster configuration.

Abstract:
In future operations on the lunar surface, automated vehicles will be required to transport cargo between known locations. Such vehicles must be able to navigate precisely in safe regions to avoid natural hazards, human-constructed infrastructure, and dangerous dark shadows. Rovers must be able to park their cargo autonomously within a small tolerance to achieve a successful pickup and delivery. In this field test, Lidar Teach and Repeat provides an ideal autonomy solution for transporting cargo in this way. A one-tonne path-to-flight rover was driven in a semi-autonomous remote-control mode to create a network of safe paths. Once the route was taught, the rover immediately repeated the entire network of paths autonomously while carrying cargo. The closed-loop performance is accurate enough to align the vehicle to the cargo and pick it up. This field report describes a two-week deployment at the Canadian Space Agency's Analogue Terrain, culminating in a simulated lunar operation to evaluate the system's capabilities. Successful cargo collection and delivery were demonstrated in harsh environmental conditions.

Abstract:
Arc welding induces thermal deformation that continuously displaces the seam path during execution, causing the robot to miss the joint on long seams. We present a real-time seam tracking system with three principal contributions: (1) a constrained ICP registration of live leading-laser scans against a prescan point-cloud prior combined with exponentially decaying spatial propagation; (2) a laser-line detection network retrained on 1,000 arc-on images, raising F1 from 0.59 (prescan baseline) to 0.84 on a held-out arc-on test set; and (3) an asynchronous execution architecture for ensuring that smooth joint commands are sent at the robot's control cycle (40 Hz) even with perception delays or interruptions. Internal testing confirms the system remains in the joint on 97 cm-long seams with less than 2 sec cycle-time overhead. Field deployment improved weld quality acceptance rate from 81% to 95-98%.

Abstract:
Body-mounted LiDAR sensors suffer from systematic blind spots during stair locomotion, creating a partial observability problem that single-step terrain snapshots cannot resolve. We address this with a recurrent locomotion policy for the Unitree Go2 that builds implicit knowledge of stair geometry through a GRU-based recurrent encoder over pointcloud and proprioceptive inputs, enabling robust stair ascent and descent even under occluded LiDAR conditions. Ablation experiments show that masking pointcloud input at inference time causes catastrophic failure on stair terrain and severe performance degradation overall, confirming that implicit stair knowledge is a critical cue for step negotiation rather than a merely complementary signal.

Abstract:
Electrostatic (ES) clutches are promising candidates for wearable and assistive robotics due to their thin, lightweight, and low-power characteristics. However, conventional ES clutches typically suffer from mechanical instability caused by the stick-slip phenomenon, restricting their operation to simple binary (locked or free) modes. In this work, we present a Stick-slip-free Variable Electrostatic (SV-ES) clutch that functions as a high-performance programmable robotic damper. By utilizing a PVC-gel friction layer, the device achieves stable and continuous sliding even under high shear stress (29 N/cm² at 100 V). We demonstrate that this stability allows for precise closed-loop modulation of kinetic friction and motor-free position control. The versatility of the SV-ES clutch is validated through three robotic applications: active motion assistance for a robotic arm, high-fidelity haptic rendering, and programmable impact damping for a robotic leg.

Abstract:
Robotic harvesting of date fruits requires precise grasping force control to prevent tissue damage, yet cultivar-specific biomechanical limits remain absent from the literature. This work presents the first continuous stress-strain characterization of three Saudi date cultivars across three hydration states, translated into validated robotic grasping constraints. A custom parallel-plate compression system emulates two-finger robotic grasping, while a Mask R-CNN vision model provides non-contact geometric measurement with below 5% relative error. A total of 500 samples are tested. Cyclic loading experiments establish elastic strain limits, with conservative operational thresholds of 7% for Ajwa and Barhi and 5% for Sagai. A linear calibration model maps gripper displacement to induced fruit strain, enabling strain-controlled robot commands. Validation using a UR10e manipulator confirms damage-free manipulation at these limits across all cultivars, with residual deformation below 1 mm and strain tracking error below 1%. Future work will integrate vision and force feedback to train machine learning models on these experimentally derived limits, enabling real-time geometry-based gripper control for fully autonomous harvesting.
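
As a rough illustration of the linear calibration step, the sketch below fits strain as a linear function of gripper displacement by ordinary least squares and inverts the fit to obtain a displacement command for a target strain. The function names and the calibration pairs are hypothetical placeholders, not the authors' model or data; only the 7% threshold comes from the abstract.

```python
def fit_line(displacements_mm, strains_pct):
    """Ordinary least-squares fit of strain = slope * displacement + intercept."""
    n = len(displacements_mm)
    mean_d = sum(displacements_mm) / n
    mean_e = sum(strains_pct) / n
    sxx = sum((d - mean_d) ** 2 for d in displacements_mm)
    sxy = sum((d - mean_d) * (e - mean_e)
              for d, e in zip(displacements_mm, strains_pct))
    slope = sxy / sxx
    return slope, mean_e - slope * mean_d

def displacement_for_strain(target_strain_pct, slope, intercept):
    """Invert the calibration to get the gripper command for a target strain."""
    return (target_strain_pct - intercept) / slope

# Made-up calibration pairs for illustration only (not measured data).
d_mm = [0.0, 1.0, 2.0, 3.0]
strain = [0.5, 3.0, 5.5, 8.0]
slope, intercept = fit_line(d_mm, strain)

# Displacement that would induce the 7% threshold quoted for Ajwa and Barhi.
command_mm = displacement_for_strain(7.0, slope, intercept)
```

Inverting the fitted line is what turns a strain limit into a position command, which is the sense in which the robot can be "strain-controlled" without a direct strain sensor.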

Abstract:
Robust all-weather localization is a critical capability for autonomous systems. While 4D mmWave radar offers superior resilience to adverse environmental conditions compared to LiDAR and cameras, its application in high-precision Simultaneous Localization and Mapping (SLAM) is hindered by significant challenges, including severe point cloud sparsity, complex noise characteristics, and the prevalence of dynamic objects. To address these issues, we propose RaCo-SLAM, a robust and real-time 4D mmWave radar SLAM framework with co-visibility consistency. This framework features a novel physics-informed probabilistic model for adaptive feature extraction from sparse and noisy point clouds. For global consistency, we introduce a co-visibility consistency factor (CoVC factor) into the global optimization, moving beyond conventional loop-closure methods. This factor directly minimizes point-to-point registration errors to enforce global consistency and is designed for parallel real-time execution on a standard CPU. Comprehensive evaluation on diverse and challenging real-world datasets demonstrates state-of-the-art accuracy and robustness, achieving real-time performance exceeding 40 Hz on a standard CPU. To benefit the community, the code and collected dataset will be released at https://github.com/sudo-robot0/RaCo-SLAM.

Abstract:
We present Y-MAP-Net, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. Y-MAP-Net simultaneously predicts depth, surface normals, human pose, and semantic segmentation, and generates multi-label captions, all in a single forward pass. To achieve this, we adopt a multi-teacher, single-student training paradigm, where task-specific foundation models supervise the learning of the network, allowing it to distill their capabilities into a unified real-time inference architecture. Y-MAP-Net exhibits strong generalization, architectural simplicity, and computational efficiency, making it well-suited for resource-constrained robotic platforms. By providing rich 3D, semantic, and contextual scene understanding from low-cost RGB cameras, Y-MAP-Net supports key robotic capabilities such as object manipulation and human-robot interaction. To encourage future research and reproducibility, we make our code publicly available.

Abstract:
Perception tasks for navigation in robotics, including aerial platforms such as drones and autonomous driving systems, are inherently structured. Drone-mounted cameras typically capture sky above, terrain below, and obstacles or man-made structures in between, while driving data often contains organized road layouts, lane markings, and surrounding agents. Such data is normally more structured than that of generic image tasks. Motivated by these axis-aligned structural priors, we hypothesize that processing information in a quadtree-esque manner can not only model features effectively in a hierarchical manner, but also offer an efficient linear-time alternative to vanilla attention mechanisms, which run in quadratic time. In this paper, we propose Sibling-Selective Quadtree Attention (SSQA), which models image tokens hierarchically as a structured, full quadtree. We present an analytical complexity analysis that guarantees linear-time feature modeling, in addition to empirical experiments comparing inference speeds with other popular modeling approaches, such as Mamba 2 and Quadtree Attention. Our results, benchmarked across several tasks, show that we achieve results at least as good as, if not notably better than, those of other methods at a fraction of the computational cost.
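
The linear-versus-quadratic claim can be made concrete with a small counting sketch. It assumes, purely for illustration, that every non-root node of a full quadtree interacts only with the four nodes of its sibling group; this is a simplification chosen for counting and is not the SSQA mechanism itself, whose details the abstract does not give. The function names are hypothetical.

```python
def quadtree_sibling_cost(depth):
    """Count pairwise interactions when each node attends only to its sibling group.

    A full quadtree of the given depth has 4**level nodes at each level.
    Assumption: every non-root node interacts with the 4 nodes of its sibling
    group (itself included), so each level costs 4 * 4**level interactions.
    """
    return sum(4 * 4 ** level for level in range(1, depth + 1))

def dense_attention_cost(depth):
    """All-pairs interactions among the 4**depth leaf tokens."""
    n_tokens = 4 ** depth
    return n_tokens * n_tokens

# Print token count, sibling-restricted cost, and dense cost for a few depths.
for depth in (3, 5, 7):
    n = 4 ** depth
    print(n, quadtree_sibling_cost(depth), dense_attention_cost(depth))
```

Under this assumption the total is the geometric sum (16N - 16)/3 for N = 4^depth leaf tokens, which grows linearly in N, whereas all-pairs attention grows as N^2.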

Abstract:
Uncrewed aerial vehicles (UAVs) are increasingly deployed in forest environments for tasks such as environmental monitoring and search and rescue, which require safe navigation through dense foliage and precise data collection. Traditional sensing approaches, including passive multispectral and RGB imaging, suffer from latency, poor depth resolution, and strong dependence on ambient light, especially under forest canopies. In this work, we present a novel event spectroscopy system that simultaneously enables high-resolution, low-latency depth reconstruction and multispectral imaging using a single sensor. Depth is reconstructed using structured light, and by modulating the wavelength of the projected structured light, our system captures spectral information in controlled bands between 650 nm and 850 nm. We demonstrate up to 60% improvement in RMSE over commercial depth sensors and validate the spectral accuracy against a reference spectrometer and commercial multispectral cameras, demonstrating comparable performance. A portable version limited to RGB is used to collect real-world depth and spectral data from the Masoala Rainforest. We demonstrate color image reconstruction and material differentiation between leaves and branches using this spectral and depth data. Our results show that adding depth (available at no extra effort with our setup) to material differentiation improves the accuracy by over 30% compared to a color-only method. Our system, tested in both lab and real-world rainforest environments, shows strong performance in depth estimation, RGB reconstruction, and material differentiation, paving the way for lightweight, integrated, and robust UAV perception and data collection in complex natural environments.

Abstract:
Learning-based controllers are increasingly adopted in lower-extremity powered exoskeletons, yet their advantages over traditional adaptive approaches remain underexplored. We compared two adaptive assist-as-needed (AAN) controllers for gait training with an ankle exoskeleton: a reinforcement learning-based controller (RL-AAN) and a conventional iterative learning controller (ILC-AAN). Both adjusted assistance stride-by-stride, delivering torque as a percentage of the wearer's biological plantarflexion moment (estimated online with a subject-agnostic model) and progressively fading assistance as performance improved. Healthy participants walked on a self-paced treadmill under a perturbed-gait protocol. Performance was assessed as average percent stride-velocity (SV) improvement relative to unassisted perturbed walking (Δ%εSV+) and percent of strides above the SV threshold (N%SV+). During training, RL-AAN and ILC-AAN elicited comparable gains in Δ%εSV+ between the first and last training sessions, but RL-AAN yielded greater adherence across sessions, as indicated by larger N%SV+. After training, RL-AAN demonstrated superior retention in Δ%εSV+ and N%SV+. These results support RL-AAN as a promising strategy for subject-tailored gait training, motivating future studies in neurological and musculoskeletal populations.

Abstract:
In the robotics community, it has been a longstanding challenge for quadrupeds to achieve highly explosive movements similar to their biological counterparts. In this work, we introduce a novel training framework that achieves height-aware and omnidirectional jumping for quadrupedal robots. To facilitate the precise tracking of the user-specified jumping height, our pipeline concurrently trains an estimator that infers the robot and its end-effector states in an online fashion. In addition, a novel reward term is introduced by solving the analytical inverse kinematics with pre-defined end-effector positions. Guided by this term, the robot is able to regulate its posture during the aerial phase. In the comparative studies, we verify that this controller not only achieves the longest relative forward jump distance, but also exhibits the most comprehensive jumping capabilities among all existing jumping controllers. A video summarizing the methodology and the validation in both simulation and real hardware is submitted along with this paper.

Abstract:
Soft robots, compared to rigid robots, possess inherent advantages, including higher degrees of freedom, compliance, and enhanced safety, which have contributed to their increasing application across various fields. Among these benefits, adaptability is particularly noteworthy. In this paper, adaptability in soft robots is categorized into external and internal adaptability. External adaptability refers to the robot's ability to adjust, either passively or actively, to variations in environments, object properties, geometries, and task dynamics. Internal adaptability refers to the robot's ability to cope with internal variations, such as manufacturing tolerances or material aging, and to generalize control strategies across different robots. As the field of soft robotics continues to evolve, the significance of adaptability has become increasingly pronounced. In this review, we summarize various approaches to enhancing the adaptability of soft robots, including design, sensing, and control strategies. Additionally, we assess the impact of adaptability on applications such as surgery, wearable devices, locomotion, and manipulation. We also discuss the limitations of soft robotics adaptability and prospective directions for future research. By analyzing adaptability through the lenses of implementation, application, and challenges, this paper aims to provide a comprehensive understanding of this essential characteristic in soft robotics and its implications for diverse applications.

Abstract:
Morphing quadrotors offer enhanced maneuverability and adaptability in confined spaces, while their structural variations pose challenges to trajectory planning and control. This paper presents a time-optimal trajectory planning and model predictive control framework for the morphing quadrotor. The trajectory generator computes time-optimal trajectories by dynamically adjusting arm lengths, allowing the quadrotor to traverse waypoints as quickly as possible while satisfying constraints. The generated trajectory is then brought into the designed dual-loop model predictive control architecture to achieve autonomous flight, in which the outer loop tracks the desired trajectory and the inner loop synchronously regulates attitude and arm length of the morphing quadrotor. Experimental validation demonstrates that the proposed framework achieves high-precision trajectory tracking, robust dynamic response, and superior adaptability in confined environments.

Abstract:
Nighttime UAV tracking faces significant challenges in real-world robotics operations. Low-light conditions limit visual perception capabilities, and cluttered backgrounds and frequent viewpoint changes cause existing trackers to drift or fail during deployment. To address these difficulties, researchers have proposed solutions based on low-light enhancement and domain adaptation. However, these methods still have notable shortcomings in actual UAV systems: low-light enhancement often introduces visual artifacts, domain adaptation methods are computationally expensive, and existing lightweight designs struggle to fully leverage dynamic object information. Based on an in-depth analysis of these key issues, we propose MATrack, a multiscale adaptive system designed specifically for nighttime UAV tracking. MATrack tackles the main technical challenges of nighttime tracking through the collaborative work of three core modules: the Multiscale Hierarchy Blender (MHB) enhances feature consistency between static and dynamic templates; the Adaptive Key Token Gate accurately identifies object information within complex backgrounds; and the Nighttime Template Calibrator (NTC) ensures stable tracking performance over long sequences. Extensive experiments show that MATrack achieves a significant performance improvement. On the UAVDark135 benchmark, its precision, normalized precision, and AUC surpass state-of-the-art (SOTA) methods by 5.9%, 5.4%, and 4.2%, respectively, while maintaining a real-time processing speed of 81 FPS. Further tests on a real-world UAV platform validate the system's reliability, demonstrating that MATrack can provide stable and effective nighttime UAV tracking support for critical robotics applications such as nighttime search and rescue and border patrol.

Abstract:
This paper presents a novel infinite-horizon Partially Observable Markov Decision Process (POMDP) framework with adaptive sliding mode control (ASMC) for autonomous navigation of balloons. The proposed method integrates an altitude controller designed to account for thermodynamic and real-wind field constraints with an infinite-horizon POMDP for wind-optimal navigation. First, an adaptive sliding mode controller is developed to ensure the balloon's internal stability under uncertainties in pressure, external wind fields, and temperature. Subsequently, a reference strategy is formulated using the infinite-horizon POMDP to exploit wind dynamics for station-keeping. The system estimates wind direction in real time and computes actions based on these observations. Experimental results demonstrate the framework's ability to converge on efficient navigation policies while compensating for partial observability of wind dynamics. This approach is particularly suited for aerial or underwater vehicles operating in stratified flow environments, offering a computationally tractable solution for real-world deployment.

Abstract:
We introduce real-is-sim, a new approach to integrating simulation into behavior cloning pipelines. In contrast to real-only methods, which lack the ability to safely test policies before deployment, and sim-to-real methods, which require complex adaptation to cross the sim-to-real gap, our framework allows policies to seamlessly switch between running on real hardware and running in parallelized virtual environments. At the center of real-is-sim is a dynamic digital twin, powered by the Embodied Gaussian simulator, that synchronizes with the real world at 60 Hz. This twin acts as a mediator between the behavior cloning policy and the real robot. Policies are trained using representations derived from simulator states and always act on the simulated robot, never the real one. During deployment, the real robot simply follows the simulated robot's joint states, and the simulation is continuously corrected with real-world measurements. This setup, where the simulator drives all policy execution and maintains real-time synchronization with the physical world, shifts the responsibility of crossing the sim-to-real gap to the digital twin's synchronization mechanisms, instead of the policy itself. We demonstrate real-is-sim on a long-horizon manipulation task (PushT), showing that virtual evaluations are consistent with real-world results. We further show how real-world data can be augmented with virtual rollouts and compare to policies trained on different representations derived from the simulator state, including object poses and rendered images from both static and robot-mounted cameras. Our results highlight the flexibility of the real-is-sim framework across training, evaluation, and deployment stages.

Abstract:
Thermal imaging, with its all-weather capabilities and strong penetration, enables 3D reconstruction in low-light and adverse conditions. In this paper, we investigate RGB-independent pure 3D thermal reconstruction, aiming to overcome the challenges of 3D reconstruction in extreme environments where RGB images are unavailable. However, directly applying visible-light 3D reconstruction methods to thermal images often leads to severe artifacts due to two key challenges: (i) thermal images lack rich textures, hindering detail reconstruction, and (ii) heat conduction causes intensity diffusion, resulting in blurred edges. To address these issues, we propose TI-3DGS, a novel 3D Gaussian Splatting framework guided by thermal imaging. We introduce a Thermal Imaging Field (TIF) to model radiance in thermal domains and a Thermal Attenuation-aware Density Control (TADC) strategy to densify sparse point clouds from low-texture thermal inputs. Additionally, we incorporate an edge-enhancement constraint to mitigate blur from heat diffusion. Extensive experiments on the TI-NSD dataset, covering indoor and outdoor scenarios, show that our TI-3DGS achieves state-of-the-art performance, effectively overcoming texture sparsity and edge degradation in thermal reconstruction.

Abstract:
LiDAR-based 3D semantic segmentation is a critical task in autonomous driving, but its scalability is limited by the reliance on large-scale labeled datasets. Semi-supervised learning (SSL) offers a potential solution by leveraging unlabeled data. However, most existing SSL segmentation methods are designed for mechanical spinning LiDAR (MSLR) and fail to generalize well to solid-state LiDAR (SSLR) due to different scanning patterns and point cloud distributions. To address this challenge, we propose SSLiMix, a novel semi-supervised segmentation method with checkerboard mixing for solid-state LiDAR. Unlike prior MSLR-oriented methods, SSLiMix employs 2D grid partitioning with checkerboard mixing to adapt to SSLR's dense and uniform point clouds, thereby preserving spatial consistency even when beam-based augmentations fail. Additionally, we introduce a hierarchical confidence-aware pseudo-labeling mechanism (HCAP), which classifies pseudo-labels by confidence and applies targeted processing to enhance pseudo-label reliability. Experiments on the PandaSet dataset show that SSLiMix improves mIoU by 11.3% over the fully supervised baseline using only 1% labeled data, demonstrating its effectiveness in low-label regimes and providing a strong benchmark for semi-supervised SSLR segmentation.
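
To illustrate what 2D grid partitioning with checkerboard mixing might look like, the sketch below bins two scans on a planar grid and takes alternating cells from each scan. It is a minimal sketch under assumptions: the grid is laid over the (x, y) coordinates, the cell size is arbitrary, and the function name `checkerboard_mix` is hypothetical. The actual SSLiMix partitioning and mixing rules are not specified in the abstract.

```python
import math

def checkerboard_mix(scan_a, scan_b, cell_size=1.0):
    """Mix two labeled scans cell by cell in a checkerboard pattern.

    Each scan is a list of (x, y, z, label) tuples. Points are binned on a
    2D grid over (x, y); 'even' cells are filled from scan_a and 'odd' cells
    from scan_b. The grid axes and cell_size are illustrative assumptions.
    """
    def is_even_cell(point):
        col = math.floor(point[0] / cell_size)
        row = math.floor(point[1] / cell_size)
        return (col + row) % 2 == 0

    kept_a = [p for p in scan_a if is_even_cell(p)]
    kept_b = [p for p in scan_b if not is_even_cell(p)]
    return kept_a + kept_b

# Two tiny made-up scans; in a semi-supervised setting the labels of one
# scan would typically be pseudo-labels rather than ground truth.
scan_a = [(0.5, 0.5, 0.0, "road"), (1.5, 0.5, 0.0, "road")]
scan_b = [(0.5, 0.4, 1.0, "car"), (1.5, 0.6, 1.0, "car")]
mixed = checkerboard_mix(scan_a, scan_b)
```

Because the partition depends only on spatial position and not on beam index, a scheme of this kind does not rely on the ring structure of a spinning LiDAR.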

Abstract:
The development of camera-based real-time 3D perception networks for edge devices is essential for embodied systems such as autonomous vehicles and robots. However, existing methods often demand substantial computational resources and tend to overlook performance on resource-constrained devices. In this paper, we propose RT-BEVFormer, a simple yet effective multi-task 3D perception framework designed for efficiency. Based on BEVFormer, RT-BEVFormer enhances the feature extraction capability of the backbone and redesigns the spatial cross-attention module in the encoder, guided by two key observations: 1) the computational load and total number of parameters are dominated by the backbone, and 2) the sampling process within the deformable attention module is a primary bottleneck. Specifically, we leverage powerful foundation models to distill their rich and comprehensive knowledge, thereby crafting a highly efficient student backbone. This allows RT-BEVFormer to achieve significant performance gains without incurring additional latency. Furthermore, we introduce an efficient static sampling method. This approach replaces the dynamic, deployment-unfriendly sampling of standard spatial cross-attention, allowing the model to focus on salient image features with minimal overhead. On the widely used edge device NVIDIA Jetson Orin, RT-BEVFormer outperforms the previous state-of-the-art model in both accuracy and inference speed. Extensive experiments on the nuScenes dataset show that each component of our framework is effective in both inference speed and overall accuracy. Finally, as RT-BEVFormer is implemented without any model-specific custom plugin, it ensures superior flexibility and ease of deployment.

Abstract:
Robotic planning tasks often involve diverse complexities, which make adaptive improvement through reflection particularly challenging. Existing LLM-based approaches typically rely on fixed routines, lacking the ability to adjust to task-specific complexity and often leading to redundant reflections. To address this, we propose DyRef, a dynamic reflection framework that models tasks as a Diagnostic Graph, measures task complexity through structural factors, and routes them through a Reflection Toolkit via a learned Routing Policy network. This design enables tailored reflection strategies that reduce redundancy and improve reasoning efficiency. Experiments in AlfWorld and on real-world robotic platforms show that DyRef improves first-trial success rates by 16.1% while reducing redundant reflections by 64.4%.

Abstract:
Global data association is an essential prerequisite for robot operation in environments seen at different times or by different robots. Repetitive or symmetric data creates significant challenges for existing methods, which typically rely on maximum likelihood estimation or maximum consensus to produce a single set of associations. However, in these ambiguous scenarios, the distribution of solutions to global data association problems is often highly multimodal, and such single-solution approaches frequently fail. In this work, we introduce a data association framework that leverages approximate Bayesian inference to capture multiple solution modes to the data association problem, thereby avoiding premature commitment to a single solution under ambiguity. Our approach represents hypothetical solutions as particles that evolve via deterministic or randomized updates, naturally parallelizable on GPUs, to cover the modes of the underlying solution distribution. Simulated and real-world experiments with highly ambiguous data show that our method correctly estimates the distribution over transformations when registering point clouds or object maps. Code is available at: https://github.com/mit-acl/mmda.

Abstract:
We propose a hybrid grasp synthesis framework that combines a learning-based Energy Based Model (EBM) with an analytical Iterative Closest Point (ICP) method to generate robust grasps from partially observed point clouds. The learned energy function acts as a prior within a Stein Variational Gradient Descent (SVGD) framework, guiding iterative refinement of grasp configurations. Evaluated on 67 objects with 5,360 grasp attempts, our method achieves an average success rate of 60.9%, outperforming AnyGrasp (31.1%), Grasp Pose Detection (48.4%), and AS-ICP (56.6%). These results highlight the strong generalization ability of our approach and demonstrate how combining data-driven learning with geometric optimization addresses the limitations of either strategy in isolation.
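
For readers unfamiliar with SVGD, the toy sketch below shows the particle update on a one-dimensional problem. A hand-written quadratic energy stands in for the learned EBM, the particles are scalars rather than grasp poses, and the ICP component is omitted entirely. The energy, step size, kernel bandwidth, and function names are all assumptions for illustration and are not taken from the paper.

```python
import math

def svgd_step(particles, grad_energy, step=0.1, bandwidth=1.0):
    """One SVGD update for 1-D particles targeting p(x) proportional to exp(-E(x))."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth ** 2))  # RBF kernel
            attract = -grad_energy(xj) * k           # pulls toward low energy
            repel = -(xj - xi) / bandwidth ** 2 * k  # derivative of k w.r.t. xj; keeps particles apart
            phi += attract + repel
        updated.append(xi + step * phi / n)
    return updated

def toy_grad_energy(x):
    """Gradient of the stand-in energy E(x) = 0.5 * (x - 2)^2 (not a learned model)."""
    return x - 2.0

# Start with 11 particles spread over [-1, 1] and refine them iteratively.
particles = [-1.0 + 0.2 * i for i in range(11)]
for _ in range(1000):
    particles = svgd_step(particles, toy_grad_energy)

mean_position = sum(particles) / len(particles)
spread = max(particles) - min(particles)
```

The attraction term moves every candidate toward low-energy regions, while the kernel-gradient term pushes candidates apart, so the set refines toward good solutions without collapsing onto a single one.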

Abstract:
Mechanical stimulation is essential for regulating cellular processes such as proliferation, differentiation, and apoptosis. Magnetic microrobot swarms offer a promising platform for delivering targeted mechanical stimulation to cells via remote actuation under rotating magnetic fields. However, magnetic fields globally activate swarms in non-target regions, risking undesired biological effects. To overcome this limitation, we propose a spatially selective magnetic actuation strategy that confines mechanical stimulation to user-defined regions. A dual-robotic-arm magnetic actuation system is developed to generate a selective rotating magnetic field. The field ensures swarms have smooth rotation and longer chain formation within the target area, enabling effective mechanical stimulation, while swarms outside exhibit shortened chains and disordered motion. We further demonstrate that the area swept by the rotating chain of microrobots peaks within the targeted region but drops sharply beyond it. This approach provides a foundation for precise mechanostimulation in biomedical applications with minimal off-target effects.

Abstract:
Structure from Motion (SfM) estimates camera poses and reconstructs point clouds, forming a foundation for various tasks. However, applying SfM to driving scenes captured by multi-camera systems presents significant difficulties, including unreliable pose estimation, excessive outliers in road surface reconstruction, and low reconstruction efficiency. To address these limitations, we propose a Multi-camera Reconstruction and Aggregation Structure-from-Motion (MRASfM) framework specifically designed for driving scenes. MRASfM enhances the reliability of camera pose estimation by leveraging the fixed spatial relationships within the multi-camera system during the registration process. To improve the quality of road surface reconstruction, our framework employs a plane model to effectively remove erroneous points from the triangulated road surface. Moreover, treating the multi-camera set as a single unit in Bundle Adjustment (BA) helps reduce optimization variables to boost efficiency. In addition, MRASfM achieves multi-scene aggregation through scene association and assembly modules in a coarse-to-fine fashion. We deployed multi-camera systems on actual vehicles to validate the generalizability of MRASfM across various scenes and its robustness in challenging conditions through real-world applications. Furthermore, large-scale validation results on public datasets show the state-of-the-art performance of MRASfM, achieving 0.124 absolute pose error on the nuScenes dataset. The code is available at https://github.com/IRMVLab/MRASfM.

Abstract:
In ex-situ object rearrangement tasks within open environments, robots face significant challenges due to the increased cost of moving objects over large workspaces. To address this issue, we propose a hierarchical reinforcement learning-based approach that takes into account the supportive relationships and semantic correlations between objects. The robot groups and stacks objects with compatible supportive capabilities, moving them together to their target locations to optimize task execution. Specifically, we use a large language model to assess the supportive relationships and semantic correlations between objects. In the high-level decision-making process, objects are grouped based on their supportive capabilities, while the low-level process refines these groupings using a graph capsule convolutional network. Experimental results demonstrate that our approach not only reduces the number of movements required but also improves task efficiency and significantly decreases task completion time by approximately 50%, compared to methods that do not consider supportive relationships.

Abstract:
Understanding the motion of articulated mechanical assemblies from static geometry remains a core challenge in 3D perception and design automation. Prior work on everyday articulated objects such as doors and laptops typically assumes simplified kinematic structures or relies on joint annotations. However, in mechanical assemblies like gears, motion arises from geometric coupling, through meshing teeth or aligned axes, making it difficult for existing methods to reason about relational motion from geometry alone. To address this gap, we introduce MechBench, a benchmark dataset of 693 diverse synthetic gear assemblies with part-wise ground-truth motion trajectories. MechBench provides a structured setting to study coupled motion, where part dynamics are induced by contact and transmission rather than predefined joints. Building on this, we propose DYNAMO, a dependency-aware neural model that predicts per-part SE(3) motion trajectories directly from segmented CAD point clouds. Experiments show that DYNAMO outperforms strong baselines, achieving accurate and temporally consistent predictions across varied gear configurations. Together, MechBench and DYNAMO establish a novel systematic framework for data-driven learning of coupled mechanical motion in CAD assemblies.

Abstract:
In complex environments, traditional path planning methods rely on manually defined models, requiring tedious adjustments under varying scenarios or constraints. They also suffer from unstable time overhead and exponentially increasing computational costs as environmental complexity grows. Deep learning-enhanced methods, while optimizing decisions via neural networks, remain constrained by explicit search/sampling frameworks, which leads to unstable real-time performance and failure to capture real-world trajectory distributions. In contrast, diffusion-based planning directly learns trajectory distributions from data, offering predictable inference latency via fixed inversion steps and inherent support for multimodal solutions. However, its lack of explicit safety constraints often leads to trajectory safety issues, resulting in planning failures. To address these limitations, this paper proposes ST-DiffPlanner, a global path planner following the pipeline of "topology cognition, direction focusing, trajectory generation". It introduces three targeted optimizations: (1) leveraging topological awareness to constrain the diffusion model to focus on collision-free regions; (2) optimizing inference-phase projection to ensure trajectory continuity and safe distances from obstacles; (3) designing a topology anchor-based safety loss to enhance model safety and training stability. Experimental results demonstrate that ST-DiffPlanner exhibits strong generalization across multiple scenarios and modalities, accurately capturing environmental features and learning task-compliant trajectory characteristics. Our method achieves an average trajectory generation success rate of 96.9%, significantly outperforming baseline methods. Moreover, validation in both simulated and real-world robot platforms confirms its applicability across different systems.

Abstract:
Autonomous driving has rapidly advanced with diverse sensors, especially Light Detection and Ranging (LiDAR), which provides precise geometry for tasks like simultaneous localization and mapping (SLAM). Recently, the performance of LiDAR-based SLAM has improved through studies leveraging intensity as a complementary cue to depth. However, in urban environments, dynamic objects occlude static scenes, degrading the stability and accuracy of LiDAR-based SLAM. While previous studies have focused mainly on completing occluded depth, they often disregard intensity, assuming it to be less critical or difficult to estimate due to inherent noise. This overlooks the strong complementary relationship between the two modalities, which can be exploited for effective multimodal completion. Furthermore, completing intensity alongside depth enables broader applicability to intensity-aware perception tasks. To address this issue, a Multimodal Mutual-Guidance (M2G) module is proposed for the joint completion of occluded depth and intensity in LiDAR data. M2G is integrated into a deep learning-based network that takes projected range and intensity images as input, enabling progressive cross-modal feature interaction. Leveraging the shared origin of LiDAR depth and intensity, M2G balances noisy intensity and smooth depth via attention and structure-aware guidance. Experimental results demonstrate that the proposed method outperforms existing inpainting and depth completion approaches, validating its effectiveness for LiDAR completion.

Abstract:
Object-oriented embodied navigation tasks require agents to locate specific objects, defined either by category or by images, in unseen environments. While recent methods have made progress in extending closed-set models to open-vocabulary scenarios with foundation models, they typically rely on training-free large language models (LLMs) or finetuning with end-to-end reinforcement learning (RL). However, they face challenges in efficiency (e.g., the overhead and cost of LLM inference) and limited generalization from intensive RL training. In this paper, we propose OVExp, a training-efficient framework for open-vocabulary exploration. We make the first effort to demonstrate the generalization capabilities of semantic map-based goal prediction networks using Dense CLIP models. A major challenge is that preserving both precise point-wise object locations and generalizable visual representations in the semantic map leads to unaffordable training costs. To address this, we design a Cross-Modal Transfer on Semantic Mapping strategy, which trains on text only and transfers to multi-modal semantic mapping and goals at test time. Despite relying on text-based spatial layouts with limited objects, OVExp demonstrates robust generalization to unseen targets on established ObjectNav benchmarks.

Abstract:
Swarm perception enables a robot swarm to collectively sense and understand the environment by integrating sensory inputs from individual robots. We explore its application to person re-identification (re-id), the task of recognizing previously observed individuals. Traditional re-id systems rely on static offline galleries, which restricts their use in open-world scenarios where new identities appear over time. In robotics, most methods address single-robot re-id in person-following tasks, limiting scalability to multi-person settings, while swarm perception studies largely overlook the role of re-id algorithms. To address these gaps, we propose Swarm-ReID, an unsupervised method for decentralized swarm re-identification. Our method introduces mechanisms for robot-to-robot communication and informed movement strategies, enabling the swarm to collaboratively construct adaptive galleries online without centralized control. Simulations across diverse environments, numbers of people, swarm sizes, communication protocols, and exploration behaviors show that Swarm-ReID consistently outperforms existing swarm perception methods. Our results highlight how communication and informed movement improve recognition performance, establishing Swarm-ReID as a state-of-the-art method for open-world multi-robot person re-identification.

Abstract:
Contactless diagnosis of musculoskeletal disorders can potentially improve population health as well as robot behaviours in collaborative settings. However, current diagnosis methods require an in-person physical examination in which a trained physician senses, through contact, the force applied by various muscles. Simulation tools exist, but their use for diagnosis with real data is under-explored. In this paper, we propose an algorithm for identifying which upper-limb muscle group is fatigued. Our algorithm compares the real-world free-space motion of the subject with that of a simulated musculoskeletal model, and is therefore contactless, avoiding the need for invasive sensing or in-person assessment. Our algorithm simulates various fatigue conditions using a physics-based musculoskeletal model and extracts diagnostic motion features from both real and simulated data, which are compared for diagnosis. Experimental results on real data demonstrate that the proposed method can reliably distinguish fatigue among multiple muscle groups. Additionally, through comprehensive performance comparisons, we show how recent advanced musculoskeletal simulators can be properly configured to address the sim-to-real gap in the context of the fatigue diagnosis task. Our approach can potentially spur further research in remote and automated diagnosis, significantly lowering the barrier to large-scale and early detection.

Abstract:
This work proposes a systematic workflow for constructing grid-based Linear Parameter-Varying (LPV) models from frequency response data. Transfer functions are estimated at multiple scheduling-parameter grid points, fitted with a fixed model order, and transformed into controllable canonical realizations to ensure structural consistency. These vertex models are interpolated into an LPV state-space representation, while robust stability is verified using the Edge Theorem, which reduces the problem to checking edge polynomials of the convex hull. The novelty of the approach lies in integrating frequency-domain identification, canonical-form embedding, and polytope-based robust stability analysis into a unified LPV framework. Unlike conventional methods that rely on time-domain experiments or subspace techniques, the proposed method exploits experimentally accessible frequency-response data and avoids coordinate mismatches during interpolation. Validation on a precision motion system demonstrates both theoretical soundness and practical applicability, confirming the workflow as a reliable pathway from frequency-domain data to robust LPV control design.
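
As a rough illustration of the edge-based stability check mentioned above, the sketch below samples the segments between hypothetical third-order vertex polynomials and applies the Routh-Hurwitz conditions at each sample. The vertex coefficients, the sample count, and the function names are assumptions made for this example and are not taken from the paper. Checking every pair of vertices covers a superset of the exposed edges, and dense sampling only approximates an exact edge test.

```python
from itertools import combinations

def is_hurwitz_cubic(a3, a2, a1, a0):
    """Routh-Hurwitz test for a3*s^3 + a2*s^2 + a1*s + a0 (leading coefficient > 0)."""
    return a3 > 0 and a2 > 0 and a0 > 0 and a2 * a1 > a3 * a0

def edges_stable(vertices, samples=201):
    """Sample each segment between two vertex polynomials and test every sample.

    vertices: list of (a3, a2, a1, a0) coefficient tuples (hypothetical data).
    Returns False as soon as one sampled polynomial fails the test.
    """
    for p, q in combinations(vertices, 2):
        for k in range(samples):
            lam = k / (samples - 1)
            coeffs = [(1 - lam) * pc + lam * qc for pc, qc in zip(p, q)]
            if not is_hurwitz_cubic(*coeffs):
                return False
    return True

# Hypothetical vertex models: (s+1)(s+2)(s+3), (s+1)^3, and s^3+4s^2+6s+4.
stable_vertices = [(1, 6, 11, 6), (1, 3, 3, 1), (1, 4, 6, 4)]
# Adding s^3+s^2+s+2, which violates a2*a1 > a3*a0, makes the check fail.
mixed_vertices = stable_vertices + [(1, 1, 1, 2)]

print(edges_stable(stable_vertices))
print(edges_stable(mixed_vertices))
```

Keeping every vertex model at the same order and in the same canonical form is what makes the coefficient-wise interpolation in `edges_stable` meaningful.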

Abstract:
The capability of multi-robot SLAM approaches to merge localization history and maps from different observers is often challenged by the difficulty in establishing data association. Loop closure detection between perceptual inputs of different robotic agents is easily compromised in the context of perceptual aliasing, or when perspectives differ significantly. For this reason, direct mutual observation among robots is a powerful way to connect partial SLAM graphs, but often relies on the presence of calibrated arrays of fiducial markers (e.g., AprilTag arrays), which severely limits the range of observations and frequently fails under sharp lighting conditions, e.g., reflections or overexposure. In this work, we propose a novel solution to this problem leveraging recent advances in Deep-Learning-based 6D pose estimation. We feature markerless pose estimation as part of a decentralized multi-robot SLAM system and demonstrate the benefit to the relative localization accuracy among the robotic team. The solution is validated experimentally on data recorded in a field test campaign in a planetary analog environment.

Abstract:
Conventional soft pneumatic actuators, typically based on hollow elastomeric chambers, often suffer from limited structural support and require costly geometry-specific redesigns for multimodal functionality. Porous materials, such as foam, filled into chambers, can provide structural stability to the actuators. However, methods to achieve programmable deformation by tailoring the porous body itself remain underexplored. In this paper, a novel design method is presented to realize soft porous actuators with programmable deformation by incising specific patterns into the porous foam body. This approach introduces localized structural anisotropy of the foam, guiding the material's deformation under a global vacuum input. Furthermore, three fundamental patterns on a cylindrical foam substrate are discussed: transverse for bending, longitudinal for tilting, and diagonal for twisting. A computational model is built with Finite Element Analysis (FEA) to investigate the mechanism of the incision-patterning method. Experiments demonstrate that with a potential optimal design of the pattern array number N, actuators can achieve bending up to 80° (N=2), tilting of 18° (N=1), and twisting of 115° (N=8). The versatility of our approach is demonstrated via pattern transferability, scalability, and mold-less rapid prototyping of complex designs. As a comprehensive application, we translate the human hand crease map into a functional incision pattern, creating a bio-inspired soft gripper capable of human-like adaptive grasping. Our work provides a new, efficient, and scalable paradigm for the design of multi-functional soft porous robots.

Abstract:
Rendering haptic feedback with nonlinear virtual environments (VEs) is important in many applications that require highly accurate force feedback. This paper considers the use of the Koopman operator to represent a nonlinear VE interacting with a haptic system. Simulation and experimental results demonstrate that the proposed method provides an effective representation of the nonlinear dynamics of a Duffing-oscillator VE. A multi-user study further confirms this conclusion. In addition, a closed-loop (CL) stability analysis is performed leveraging the Koopman representation of the nonlinear VE to assess the stability of the overall haptic system. This alternative way of representing nonlinear VEs enables a convenient CL stability analysis that is less conservative than traditional passivity-based methods. Since a linear combination of all lifted states is used to represent the nonlinearity, such a representation is also more robust to uncertainties in the modelling of the haptic device than a traditional nonlinear model.

Abstract:
Shared control aims at assisting human operators using robots in physically and cognitively demanding tasks which cannot be automated as they require human expertise and deliberative abilities. Sharing control for a given task typically involves blending algorithms that combine human control inputs and (pre)planned assistance trajectories. Conventional blending techniques, such as Linear Blending, compute a combined output but neither guarantee the feasibility of the blended motion nor the optimality of the combined decision. In the context of teleoperation, this paper presents a formulation where blending is defined as a constrained optimal control problem. Model Predictive Control is used to determine a feasible blended trajectory through a receding horizon constrained optimization. The proposed method is evaluated in a 13-participant pick and place teleoperation study and compared to Linear Blending and unassisted Teleoperation. The experimental results demonstrate the superiority of the proposed shared control framework in terms of safety, performance as well as physical and cognitive comfort.
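
To make the limitation of Linear Blending concrete, here is a minimal sketch in which two individually feasible target positions blend into an infeasible one. The circular obstacle, the target positions, and the function names are illustrative assumptions, not the study's setup, and the sketch does not implement the paper's MPC formulation.

```python
import math

def linear_blend(u_human, u_assist, alpha):
    """Convex combination of two commands: alpha=0 is pure human, alpha=1 is pure assistance."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple((1 - alpha) * h + alpha * a for h, a in zip(u_human, u_assist))

def is_feasible(point, obstacle_center=(0.0, 0.0), obstacle_radius=0.5):
    """Hypothetical constraint: the point must stay outside a circular obstacle."""
    dx = point[0] - obstacle_center[0]
    dy = point[1] - obstacle_center[1]
    return math.hypot(dx, dy) >= obstacle_radius

human_target = (-1.0, 0.0)   # operator steers to the left of the obstacle
assist_target = (1.0, 0.0)   # planned assistance passes on the right
blended = linear_blend(human_target, assist_target, alpha=0.5)

print(is_feasible(human_target), is_feasible(assist_target), is_feasible(blended))
```

Because the feasible set here is non-convex, the midpoint of two feasible targets lands inside the obstacle, which is the kind of failure a constrained optimization over the blended trajectory is meant to exclude.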

Abstract:
We address the problem of reactive motion planning for quadrotors operating in unknown environments with dynamic obstacles. Our approach leverages a 4-dimensional spatio-temporal planner, integrated with vision-based Safe Flight Corridor (SFC) generation and trajectory optimization. Unlike prior methods that rely on map fusion, our framework is mapless, enabling collision avoidance directly from perception while reducing computational overhead. Dynamic obstacles are detected and tracked using a vision-based object segmentation and tracking pipeline, allowing robust classification of static versus dynamic elements in the scene. To further enhance robustness, we introduce a backup planning module that reactively avoids dynamic obstacles when no direct path to the goal is available, mitigating the risk of collisions during deadlock situations. We validate our method extensively in both simulation and real-world hardware experiments, and benchmark it against state-of-the-art approaches, showing significant advantages for reactive UAV navigation in dynamic, unknown environments.

Abstract:
Wearable exosuits assist human movement in tasks ranging from rehabilitation to daily activities; specifically, head-neck support is necessary for patients with certain neurological disorders. Rigid-link exoskeletons have been shown to enable head-neck mobility compared to static braces, but their bulkiness and restrictive structure inspire designs using "soft" actuation methods. In this paper, we propose a fabric pneumatic artificial muscle-based exosuit design for head-neck support. We describe the design of our prototype and physics-based model, enabling us to derive actuator pressures required to compensate for gravitational load. Our modeled range of motion and workspace analysis indicate that the limited actuator lengths impose slight limitations (83% workspace coverage), and gravity compensation imposes a more significant limitation (43% workspace coverage). We introduce compression force along the neck as a novel, potentially comfort-related metric. We further apply our model to compare the torque output of various actuator placement configurations, allowing us to select a design with stability in lateral deviation and high axial rotation torques. The model correctly predicts trends in measured data where wrapping the actuators around the neck is not a significant factor. Our test dummy and human user demonstration confirm that the exosuit can provide functional head support and trajectory tracking, underscoring the potential of artificial muscle-based soft actuation for head-neck mobility assistance.

Abstract:
Autonomous inspection is a central problem in robotics, with applications ranging from industrial monitoring to search-and-rescue. Traditionally, inspection has often been reduced to navigation tasks, where the objective is to reach a predefined location while avoiding obstacles. However, this formulation captures only part of the real inspection problem. In real-world environments, the inspection targets may become visible well before their exact coordinates are reached, making further movement both redundant and inefficient. What matters more for inspection is not simply arriving at the target's position, but positioning the robot at a viewpoint from which the target becomes observable. In this work, we revisit inspection from a perception-aware perspective. We propose an end-to-end reinforcement learning framework that explicitly incorporates target visibility as the primary objective, enabling the robot to find the shortest trajectory that guarantees visual contact with the target without relying on a map. The learned policy leverages both perceptual and proprioceptive sensing and is trained entirely in simulation, before being deployed to a real-world robot. We further develop an algorithm to compute ground-truth shortest inspection paths, which provides a reference for evaluation. Through extensive experiments, we show that our method outperforms existing classical and learning-based navigation approaches, yielding more efficient inspection trajectories in both simulated and real-world settings.

Abstract:
This paper presents a decentralized control framework for cooperative object transportation with multiple robotic manipulators. In particular, two admittance schemes are designed in order to regulate external contact wrenches and internal interaction wrenches without a central unit or all-to-all communication. Each manipulator estimates the wrenches exerted by its teammates through a bank of consensus-based observers that exploits a strongly connected communication graph. These estimates feed two local admittance filters: an external filter, computing the reference object trajectory while limiting environmental wrenches, and an internal filter, generating the end-effector trajectory to minimize each robot's contribution to internal wrenches. Experiments carried out with three 7-DOF Franka Emika Panda arms show a marked reduction of both external and internal wrenches, demonstrating the effectiveness and robustness of the proposed approach.

Abstract:
Trajectory prediction and planning in autonomous driving are highly challenging due to the complexity of predicting surrounding agents' movements and planning the ego agent's actions in dynamic environments. Existing methods encode map and agent positions and decode future trajectories in Cartesian coordinates. However, modeling the relationships between the ego vehicle and surrounding traffic elements in Cartesian space can be suboptimal, as it does not naturally capture the varying influence of different elements based on their relative distances and directions. To address this limitation, we adopt the Polar coordinate system, where positions are represented by radius and angle. This representation provides a more intuitive and effective way to model spatial changes and relative relationships, especially in terms of distance and directional influence. Based on this insight, we propose Polaris, a novel method that operates entirely in Polar coordinates, distinguishing itself from conventional Cartesian-based approaches. By leveraging the Polar representation, this method explicitly models distance and direction variations and captures relative relationships through dedicated encoding and refinement modules, enabling more structured and spatially aware trajectory prediction and planning. Extensive experiments on the challenging prediction (Argoverse 2) and planning benchmarks (nuPlan) demonstrate that Polaris achieves state-of-the-art performance.
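
The Polar representation described above can be sketched as a conversion from a Cartesian position to a radius and angle relative to the ego vehicle. The ego-centric frame, the heading convention, and the function name `to_polar` are assumptions for illustration; the abstract does not specify how Polaris encodes its inputs.

```python
import math

def to_polar(ego_xy, ego_heading, point_xy):
    """Express a point as (radius, angle) in a frame centered on the ego vehicle.

    ego_heading is in radians, measured counter-clockwise from the x-axis.
    The returned angle is relative to the ego heading and wrapped to [-pi, pi].
    """
    dx = point_xy[0] - ego_xy[0]
    dy = point_xy[1] - ego_xy[1]
    radius = math.hypot(dx, dy)
    angle = math.atan2(dy, dx) - ego_heading
    # Wrap the relative angle so that "slightly left" and "slightly right" stay close.
    angle = math.atan2(math.sin(angle), math.cos(angle))
    return radius, angle

# A hypothetical agent 2 m straight ahead of an ego vehicle that faces +y.
radius, angle = to_polar(ego_xy=(1.0, 1.0), ego_heading=math.pi / 2, point_xy=(1.0, 3.0))
print(radius, angle)
```

In this form, distance and direction to each traffic element are separate quantities, which is the property the abstract argues Cartesian coordinates do not expose directly.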

Abstract:
Weakly-supervised 3D occupancy perception is crucial for vision-based autonomous driving in outdoor environments. Previous methods based on NeRF often face a challenge in balancing the number of samples used. Too many samples can decrease efficiency, while too few can compromise accuracy, leading to variations in the mean Intersection over Union (mIoU) by 5-10 points. Furthermore, even with surrounding-view image inputs, only a single image is rendered from each viewpoint at any given moment. This limitation leads to duplicated predictions, which significantly impacts the practicality of the approach. However, this issue has largely been overlooked in existing research. To address this, we propose GSRender, which uses 3D Gaussian Splatting for weakly-supervised occupancy estimation, simplifying the sampling process. Additionally, we introduce the Ray Compensation module, which reduces duplicated predictions by compensating for features from adjacent frames. Finally, we redesign the dynamic loss to remove the influence of dynamic objects from adjacent frames. Extensive experiments show that our approach achieves SOTA results in RayIoU (+6.0), while also narrowing the gap with 3D-supervised methods. This work lays a solid foundation for weakly-supervised occupancy perception. The code will be released soon.

Abstract:
Long-term human path forecasting in crowds is critical for autonomous moving platforms (such as autonomous cars and social robots) to avoid collisions and produce high-quality plans. Although current research takes social interactions into account for prediction, it does not reveal what kinds of social interactions occur among people or how these interactions affect pedestrians' decision-making, which limits its robustness. Social interactions in pedestrian walking are intuitively numerous and hard to label and quantify. In this paper, we quantify and interpret how pedestrians interact with others by proposing Learn to Cluster. Our approach to clustering social interactions is a probabilistic latent-variable generative model that learns directly from sequential trajectory observations and scales to an arbitrary number of pedestrians. Learn to Cluster is label-free and can be naturally integrated into the training process of the prediction model. The latent variables then serve as 'labels' to categorize social interactions. Extensive experiments over several trajectory prediction benchmarks demonstrate that our method is able to learn the patterns of social interactions and effectively integrate these patterns into pedestrian trajectory prediction.

Abstract:
With the advancement of connected and automated vehicles (CAVs), achieving accurate vehicle trajectory prediction and optimal control has become a critical challenge for improving the efficiency and safety of mixed traffic flow. However, due to the complex dynamic interactions between CAVs and human-driven vehicles (HVs) and the nonlinear nature of signal coordination, existing studies lack comprehensive consideration of CAV position adjustments within the platoon and their guidance effects on trailing HVs. This paper proposes a data-driven method for CAV state prediction and trajectory optimization. Employing an application-specific improved Informer model, our method accurately predicts CAV arrival states at a signalized intersection in mixed traffic. Additionally, Bayesian optimization (BO) is utilized to achieve automated and rapid tuning of CAV model predictive control (MPC) parameters through learning human driving characteristics. Experimental results demonstrate that our proposed method significantly enhances overall traffic efficiency and optimization when CAVs operate within mixed traffic, showing strong feasibility and adaptability.

Abstract:
Dual-arm robots promise greater efficiency but require planning for complex tasks with nonlinear sub-task dependencies. Current methods using Large Language Models (LLMs) suffer from a fundamental trade-off: generating linear sequences is efficient but fails to model parallelism and adapt to changes, while iterative querying is adaptive but too slow and costly. To bridge this gap, we introduce DAG-Plan, a novel task planning framework that for the first time employs a Directed Acyclic Graph (DAG) as the central representation for dual-arm coordination. The key insight is that a DAG natively captures complex sub-task dependencies and explicitly reveals opportunities for parallel execution. Within this framework, an LLM is used only once as a powerful semantic parser to translate a natural language instruction into a structured DAG. During execution, our system dynamically assigns candidate nodes to the suitable arm based on real-time environmental observations, enabling truly adaptive and parallel operation. Extensive evaluation on a dual-arm kitchen benchmark shows that DAG-Plan's structured approach fundamentally outperforms existing paradigms. It achieves a 48% higher success rate than single-query linear sequence methods with dual arms by robustly managing dependencies, and an 84.1% higher execution efficiency than iterative querying methods by eliminating the latency of repeated LLM calls. Our work demonstrates that a principled, graph-based representation is the key to unlocking efficient and reliable LLM-based planning for complex robotic systems. More demos and information are available at https://sites.google.com/view/dag-plan.
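
The idea that a DAG exposes which sub-tasks can run in parallel can be sketched with Python's standard `graphlib` module (Python 3.9+). The kitchen sub-tasks, the arm names, and the first-come assignment rule below are hypothetical; the paper assigns nodes using real-time observations, which this sketch does not model.

```python
from graphlib import TopologicalSorter

def schedule(dag, arms=("left", "right")):
    """Dispatch ready sub-tasks to arms, one round at a time.

    dag maps each sub-task to the set of sub-tasks it depends on.
    Returns a list of rounds, each a dict {arm: sub_task}.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()              # raises graphlib.CycleError if the graph has a cycle
    rounds, pending = [], []
    while sorter.is_active():
        # Newly unblocked sub-tasks join the queue (sorted for reproducible output).
        pending.extend(sorted(sorter.get_ready()))
        batch, pending = pending[:len(arms)], pending[len(arms):]
        rounds.append(dict(zip(arms, batch)))
        for task in batch:
            sorter.done(task)     # completing a task may unblock its successors
    return rounds

# Hypothetical instruction: "crack an egg into the pan".
dag = {
    "open_fridge": set(),
    "grab_pan": set(),
    "take_egg": {"open_fridge"},
    "place_pan": {"grab_pan"},
    "crack_egg": {"take_egg", "place_pan"},
}

for step, assignment in enumerate(schedule(dag), start=1):
    print(step, assignment)
```

Five sub-tasks finish in three rounds here because independent branches run on different arms, whereas a linear sequence would need five steps.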

Abstract:
Contact-rich manipulation tasks such as precision assembly require precise control of interaction forces, yet existing imitation learning methods rely mainly on vision-only demonstrations. We propose ManipForce, a handheld system designed to capture high-frequency force-torque (F/T) and RGB data during natural human demonstrations for contact-rich manipulation. Building on these demonstrations, we introduce the Frequency-Aware Multimodal Transformer (FMT). FMT encodes asynchronous RGB and F/T signals using frequency- and modality-aware embeddings and fuses them via bi-directional cross-attention within a transformer diffusion policy. Through extensive experiments on six real-world contact-rich manipulation tasks (such as gear assembly, box flipping, and battery insertion), FMT trained on ManipForce demonstrations achieves robust performance with an average success rate of 83% across all tasks, substantially outperforming RGB-only baselines. Ablation and sampling-frequency analyses further confirm that incorporating high-frequency F/T data and cross-modal integration improves policy performance, especially in tasks demanding high precision and stable contact. Hardware, software, and video demos are available at: https://sites.google.com/view/manipforce/.

Abstract:
Intercepting fast-moving objects, by its very nature, is challenging because of its tight time constraints. This problem becomes further complicated in the presence of sensor noise because noisy sensors provide, at best, incomplete information, which results in a distribution over target states to be intercepted. Since time is of the essence, to hit the target, the planner must begin directing the interceptor, in this case a robot arm, while still receiving information. We introduce a tree-like structure, which is grown using kinodynamic motion primitives in state-time space. This tree-like structure encodes reachability to multiple goals from a single origin, while enabling real-time value updates as the target belief evolves and seamless transitions between goals. We evaluate our framework on an interception task on a 6-DOF industrial arm (ABB IRB-1600) with an onboard stereo camera (ZED 2i). A robust Innovation-based Adaptive Estimation Adaptive Kalman Filter (RIAE-AKF) is used to track the target and perform belief updates.

Abstract:
A clean map of the surrounding environment is essential for autonomous driving systems to ensure reliable localization and safe path planning. However, the existence of dynamic objects introduces ghost traces into the map, significantly degrading its quality. To address this issue, we propose EnhanceERASOR, a two-stage framework for static 3D point cloud mapping, consisting of a lightweight Online-ERASOR stage for real-time static mapping and an Offline-Refinement stage for global optimization. The Online-ERASOR stage utilizes the egocentric ratio of pseudo occupancy between consecutive scans to identify dynamic points, followed by verification and post-processing strategies to suppress false positives and false negatives. The Offline-Refinement stage introduces a submap-to-map consistency check to suppress semi-dynamic and slow-moving objects, and adopts a voxel-guided strategy for dense static mapping. Extensive experiments on diverse datasets with different scenarios and sensors demonstrate the superior performance, robustness, and generalization ability of our proposed method in static map construction.

Abstract:
Quadruped robots show important potential for load-carrying tasks due to their terrain adaptability, and a unique challenge of these tasks is to maintain quadrupedal stability when the load has active and dynamic characteristics. The mass and center of mass of such a load change dynamically, rather than the load being integrated as a whole-body component of the quadruped. Unlike traditional load-carrying tasks, where the load is typically passive and its influence on the robot's movement is predictable and static, active dynamic loads can actively alter the robot's balance control in real time, posing load disturbances to locomotion. These load disturbances, when combined with the fundamental attitude changes induced by complex terrain, create dual dynamic disturbances for the robot. To address these dual disturbances, we propose an active dynamic load modeling approach that captures the active and dynamic characteristics of the load, enabling the robot to adapt to the real-time changes in load movement. This approach is integrated into a Reinforcement Learning (RL) framework that leverages dynamic models: an Inverse Dynamic Model (IDM) that learns the dynamic characteristics of the active load, and a Forward Dynamic Model (FDM) that predicts the effects of complex terrain on the robot's motion, enabling synchronous adaptation to both types of dynamic disturbances. Extensive comparative simulations and physical experiments across diverse terrains, with active dynamic loads of varying movements, demonstrate the effectiveness of our method in enhancing balance control and adaptability.

Abstract:
Aerial manipulators extend the reach and manipulation capabilities of uncrewed multirotor aerial vehicles for inspection, agriculture, sampling, and delivery. Continuum arm aerial manipulation systems offer lightweight, dexterous, and compliant interaction opportunities. Existing designs allow manipulation only below the UAV, which restricts their deployability in multiple directions and through clutter. They are also sensitive to propeller downwash. Addressing these limitations, we present Tilt-X, a continuum arm aerial manipulator that integrates a tilting mechanism, a telescopic stage, and a cable-driven continuum section. We present its design and kinematic model and validate it through flight demonstrations. Tilt-X enables a volumetric workspace with up to 75 mm extension and planar orientations between 0° and 90°. Experiments comparing end effector pose with and without downwash quantitatively measure its accuracy, providing critical evidence to guide the design and control of reliable aerial manipulators. Results show stabilisation of end effector pose as the manipulator extends out of the propeller influence zone.

Abstract:
High-fidelity simulation of vision-based tactile sensors is essential for developing data-driven robotic manipulation algorithms. However, a significant sim-to-real gap persists due to the difficulty in modeling complex optical effects, such as refraction through protective glass layers, and in accurately estimating physical parameters like sensor pose and lighting. To bridge this gap, we introduce a novel, fully differentiable pipeline for visual tactile simulation. Leveraging a differentiable path tracer, our method optimizes critical parameters (including camera pose, lighting conditions, and object texture) directly from just three real images. This approach achieves highly realistic simulations with physically accurate light transport and glass refraction. We validate our method through a comprehensive benchmark against real-world data, demonstrating state-of-the-art sim-to-real accuracy. We also enable novel applications, such as mesh reconstruction from a single tactile image via inverse rendering. To overcome the computational cost of path tracing, we further use an image-to-image translation model. This model uses high-fidelity simulated data alongside Normalized Object Coordinate Space (NOCS) maps as input, preserving crucial deformation information while enabling rapid inference. The code is available at https://tacdiffrend.github.io.

Abstract:
Teleoperation via natural language reduces operator workload and enhances safety in high-risk or remote settings. However, in dynamic remote scenes, transmission latency during bidirectional communication creates gaps between remote perceived states and operator intent, leading to command misunderstanding and incorrect execution. To mitigate this, we introduce the Spatio-Temporal Open-Vocabulary Scene Graph (ST-OVSG), a representation that enriches open-vocabulary perception with temporal dynamics and lightweight latency annotations. ST-OVSG leverages LVLMs to construct open-vocabulary 3D object representations, and extends them into the temporal domain via Hungarian assignment with our temporal matching cost, yielding a unified spatio-temporal scene graph. A latency tag is embedded to enable LVLM planners to retrospectively query past scene states, thereby resolving local-remote state mismatches caused by transmission delays. To further reduce redundancy and highlight task-relevant cues, we propose a task-oriented subgraph filtering strategy that produces compact inputs for the planner. ST-OVSG generalizes to novel categories and enhances planning robustness against transmission latency without requiring fine-tuning. Experiments show that our method achieves 74% node accuracy on the Replica benchmark, outperforming ConceptGraph. Notably, in the latency-robustness experiment, the LVLM planner assisted by ST-OVSG achieved a planning success rate of 70.5%. We refer to the project for the code and results.
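
The temporal association step can be illustrated as a minimum-cost one-to-one matching between objects in consecutive frames. This sketch uses brute-force enumeration in place of the Hungarian algorithm (both return a minimum-cost assignment, but enumeration is only practical for a handful of objects), and it uses centroid distance as a stand-in cost. The paper's temporal matching cost is not given in the abstract, so the cost function, the data, and the name `match_objects` are assumptions.

```python
import math
from itertools import permutations

def match_objects(prev_centroids, curr_centroids):
    """Return index pairs (i, j) matching previous-frame objects to current-frame objects.

    Minimizes the summed Euclidean distance between matched 3D centroids.
    Both lists must have the same length in this simplified version.
    """
    n = len(prev_centroids)
    if n != len(curr_centroids):
        raise ValueError("this sketch assumes equally many objects in both frames")

    def total_cost(perm):
        return sum(math.dist(prev_centroids[i], curr_centroids[j])
                   for i, j in enumerate(perm))

    best = min(permutations(range(n)), key=total_cost)
    return list(enumerate(best))

# Hypothetical centroids: the two objects are listed in swapped order in the new frame.
prev_frame = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
curr_frame = [(5.2, 0.0, 0.0), (0.1, 0.0, 0.0)]
print(match_objects(prev_frame, curr_frame))
```

For realistic object counts, a polynomial-time solver such as the Hungarian algorithm would replace the enumeration, with the same matching interface.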

Abstract:
Task and motion planning (TAMP) is a well-established approach for solving long-horizon robot planning problems. Although TAMP methods have historically assumed that each task-level robot action, or skill, can be reduced to kinematic motion planning, recent work has explored integrating closed-loop controllers and learned skills into TAMP-style systems. Our approach integrates pre-existing, heterogeneous robot skills--including learned, force-controlled, and black-box policies--into a hierarchical planner while preserving the object-centric failure reasoning of typical TAMP solvers. We leverage Composable Interaction Primitives (CIPs) to synthesize head and tail motion plans bridging consecutive skills, facilitating both planning-time refinement and execution-time adjustment. We validate our Task and Skill Planning (TASP) approach through real-world experiments on a bimanual manipulator and a mobile manipulator, demonstrating that CIPs enable diverse robots to combine heterogeneous skills to solve complex, long-horizon tasks, including multi-room mobile manipulation problems with non-monotonic task structure.

Abstract:
Grasping operations constitute a fundamental mechanism for robotic interaction with the environment and task execution, playing a critical role in logistics, unmanned systems, and complex terrain exploration. Conventional rigid grasping devices are often bulky and exhibit limited adaptability and controllability in unstructured environments. Suction-based grippers offer improved environmental compliance but typically require extensive tubing and vacuum pumps, constraining their integration into lightweight and soft robotic platforms. Inspired by octopus suction cups, recent bioinspired designs have leveraged geometrical optimization and flexible materials to enhance adhesion, yet most still rely on external actuation or complex vacuum systems, failing to replicate the rapid, reversible adhesion achieved through muscular contraction. To address this challenge, we present a bioinspired suction actuator based on liquid crystal elastomer (LCE), exploiting its reversible anisotropic-isotropic phase transition under thermal stimuli to dynamically modulate the cavity volume and generate controllable negative pressure. The proposed design closely emulates octopus muscle mechanics while significantly simplifying structural complexity, achieving a combination of light weight, compliance, and programmability. Experiments demonstrate stable adhesion of 56 kPa on glass over 300 cycles, with rapid and reliable attachment/detachment under varying conditions, highlighting potential applications in climbing robots, aerial grasping, and underwater exploration.

Abstract:
A robot with kinematical redundancy with respect to a main task may perform additional tasks simultaneously with the main one. Often, it is desirable to prioritize the performance of some tasks over that of others. To create a strict priority between the different tasks, meaning the performance of higher-prioritized tasks is unaffected by lower-prioritized tasks, null-space projections are often used. Null-space projections may, however, cause the closed-loop system to lose the desirable passivity property, which is necessary to ensure stable interactions with passive environments. In previous works, an energy tank has therefore been introduced to compensate for the potential activity stemming from the null-space projections. However, if the energy tank becomes empty when using these previous methods, the performance of the lower-prioritized tasks suffers more than when using a classical, non-passive hierarchical control scheme. Thus, a new approach to handling this case is proposed in this work. In the event of the energy tank becoming empty and unable to compensate for any null-space projection-induced activity, the hierarchy is ceded to preserve the passivity of the system, leading to better performance of the lower-prioritized tasks compared to previous passivation schemes. Output strict passivity of the closed-loop system is proven irrespective of the amount of energy available from the energy tank, and the performance of the proposed method is validated and compared to that of a classical hierarchical impedance controller and that of an earlier passivation method through simulation and experiments of redundant robotic manipulators.
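
As a minimal sketch of what a null-space projection does, the following velocity-level example projects a secondary joint motion so that it cannot disturb a one-dimensional main task. It assumes a single-row task Jacobian and a plain (unweighted) pseudoinverse; the function names are hypothetical, and the torque-level, passivity-preserving scheme of the paper is not reproduced here.

```python
def nullspace_projector(J):
    """Return N = I - J^+ J for a single-row task Jacobian J (list of floats)."""
    jjt = sum(j * j for j in J)  # J J^T is a scalar for a one-dimensional task
    n = len(J)
    return [[(1.0 if r == c else 0.0) - J[r] * J[c] / jjt for c in range(n)]
            for r in range(n)]


def prioritized_velocity(J, xdot, qdot_secondary):
    """Joint velocity that achieves xdot and adds the secondary motion in the null space."""
    jjt = sum(j * j for j in J)
    primary = [j * xdot / jjt for j in J]  # J^+ xdot
    N = nullspace_projector(J)
    secondary = [sum(N[r][c] * qdot_secondary[c] for c in range(len(J)))
                 for r in range(len(J))]
    return [p + s for p, s in zip(primary, secondary)]


J = [1.0, 2.0, 0.5]            # illustrative Jacobian of a 3-joint arm
qdot = prioritized_velocity(J, xdot=0.3, qdot_secondary=[0.4, -0.1, 0.2])
task_velocity = sum(j * q for j, q in zip(J, qdot))  # equals xdot up to rounding
```

Because J N = 0, the main task velocity is unchanged whatever secondary motion is requested, which is the strict priority described above.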

Abstract:
In typical robot learning, deep reinforcement learning policies are employed in the upper control layer to generate target joint angles for robot motion, while conventional controllers are used in the fast lower control layer to control each joint motor. This paper presents a fully neural network-based hierarchical reinforcement learning approach for real-time robot joint control. The proposed method divides joint control into two layers: a high-frequency current control policy and a low-frequency position control policy. The current control policy drives the motor to follow the target current while learning the dynamic characteristics of the joint. The position control policy generates the target current to achieve a desired joint angle, allowing learning and inference at a slower frequency. By decoupling motor dynamics from position control, our method improves learning performance and enables policy generalization across joints. Experimental results on a three-joint robotic arm demonstrate the effectiveness of the proposed approach, including posture control using a shared position control policy across joints.

Abstract:
With the increasing presence of service robots and autonomous vehicles in human environments, navigation systems need to evolve beyond simple destination reach to incorporate social awareness. This paper introduces GSON, a novel group-based social navigation framework that leverages Large Multimodal Models (LMMs) to enhance robots' social perception capabilities. Our approach uses visual prompting to enable zero-shot extraction of social relationships among pedestrians and integrates these results with robust pedestrian detection and tracking pipelines to overcome the inherent inference speed limitations of LMMs. The planning system incorporates a mid-level planner that sits between global path planning and local motion planning, effectively preserving both global context and reactive responsiveness while avoiding disruption of the predicted social group. We validate GSON through extensive real-world mobile robot navigation experiments involving complex social scenarios such as queuing, conversations, and photo sessions. Comparative results show that our system significantly outperforms existing navigation approaches in minimizing social perturbations while maintaining comparable performance on traditional navigation metrics.

Abstract:
Accurate LiDAR-inertial odometry (LIO) highly depends on the geometric fidelity of the underlying environment representation. We explore the new and interesting research direction of integrating semantic segmentation models into metric odometry algorithms to enrich their representational capacity. Specifically, this letter proposes a semantic-driven hybrid voxel representation in which an off-the-shelf 3D segmentation network assigns every voxel to either a planar or nonplanar class, using planar and Gaussian representations, respectively. Consequently, a hybrid scan matching strategy is presented using class-specific residual models that are tailored to the distinct error statistics of each surface category. The scan matcher is embedded within an Iterated Extended Kalman Filter (IEKF) for odometry and mapping. We evaluate our method on diverse platforms and environments, and show improved localization accuracy across various indoor and outdoor scenarios, while maintaining real-time performance.

Abstract:
We provide an algorithm for adaptive legged locomotion via online learning and model predictive control. The algorithm is composed of two interacting modules: model predictive control (MPC) and online learning of residual dynamics. The residual dynamics can represent modeling errors and external disturbances. We are motivated by the future of autonomy where quadrupeds will autonomously perform complex tasks despite real-world unknown uncertainty, such as unknown payload and uneven terrains. The algorithm uses random Fourier features to approximate the residual dynamics in reproducing kernel Hilbert spaces. Then, it employs MPC based on the current learned model of the residual dynamics. The model is updated online in a self-supervised manner using least squares based on the data collected while controlling the quadruped. The algorithm enjoys sublinear dynamic regret, defined as the suboptimality against an optimal clairvoyant controller that knows the residual dynamics. We validate our algorithm in Gazebo and MuJoCo simulations, where the quadruped aims to track reference trajectories. The Gazebo simulations include constant unknown external forces up to 12g, where g is the gravity vector, in flat terrain, slope terrain with 20-degree inclination, and rough terrain with 0.25 m height variation. The MuJoCo simulations include time-varying unknown disturbances with payload up to 8 kg and time-varying ground friction coefficients in flat terrain.
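
The random Fourier feature idea can be sketched as follows. This example only shows that the inner product of random cosine features approximates a Gaussian kernel; the online least-squares update and the MPC loop from the abstract are not included. The helper `make_rff`, the unit lengthscale, and the feature count are assumptions chosen for illustration.

```python
import math
import random


def make_rff(dim, num_features, lengthscale=1.0, seed=0):
    """Build a random Fourier feature map for the kernel exp(-||x - y||^2 / (2 l^2))."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
               for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return features


features = make_rff(dim=2, num_features=4000)
x, y = (0.3, -0.2), (0.8, 0.1)
approx = sum(a * c for a, c in zip(features(x), features(y)))  # feature inner product
exact = math.exp(-math.dist(x, y) ** 2 / 2.0)                  # Gaussian kernel value
```

Since the features are finite-dimensional, a model that is linear in them can be fitted with ordinary least squares, which is what makes the online update described above cheap.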

Abstract:
Perceiving the physical properties of different surfaces/textures via tactile sensing has been a long-standing problem in robotics. Most prior work has been limited to discriminative models that classify textures into a fixed set of categories. However, to enable seamless autonomous manipulation, robots must infer physical properties as structured, continuous variables rather than as discrete class labels. In this work, we present a novel deep state-space model (DSSM) to learn and infer key causal textural properties in an unsupervised manner. Using variational inference to solve the DSSM, our proposed Latent Filter allows robotic systems to perceive textures in a continuous and generalizable manner. In addition, we explore a novel interaction approach: Tacser (Tactile Enhancer), to further enhance tactile sensing through vibrations induced by high-frequency micro-movements and thereby improve perception. We evaluated our approach against state-of-the-art techniques and performed extensive ablation studies to demonstrate its effectiveness. This work advances tactile-based texture perception, providing a generalizable and comprehensive framework for robotics.

Abstract:
Gecko-inspired adhesives have attracted considerable attention due to their unique combination of strong yet reversible adhesion to diverse surfaces. However, their integration into robotic systems remains limited due to sensitivity to contact alignment, typically requiring near-perpendicular engagement. Yet, many robotic tasks involve varying approach and detachment angles, highlighting the need for adhesion systems that operate reliably across different orientations and loading conditions. This study addresses two key questions: Can the adhesion strength of gecko-inspired adhesives, integrated into robotic systems, be optimized using trajectory optimization? And is this optimization surface-dependent? A gecko-inspired adhesive was integrated on a robotic arm's end-effector, which attached to and detached from surfaces along various trajectories. The arm's energy expenditure for each attachment-detachment cycle, along with the corresponding adhesion strength, was measured. An online particle swarm optimization (PSO) algorithm was applied to identify conditions that optimized adhesion strength, either to resist or ease detachment. Results show that trajectory optimization significantly improves both adhesion strength and detachment efficiency up to 17-fold, with surface-specific effectiveness. These findings underscore the importance of considering both the forces generated by gecko-inspired adhesives and the energy required by the robot to attach and detach from surfaces at various angles and positions. By optimizing adhesion strength across surfaces, this study helps overcome current limitations in the use of gecko-inspired adhesives for robotic applications, including grippers and climbers.
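
A basic particle swarm optimizer can be sketched in a few lines. In the study the objective is measured on the physical robot; here it is replaced by a synthetic quadratic in two hypothetical trajectory angles, and the inertia and acceleration coefficients are common textbook values rather than the authors' settings.

```python
import random


def pso_minimize(objective, bounds, num_particles=30, iters=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective over box bounds with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(num_particles)]
    vel = [[0.0] * dim for _ in range(num_particles)]
    best_pos = [p[:] for p in pos]                 # per-particle best positions
    best_val = [objective(p) for p in pos]
    g = min(range(num_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]     # swarm-wide best
    for _ in range(iters):
        for i in range(num_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


# Synthetic stand-in for a measured detachment cost over two angles (degrees).
def synthetic_cost(angles):
    approach, peel = angles
    return (approach - 30.0) ** 2 + (peel - 60.0) ** 2


best_angles, best_cost = pso_minimize(synthetic_cost, [(0.0, 90.0), (0.0, 90.0)])
```

To maximize adhesion instead (resisting detachment), the same routine can be run on the negated measurement.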

Abstract:
We focus on planning minimum-length robot paths to cover environments using the robot's sensor or coverage (e.g. cleaning) tool. Many algorithms use the following framework: (i) compute a grid decomposition of the environment, (ii) partition the grid to be covered by non-overlapping coverage lines (straight-line paths), and (iii) compute a cost-minimizing tour of the coverage lines to get a coverage path. While this framework aims to minimize turns in the path, it does not yield guarantees on the resulting path length. In this paper, we show that this framework guarantees a coverage path of length (1 + 1.5 gamma) times the optimal, where gamma > 1 is the approximation factor to solve the metric traveling salesman problem (metric-TSP). Following this, we propose the Minimum Length Coverage Approx (MLC-Approx) approach that modifies this framework to achieve an approximation factor of (1.5 + epsilon), where epsilon << 1 depends on the number of coverage lines. Instead of computing a tour of the coverage lines, MLC-Approx merges minimum-length sub-tours of coverage lines while minimizing the turns added by the merges. We also propose a lazy variation of MLC-Approx that achieves the same result with faster empirical runtime. We validate MLC-Approx in simulations using maps of real-world environments and compare against state-of-the-art CPP approaches.
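
To make the two bounds concrete, the short calculation below plugs in numbers. The value gamma = 1.5 corresponds to the classical Christofides guarantee for metric-TSP; the value of epsilon is a purely illustrative assumption, since the abstract only states that it depends on the number of coverage lines.

```python
def framework_factor(gamma):
    """Approximation factor (1 + 1.5 * gamma) of the standard coverage framework."""
    return 1.0 + 1.5 * gamma


def mlc_approx_factor(epsilon):
    """Approximation factor (1.5 + epsilon) stated for MLC-Approx."""
    return 1.5 + epsilon


christofides_gamma = 1.5                              # metric-TSP guarantee of Christofides
baseline = framework_factor(christofides_gamma)       # 1 + 1.5 * 1.5 = 3.25
improved = mlc_approx_factor(0.05)                    # epsilon = 0.05 is an assumed value
```

Under these assumptions the guaranteed worst-case path length drops from 3.25 times optimal to roughly 1.55 times optimal.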

Abstract:
Unmanned aerial vehicles (UAVs) performing tasks such as transportation and aerial photography are vulnerable to intentional projectile attacks from humans. Dodging such a sudden and fast projectile poses a significant challenge for UAVs, requiring ultra-low latency responses and agile maneuvers. Drawing inspiration from baseball, in which pitchers' body movements are analyzed to predict the ball's trajectory, we propose a novel real-time dodging system that leverages an RGB-D camera. Our approach integrates human pose estimation with depth information to predict the attacker's motion trajectory and the subsequent projectile trajectory. Additionally, we introduce an uncertainty-aware dodging strategy to enable the UAV to dodge incoming projectiles efficiently. Our perception system achieves high prediction accuracy and outperforms the baseline in effective distance and latency. The dodging strategy addresses temporal and spatial uncertainties to ensure UAV safety. Extensive real-world experiments demonstrate the framework's reliable dodging capabilities against sudden attacks and its outstanding robustness across diverse scenarios.

Abstract:
Imitation learning in contact-rich tasks requires both global spatial awareness and fine-grained in-hand interaction understanding. However, vision-only policies based on images or point clouds are often susceptible to occlusion and struggle to capture critical contact details, particularly in visually ambiguous regions or during subtle tactile interactions. In this work, we present PoCoDP3, a pose- and contact-aware visual-tactile policy that integrates 3D point clouds and tactile inputs to generate actions in contact-rich tasks. PoCoDP3 introduces a dual-branch tactile encoder that jointly models contact dynamics and estimates in-hand object pose, enabling structured tactile representations for precise contact-rich manipulation. A contact-driven cross-modal fusion mechanism adaptively prioritizes sensory modalities based on real-time interaction cues, enabling efficient visual-tactile integration. Moreover, a reference-guided diffusion policy leverages reference action offsets to reduce sampling steps, significantly accelerating inference while maintaining action quality. Experiments across simulation and real-world tasks demonstrate that PoCoDP3 consistently outperforms representative 2D and 3D policies in terms of both accuracy and inference efficiency.

Abstract:
In many robotics applications, it is necessary to compute not only the distance between the robot and the environment, but also its derivative - for example, when using control barrier functions. However, since the traditional Euclidean distance is not guaranteed to be differentiable everywhere, there is a need for alternative distance metrics that possess this property. Recently, a metric with guaranteed differentiability was proposed [1]. This approach has some important drawbacks, which we address in this paper. We provide much simpler and more practical expressions for the smooth projection for general convex polytopes. Additionally, as opposed to [1], we ensure that the distance vanishes as the objects overlap. We show the efficacy of the approach in experimental results. Our proposed distance metric is publicly available through the Python-based simulation package UAIBot.

Abstract:
We present a scalable framework for cross-embodiment humanoid robot control by learning a shared latent representation that unifies motion across humans and diverse humanoid platforms, including single-arm, dual-arm, and legged humanoid robots. Our method proceeds in two stages: first, we construct a decoupled latent space that captures localized motion patterns across different body parts using contrastive learning, enabling accurate and flexible motion retargeting even across robots with diverse morphologies. To enhance alignment between embodiments, we introduce tailored similarity metrics that combine joint rotation and end-effector positioning for critical segments, such as arms. Then, we train a goal-conditioned control policy directly within this latent space using only human data. Leveraging a conditional variational autoencoder, our policy learns to predict latent space displacements guided by intended goal directions. We show that the trained policy can be directly deployed on multiple robots without any adaptation. Furthermore, our method supports the efficient addition of new robots to the latent space by learning only a lightweight, robot-specific embedding layer. The learned latent policies can also be directly applied to the new robots. Experimental results demonstrate that our approach enables robust, scalable, and embodiment-agnostic robot control across a wide range of humanoid platforms.

Abstract:
As robots integrate into human society, safe robot-environment interaction has emerged as a growing priority. A promising solution is introducing compliance to existing robots, akin to musculoskeletal systems, to absorb impacts. However, mimicking longitudinal compliance in biological joints remains a challenge due to its complex architecture. Here, adapting the elastic longitudinal movement structure of the knee, we incorporate mechanical hinges with a compact buffer structure to enable both simple rotation and effective longitudinal impact absorption. Under longitudinal loading, the buffer structure transmits the limited compression to amplified deformations of elastic elements, and thus produces resistance. The load-displacement curve is tailored for a high-static-low-dynamic stiffness to improve energy absorption efficiency. Drop tests and walking robot demonstrations confirm that our knee-inspired hinge not only mitigates acceleration transmitted to the robot body but also reduces ground reaction forces, thus improving robot-environment interaction safety. This work highlights the design paradigm of adapting natural solutions, and holds potential for direct integration into robots.

Abstract:
Autonomous aerial navigation in absolute darkness is crucial for post-disaster search and rescue operations, which often take place amid disaster-zone power outages. Yet, due to resource constraints, tiny aerial robots, perfectly suited for these operations, are unable to navigate in the darkness to find survivors safely. In this paper, we present an autonomous aerial robot for navigation in the dark by combining an Infra-Red (IR) monocular camera with a large-aperture coded lens and structured light without external infrastructure like GPS or motion-capture. Our approach obtains depth-dependent defocus cues (each structured light point appears as a pattern that is depth dependent), which act as a strong prior for our AsterNet deep depth estimation model. The model is trained in simulation by generating data using a simple optical model and transfers directly to the real world without any fine-tuning or retraining. AsterNet runs onboard the robot at 20 Hz on an NVIDIA Jetson Orin Nano. Furthermore, our network is robust to changes in the structured light pattern and relative placement of the pattern emitter and IR camera, leading to simplified and cost-effective construction. We successfully evaluate and demonstrate our proposed depth navigation approach AsterNav using depth from AsterNet in many real-world experiments using only onboard sensing and computation, including dark matte obstacles and thin ropes (diameter 6.25 mm), achieving an overall success rate of 95.5% with unknown object shapes, locations and materials. To the best of our knowledge, this is the first work on monocular, structured-light-based quadrotor navigation in absolute darkness.

Abstract:
This paper presents a hierarchical control framework for quadrupedal locomotion that unifies the complementary strengths of model-based optimization and reinforcement learning. We develop a convex Quadratic Programming (QP) solver based on the primal-dual Chambolle-Pock algorithm, enabling both massively parallel policy training and real-time deployment through efficient handling of constrained optimization problems. Our hierarchical framework employs learned policies for robust high-level control to handle real-world perturbations, while ensuring safety and energy efficiency through a low-level whole-body controller powered by the proposed solver. Extensive benchmarks and experimental validation demonstrate quantifiable improvements in energy consumption, constraint satisfaction, and task transferability across simulated and real-world environments.
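
The Chambolle-Pock iteration for a box-constrained QP can be sketched in plain Python. This is not the authors' solver: it assumes a diagonal cost matrix so that the primal proximal step has a closed form, uses fixed step sizes satisfying tau * sigma * ||A||^2 <= 1, and the function name `chambolle_pock_qp` is hypothetical.

```python
def chambolle_pock_qp(p_diag, q, A, lower, upper, tau, sigma, iters=2000):
    """Minimize 0.5 * x'Px + q'x subject to lower <= A x <= upper, with P diagonal."""
    n, m = len(q), len(A)
    x = [0.0] * n
    x_bar = x[:]          # extrapolated primal point
    y = [0.0] * m         # dual variables, one per constraint row
    for _ in range(iters):
        # Dual step: prox of the conjugate of the box indicator (Moreau decomposition).
        for i in range(m):
            v = y[i] + sigma * sum(A[i][j] * x_bar[j] for j in range(n))
            proj = min(max(v / sigma, lower[i]), upper[i])
            y[i] = v - sigma * proj
        # Primal step: prox of the quadratic cost, closed form for diagonal P.
        x_new = []
        for j in range(n):
            at_y = sum(A[i][j] * y[i] for i in range(m))
            x_new.append((x[j] - tau * (at_y + q[j])) / (1.0 + tau * p_diag[j]))
        # Extrapolation with relaxation parameter 1.
        x_bar = [2.0 * xn - xo for xn, xo in zip(x_new, x)]
        x = x_new
    return x, y


# minimize 0.5 * (x1^2 + x2^2) - x1 - x2  subject to  x1 + x2 <= 1
x_opt, y_opt = chambolle_pock_qp(
    p_diag=[1.0, 1.0], q=[-1.0, -1.0], A=[[1.0, 1.0]],
    lower=[float("-inf")], upper=[1.0], tau=0.5, sigma=0.5)
```

Each iteration needs only matrix-vector products and elementwise operations, which is the property that makes this family of methods attractive for batched, massively parallel execution.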

Abstract:
Autonomous quadrotor flight in confined spaces such as pipes and tunnels presents significant challenges due to unsteady, self-induced aerodynamic disturbances. Very recent advances have enabled flight in such conditions, but they either rely on constant motion through the pipe to mitigate airflow recirculation effects or suffer from limited stability during hovering. In this work, we present the first closed-loop control system for quadrotors for hovering in narrow pipes that leverages real-time flow field measurements. We develop a low-latency, event-based smoke velocimetry method that estimates local airflow at high temporal resolution. This flow information is used by a disturbance estimator based on a recurrent convolutional neural network, which infers force and torque disturbances in real time. The estimated disturbances are integrated into a learning-based controller trained via reinforcement learning. To the best of our knowledge, this work represents the first demonstration of an aerial robot with closed-loop control informed by real-time flow field measurements.

Abstract:
Equipping robotic end-effectors with human-like tactile perception is crucial for dexterous manipulation, requiring simultaneous thermal and mechanical sensing at the contact interface. Conventional multimodal sensors often rely on stacked or patterned layers, which increase device thickness, reduce conformability on curved robotic fingers, and introduce response delays. To address this, we present a time-division tactile perception platform tailored for robotic applications that utilizes memristive Ag-Cu2O core-sheath nanowire networks. This ultrathin artificial skin alternates between thermal and mechanical modalities at 16 Hz via memristive transitions, mirroring the processing of biological mechanoreceptors. In the SET state, sparse silver filaments form a mechanically sensitive network. During RESET, the semiconducting Cu2O sheath provides high thermal sensitivity. Lacking reactive components, the sensor achieves sub-microsecond mechanical and millisecond thermal responses, ideal for real-time robotic feedback. A deep learning pipeline processing these time-division signals improved object classification accuracy to 95%. Using a wireless module, 20 household objects were recognized with 83% accuracy. This single-layer architecture enables direct, seamless integration onto robotic hands, laying the groundwork for multimodal tactile intelligence in physical AI.

Abstract:
Fence-free collaborative manufacturing lets workers and machines share space, but autonomous safety monitoring cannot handle every situation alone. When an anomaly is flagged - unauthorized access, ambiguous sensor data, or unexpected worker behavior - a human operator must visually assess the scene. Our open-source framework deploys a mobile inspection robot controlled through an immersive VR headset. The operator sees through the robot's cameras with low-latency head-coupled video, navigates to the scene, and assesses the situation remotely - closing the loop between autonomous detection and human decision-making.

Abstract:
This work presents a compact, four-degree-of-freedom fingertip cutaneous feedback device capable of rendering normal force, shear, and rotational cues over a large range of motion. Based on a tendon-driven truss mechanism, the device achieved high repeatability in rotational and shear motions, with consistent positional errors that could be addressed through feedforward compensation. A proof-of-concept user study with additional z-axis actuation demonstrated reliable perception of all three tactile cues, achieving accuracies of 98.3% for normal force, 77.5% for rotation, and 91.2% for shear. These results support the feasibility of the proposed mechanism as a compact multi-modal tactile interface for a future handheld haptic device.

Abstract:
The Velocity Flow Field (VFF) lower-limb exoskeleton controller is widely applicable for gait rehabilitation because it provides the user with considerable agency over their gait; however, previous studies reported the feeling of "walking through water" and resistance to the user's efforts. In this work, a mathematical explanation for the viscous damping behavior when users deviate from the reference trajectory is presented. The controller is corrected, and an adaptation law is proposed that synchronizes the speed gain with the user's current walking speed by minimizing the average mechanical work transferred between the user and exoskeleton per step. Experiments comparing a fixed and an adaptive controller with 12 participants walking at 0.4 +/- 0.1 body length/s on a treadmill showed that the adaptive controller tracks changes in walking speed, while reducing the energy absorbed by 0.589 +/- 0.126 J/step compared to the fixed controller at the fastest walking speed. Analysis of changes in muscle effort and interaction torques with a human-exoskeleton interaction portrait showed that for most participants, the adaptive controller at medium and fast speeds substantially reduced user-controller disagreement and increased user agency over the walking motion. These positive results suggest that optimizing the energy supplied per step can serve as an effective coordination mechanism, enabling personalized and real-time adjustments of walking speed between the user and the exoskeleton.

Abstract:
Tactile information plays a crucial role in human manipulation tasks and has recently garnered increasing attention in robotic manipulation. However, existing approaches mostly focus on the alignment of visual and tactile features, and the integration mechanism tends to be direct concatenation. Consequently, they struggle to cope with occluded scenarios because they neglect the inherent complementary nature of the two modalities and may not fully exploit the alignment, limiting their potential for real-world deployment. In this paper, we present ViTaS, a simple yet effective framework that incorporates both visual and tactile information to guide the behavior of an agent. We introduce Soft Fusion Contrastive Learning, an advanced version of the conventional contrastive learning method, and a CVAE module to utilize the alignment and complementarity within visuo-tactile representations. We demonstrate the effectiveness of our method in 12 simulated and 3 real-world environments, and our experiments show that ViTaS significantly outperforms existing baselines. Project page: https://skyrainwind.github.io/ViTaS/index.html.

Abstract:
We present an efficient task and motion replanning approach for sequential multi-object manipulation in dynamic environments. Conventional Task And Motion Planning (TAMP) solvers experience an exponential increase in planning time as the planning horizon and number of objects grow, limiting their applicability in real-world scenarios. To address this, we propose learning problem decompositions from demonstrations to accelerate TAMP solvers. Our approach consists of three key components: goal decomposition learning, computational distance learning, and object reduction. Goal decomposition identifies the necessary sequences of states that the system must pass through before reaching the final goal, treating them as subgoal sequences. Computational distance learning predicts the computational complexity between two states, enabling the system to identify the temporally closest subgoal from a disturbed state. Object reduction minimizes the set of active objects considered during replanning, further improving efficiency. We evaluate our approach on three benchmarks, demonstrating its effectiveness in improving replanning efficiency for sequential multi-object manipulation tasks in dynamic environments.

Abstract:
Robotics and automation are key enablers to increase throughput in ongoing conservation efforts across various threatened ecosystems. Cataloguing, digitisation, husbandry, and similar activities require the ability to interact with delicate, fragile samples without damaging them. Additionally, learning-based solutions to these tasks require the ability to safely acquire data to train manipulation policies, e.g., reinforcement learning. To address these twin needs, we introduce a novel method to print free-form, highly sensorised soft physical twins. We present an automated design workflow to create complex and customisable 3D soft sensing structures on demand from 3D scans or models. Compared to the state of the art, our soft liquid metal sensors faithfully recreate complex natural geometries and display excellent sensing properties suitable for validating performance in delicate manipulation tasks. We demonstrate the application of our physical twins as 'sensing corals': high-fidelity, 3D printed replicas of scanned corals that eliminate the need for live coral experimentation, whilst increasing data quality, offering an ethical and scalable pathway for advancing autonomous coral handling and soft manipulation broadly. Through extensive bench-top manipulation and underwater grasping experiments, we show that our sensing coral is able to detect grasps under 0.5 N, effectively capturing the delicate interactions and light contact forces required for coral handling. Finally, we showcase the value of our physical twins across two demonstrations: (i) automated coral labelling for lab identification and (ii) robotic coral aquaculture. Sensing physical twins such as ours can provide richer grasping feedback than conventional sensors, providing experimental validation prior to deployment in handling fragile and delicate items.

Abstract:
Controlling robots with strongly nonlinear, high-dimensional dynamics remains challenging, as direct nonlinear optimization with safety constraints is often intractable in real time. The Koopman operator offers a way to represent nonlinear systems linearly in a lifted space, enabling the use of efficient linear control. We propose a data-driven framework that learns a Koopman embedding and operator from data, and integrates the resulting linear model with the Safe Set Algorithm (SSA). This allows the tracking and safety constraints to be solved in a single quadratic program (QP), ensuring feasibility and optimality without a separate safety filter. We validate the method on a Kinova Gen3 manipulator and a Go2 quadruped, showing accurate tracking and obstacle avoidance.
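
The idea of representing a nonlinear system linearly in a lifted space can be shown on a small hand-constructed example. The system, its coefficients, and the lifting function below are assumptions chosen because the lifting is exact; the paper instead learns the embedding and operator from data, which this sketch does not attempt.

```python
def nonlinear_step(x, a=0.9, b=0.5, c=1.0):
    """Discrete-time nonlinear system: x1+ = a*x1, x2+ = b*x2 + c*x1^2."""
    x1, x2 = x
    return (a * x1, b * x2 + c * x1 * x1)


def lift(x):
    """Lifted state z = (x1, x2, x1^2); the extra coordinate absorbs the nonlinearity."""
    x1, x2 = x
    return [x1, x2, x1 * x1]


def koopman_matrix(a=0.9, b=0.5, c=1.0):
    """Linear operator acting on the lifted state; (x1^2)+ = a^2 * x1^2."""
    return [[a, 0.0, 0.0],
            [0.0, b, c],
            [0.0, 0.0, a * a]]


def linear_step(K, z):
    return [sum(K[r][k] * z[k] for k in range(3)) for r in range(3)]


# Roll both models forward from the same initial state.
x = (1.5, -0.4)
z = lift(x)
K = koopman_matrix()
for _ in range(10):
    x = nonlinear_step(x)
    z = linear_step(K, z)
```

After ten steps the first two lifted coordinates match the nonlinear state up to rounding, so a linear controller designed on `z` acts on an exact model of this particular system.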

Abstract:
Separating thin, flexible layers that must be individually grasped is a common but challenging manipulation primitive for most off-the-shelf grippers. A prominent example arises in clinical settings: the opening of sterile flat pouches for the preparation of the operating room, where the first step is to separate and grasp the flaps. We present a novel gripper design and opening strategy that enables reliable flap separation and robust seal opening. This capability addresses a high-volume repetitive hospital procedure in which nurses manually open up to 240 bags per shift, a physically demanding task linked to musculoskeletal injuries. Our design combines an active dented-roller fingertip with compliant fingers that exploit environmental constraints to robustly grasp thin flexible flaps. Experiments demonstrate that the proposed gripper reliably grasps and separates sealed bag flaps and other thin-layered materials from the hospital, the most sensitive variable affecting performance being the normal force applied. When two copies of the gripper grasp both flaps, the system withstands the forces needed to open the seals robustly. To our knowledge, this is one of the first demonstrations of robotic assistance to automate this repetitive, low-value, but critical hospital task.

Abstract:
Robot navigation in human-populated environments poses challenges due to the diversity of human behaviors and the unpredictability of human paths. However, existing Reinforcement Learning (RL)-based methods often rely on simulators that lack sufficient diversity in human behavior, resulting in navigation policies that overfit to specific human behaviors and perform poorly in unseen environments. To address this, we propose a diversity-aware crowd model based on Reinforcement Learning, employing Constrained Variational Exploration (VE) with a Mutual Information (MI)-based auxiliary reward to capture fine-grained behavioral diversity. The proposed model leverages a Centralized Training Decentralized Execution (CTDE) paradigm, which ensures stable exploration under multi-agent settings. Using the proposed diversity-aware model for training, we obtain robust robot navigation policies capable of handling diverse unseen scenarios. Extensive simulation and real-world experiments demonstrate the superior performance of our approach in achieving diverse crowd behaviors and enhancing robot navigation robustness. These findings highlight the potential of our method to advance safe and efficient robot operations in complex dynamic environments.

Abstract:
In this paper, we address the problem of manipulating multi-particle aggregates using a bimanual robotic system. Our approach enables the autonomous transport of dispersed particles through a series of shaping and pushing actions using robotically-controlled tools. Achieving this advanced manipulation capability presents two key challenges: high-level task planning and trajectory execution. For task planning, we leverage Vision Language Models (VLMs) to enable primitive actions such as tool affordance grasping and non-prehensile particle pushing. For trajectory execution, we represent the evolving particle aggregate's contour using truncated Fourier series, providing efficient parametrization of its closed shape. We adaptively compute trajectory waypoints based on group cohesion and the geometric centroid of the aggregate, accounting for its spatial distribution and collective motion. Through real-world experiments, we demonstrate the effectiveness of our methodology in actively shaping and manipulating multi-particle aggregates while maintaining high system cohesion.
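
The contour parametrization mentioned above can be sketched as follows. This is a minimal example that assumes the contour is given as ordered, uniformly sampled 2D points and uses the complex form of the Fourier series; the function names are hypothetical, and the paper's exact parametrization may differ.

```python
import numpy as np

def fourier_contour(points, n_harmonics):
    """Truncated complex Fourier coefficients of a closed 2D contour.

    points: (N, 2) array of ordered samples along the closed contour.
    Returns the kept coefficients and their integer harmonic indices.
    """
    z = points[:, 0] + 1j * points[:, 1]
    coeffs = np.fft.fft(z) / len(z)
    freqs = np.fft.fftfreq(len(z), d=1.0 / len(z))   # integer harmonic indices
    keep = np.abs(freqs) <= n_harmonics
    return coeffs[keep], freqs[keep]

def reconstruct(coeffs, freqs, n_samples=200):
    """Evaluate the truncated series at n_samples points along the contour."""
    s = np.linspace(0.0, 1.0, n_samples, endpoint=False)
    phase = np.exp(2j * np.pi * freqs[None, :] * s[:, None])
    z = (coeffs[None, :] * phase).sum(axis=1)
    return np.column_stack([z.real, z.imag])
```

In this form the zero-frequency coefficient is the mean of the sampled points, so a centroid estimate comes for free, and the number of harmonics controls how much contour detail is retained.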

Abstract:
In the context of safety-critical control, we propose and analyse the use of Control Barrier Functions (CBFs) to limit the kinetic energy of torque-controlled robots. The proposed scheme is able to modify a nominal control action in a minimally invasive manner to achieve the desired kinetic energy limit. We show how this safety condition is achieved by appropriately injecting damping in the underlying robot dynamics independently of the nominal controller structure. We present an extensive experimental validation of the approach on a 7-Degree of Freedom (DoF) Franka Emika Panda robot. The results demonstrate that this approach provides an effective, minimally invasive safety layer that is straightforward to implement and is robust in real experiments.
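
One common way to obtain such a filter, given here as a sketch rather than the authors' formulation, assumes rigid-body dynamics M(q)q'' + C(q, q')q' + g(q) = tau with the usual passivity property, so that the kinetic energy T changes at the rate q'^T (tau - g). Taking the barrier h = E_max - T and requiring dh/dt >= -alpha * h yields a single linear constraint on tau, and the minimum-norm correction has a closed form that acts as damping along q'. The gain `alpha` and the function name are assumptions.

```python
import numpy as np

def energy_limiting_filter(tau_nom, dq, M, g, E_max, alpha=5.0):
    """Minimally modify tau_nom so that dT/dt <= alpha * (E_max - T).

    Closed-form solution of: min ||tau - tau_nom||^2
                             s.t. dq^T (tau - g) <= alpha * (E_max - T).
    """
    T = 0.5 * dq @ M @ dq                 # kinetic energy
    h = E_max - T                         # barrier: h >= 0 means within the limit
    power = dq @ (tau_nom - g)            # dT/dt under the nominal torque
    violation = power - alpha * h
    speed_sq = dq @ dq
    if violation <= 0.0 or speed_sq < 1e-12:
        return tau_nom.copy()             # nominal torque already satisfies the constraint
    damping = violation / speed_sq        # scalar damping gain, always positive here
    return tau_nom - damping * dq
```

The correction is proportional to the joint velocity and vanishes whenever the nominal torque already satisfies the constraint, which matches the damping-injection and minimal-invasiveness description in the abstract.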

Abstract:
Small multirotors demonstrate significant potential due to their simple airframe and human-friendly operation. However, the reduced size results in substantially higher energy consumption, which severely limits their flight endurance and restricts their range of applications. Ornithopters, while offering better aerodynamic efficiency, experience energy losses due to the mechanical complexity required to generate reciprocating motion. In this work, inspired by the samara, we present a lightweight aircraft with an exceptionally simple design featuring a single actuator and a mono airfoil. To optimize the flight configuration for minimal power consumption, we employed a surrogate optimization method that integrates spinning airfoil dynamics, motor-propeller efficiency, and hovering equilibrium. As a result, the proposed vehicle achieves position-controlled hovering flight for up to 26 minutes with a takeoff weight of only 32 grams. Its superior power efficiency is demonstrated by a high power loading of 9.1 grams per watt. Compared to state-of-the-art systems, the proposed design shows significant improvements in both flight endurance and power efficiency. The reliable and stable position-holding flight over an extended period further validates the effectiveness of the proposed methods and the practical applicability of the fabricated prototype.

Abstract:
We propose CLEVER, an active learning system for robust semantic perception with Deep Neural Networks (DNNs). For data arriving in streams, our system seeks human support when encountering failures and adapts DNNs online based on human instructions. In this way, CLEVER can eventually accomplish the given semantic perception tasks. Our main contribution is the design of a system that meets several desiderata of realizing the aforementioned capabilities. The key enabler herein is our Bayesian formulation that encodes domain knowledge through priors. Empirically, we not only motivate CLEVER's design but further demonstrate its capabilities with a user validation study as well as experiments on humanoid and deformable objects. To our knowledge, we are the first to realize stream-based active learning on a real robot, providing evidence that the robustness of the DNN-based semantic perception can be improved in practice. The project website can be accessed at https://sites.google.com/view/thecleversystem.

Abstract:
This article presents a motion planning and control framework for flexible robotic manipulators, integrating deep reinforcement learning (DRL) with a nonlinear partial differential equation (PDE) controller. Unlike conventional approaches that focus solely on control, we demonstrate that the desired trajectory significantly influences endpoint vibrations. To address this, a DRL motion planner, trained using the soft actor-critic (SAC) algorithm, generates optimized trajectories that inherently minimize vibrations. The PDE nonlinear controller then computes the required torques to track the planned trajectory while ensuring closed-loop stability using Lyapunov analysis. The proposed methodology is validated through both simulations and real-world experiments, demonstrating superior vibration suppression and tracking accuracy compared to traditional methods. The results underscore the potential of combining learning-based motion planning with model-based control for enhancing the precision and stability of flexible robotic manipulators.

Abstract:
Multi-object tracking (MOT) plays a critical role in applications such as autonomous driving and surveillance. Camera-based approaches offer rich texture features for object association, while LiDAR-based methods provide accurate geometric information for spatial reasoning. Although each modality addresses different challenges, their intrinsic discrepancies hinder effective cross-modal fusion and unified representation learning. To overcome these limitations, we propose IMH-MOT, an interactive multi-hierarchical MOT framework comprising three key modules. The Multi-modality Alignment Module (MMAM) enhances spatial representations by sampling and clustering instance-level point clouds. The Multi-modality Motion Estimation Module (MMEM) integrates motion cues from different modalities to build a unified motion model. To mitigate the impact of occlusion on single-frame appearance features, the Long-term Appearance Module (LAM) captures temporal appearance consistency by constructing a long-term appearance embedding. Guided by modality-aware cues from MMAM, MMEM generates reliable spatial representations, while LAM encodes robust long-term appearance features. These components are jointly integrated through a Multi-hierarchical Data Association (MHDA) strategy, enabling stable and accurate tracking. Extensive experiments on the KITTI MOT benchmark demonstrate the effectiveness of our framework, achieving 80.90% HOTA, 89.73% MOTA, and 470 IDSW, outperforming state-of-the-art methods in both standard and challenging scenarios.

Abstract:
We address the challenge of learning to manipulate deformable objects with unknown dynamics. In non-rigid objects, the dynamics parameters define how they react to interactions (how they stretch, bend, compress, and move), and they are critical to determining the optimal actions to perform a manipulation task successfully. In other robotic domains, such as legged locomotion and in-hand rigid object manipulation, state-of-the-art approaches can handle unknown dynamics using Rapid Motor Adaptation (RMA). Through a supervised procedure in simulation that encodes each rigid object's dynamics, such as mass and position, these approaches learn a policy that conditions actions on a vector of latent dynamic parameters inferred from sequences of state-actions. However, in deformable object manipulation, the object's dynamics include not only its mass and position but also how the shape of the object changes. Our key insight is that the recent ground-truth particle positions of a deformable object in simulation capture changes in the object's shape, making it possible to extend RMA to deformable object manipulation. This key insight allows us to develop RAPiD, a two-phase method that learns to perform real-robot deformable object mobile manipulation by: 1) learning a visuomotor policy conditioned on the object's dynamics embedding, which is encoded from the object's privileged information in simulation, such as its mass and ground-truth particle positions, and 2) learning to infer this embedding using non-privileged information instead, such as robot visual observations and actions, so that the learned policy can transfer to the real world. On a mobile manipulator with 22 degrees of freedom, RAPiD achieves success rates above 80% across two vision-based deformable object mobile manipulation tasks in the real world, under unseen object dynamics, categories, and instances.

Abstract:
Autonomous aerial robots are increasingly being deployed in real-world scenarios, where transparent glass obstacles present significant challenges to reliable navigation. Researchers have investigated the use of non-contact sensors and passive contact-resilient aerial vehicle designs to detect glass surfaces, which are often limited in terms of robustness and efficiency. In this work, we propose a novel approach for robust autonomous aerial navigation in unknown environments with transparent glass obstacles, combining the strengths of both sensor-based and contact-based glass detection. The proposed system begins with the incremental detection and information maintenance about potential glass surfaces using visual sensor measurements. The vehicle then actively engages in touch actions with the visually detected potential glass surfaces using a pair of lightweight contact-sensing modules to confirm or invalidate their presence. Following this, the volumetric map is efficiently updated with the glass surface information and safe trajectories are replanned on the fly to circumvent the glass obstacles. We validate the proposed system through real-world experiments in various scenarios, demonstrating its effectiveness in enabling efficient and robust autonomous aerial navigation in complex real-world environments with glass obstacles.

Abstract:
Precision agriculture offers the opportunity to automate routine or difficult tasks in orchards and vineyards, such as spraying or inspection, with Unmanned Ground Vehicles (UGV). In this context, human operators should be kept in the closed-loop control of the robot for safety and reliability. This work is motivated by the challenges of effectively deploying human-robot shared control in the field. First, an asymptotically stable controller must keep the robot on the desired trajectory between rows of trees, whose distance is on the order of the robot's width. Second, the robot must efficiently avoid static and moving obstacles (e.g. a rock or a human) in its path. Third, the control inputs must not exceed the actuator limits, which can degrade the trajectory tracking performance, cause instability, or damage critical hardware. Finally, in real-life scenarios, user intervention is sometimes required to manage unpredictable situations. To overcome these challenges, we propose and deploy a shared controller that smoothly blends automatic trajectory tracking, collision avoidance, and human commands. At the same time, it guarantees the system is stable and control actions are within the actuator limits at all times. We extensively tested our approach in simulation and field experiments in an apple orchard.

Abstract:
Biological synergies have emerged as a widely adopted paradigm for dexterous hand design, enabling human-like manipulation with a small number of actuators. Nonetheless, excessive coupling tends to diminish the dexterity of hands. This paper tackles the trade-off between actuation complexity and dexterity by proposing an anthropomorphic finger topology with 4-DoFs driven by 2 actuators, and by developing an adaptive, modular dexterous hand based on this finger topology. We explore the biological basis of hand synergies and human gesture analysis, translating joint-level coordination and structural attributes into a modular finger architecture. Leveraging these biomimetic mappings, we design a five-finger modular hand and establish its kinematic model to analyze adaptive grasping and in-hand manipulation. Finally, we construct a physical prototype and conduct preliminary experiments, which validate the effectiveness of the proposed design and analysis.

Abstract:
Visual Place Recognition (VPR) enables systems to identify previously visited locations within a map, a fundamental task for autonomous navigation. Prior works have developed VPR solutions using event cameras, which asynchronously measure per-pixel brightness changes with microsecond temporal resolution. However, these approaches rely on dense representations of the inherently sparse camera output and require tens to hundreds of milliseconds of event data to predict a place. Here, we break this paradigm with Flash, a lightweight VPR system that predicts places using sub-millisecond slices of event data. Our method is based on the observation that active pixel locations provide strong discriminative features for VPR. Flash encodes these active pixel locations using efficient binary frames and computes similarities via fast bitwise operations, which are then normalized based on the relative event activity in the query and reference frames. Flash improves Recall@1 for sub-millisecond VPR over existing baselines by 11.33x on the indoor QCR-Event-Dataset and 5.92x on the 8 km Brisbane-Event-VPR dataset. Moreover, our approach reduces the duration for which the robot must operate without awareness of its position, as evidenced by a localization latency metric we term Time to Correct Match (TCM). To the best of our knowledge, this is the first work to demonstrate sub-millisecond VPR using event cameras.
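
The binary-frame comparison described above can be sketched as follows. This is a minimal illustration, not the Flash implementation: the function names are hypothetical, and the normalization by the geometric mean of the two active-pixel counts is an assumption, since the abstract only states that similarities are normalized by relative event activity.

```python
import numpy as np

# Lookup table: number of set bits in each possible byte value.
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.int64)

def to_binary_frame(events_xy, height, width):
    """Mark every pixel that fired at least one event, then bit-pack the frame."""
    frame = np.zeros((height, width), dtype=np.uint8)
    frame[events_xy[:, 1], events_xy[:, 0]] = 1      # columns are (x, y)
    return np.packbits(frame.ravel())

def popcount(packed):
    """Count the active pixels in a bit-packed frame."""
    return int(_POPCOUNT[packed].sum())

def similarity(query, reference):
    """Bitwise overlap of active pixels, normalized by event activity (assumed form)."""
    overlap = popcount(np.bitwise_and(query, reference))
    n_q, n_r = popcount(query), popcount(reference)
    if n_q == 0 or n_r == 0:
        return 0.0
    return overlap / np.sqrt(n_q * n_r)
```

A place prediction would then be the reference frame with the highest similarity to the query slice. Packing eight pixels per byte keeps the comparison to one AND and one table lookup per byte.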

Abstract:
Accurate trajectory prediction and motion planning are crucial for autonomous driving systems to navigate safely in complex, interactive environments characterized by multimodal uncertainties. However, current generation-then-evaluation frameworks typically construct multiple plausible trajectory hypotheses but ultimately adopt a single most likely outcome, leading to overconfident decisions and a lack of fallback strategies that are vital for safety in rare but critical scenarios. Moreover, the usual decoupling of prediction and planning modules could result in socially inconsistent or unrealistic joint trajectories, especially in highly interactive traffic. To address these challenges, we propose a contingency-aware diffusion planner (CoPlanner), a unified framework that jointly models multi-agent interactive trajectory generation and contingency-aware motion planning. Specifically, the pivot-conditioned diffusion mechanism anchors trajectory sampling on a validated, shared short-term segment to preserve temporal consistency, while stochastically generating diverse long-horizon branches that capture multimodal motion evolutions. In parallel, we design a contingency-aware multi-scenario scoring strategy that evaluates candidate ego trajectories across multiple plausible long-horizon evolution scenarios, balancing safety, progress, and comfort. This integrated design preserves feasible fallback options and enhances robustness under uncertainty, leading to more realistic interaction-aware planning. Extensive closed-loop experiments on the nuPlan benchmark demonstrate that CoPlanner consistently surpasses state-of-the-art methods on both Val14 and Test14 datasets, achieving significant improvements in safety and comfort under both reactive and non-reactive settings. The source code is available on GitHub.

Abstract:
Imitation learning (IL) offers a scalable framework for teaching robots complex manipulation skills from human demonstrations. However, conventional end-to-end visuomotor IL models often suffer from poor performance and robustness due to the significant modality mismatch between high-dimensional visual inputs and low-dimensional motor actions. The redundant information in RGB images, such as the color of ambient light, leads models to depend on strong yet brittle task-irrelevant priors that ultimately degrade performance across diverse visual environments. To address these limitations, we propose Motion-Aware Two-Stream Policy (MTP), a novel imitation learning architecture that explicitly incorporates motion priors via optical flow alongside RGB observations. MTP employs a two-stream perception module that separately encodes spatial (RGB) and temporal (optical flow) information. These spatial-temporal features are fused and fed into a conditional flow matching module to generate actions. We evaluate MTP extensively in both simulation and real-world robot tasks. Results show that MTP significantly outperforms state-of-the-art baselines in terms of success rate and robustness to visual perturbations, demonstrating its effectiveness in generalizable robotic manipulation. To benefit the community, our code will be released.

Abstract:
Several recently released humanoid robots, inspired by the mechanical design of Cassie, employ actuator configurations in which the motors are displaced from the joints to reduce leg inertia. While studies accounting for the full kinematic complexity have demonstrated the benefits of these designs, the associated loop-closure constraints greatly increase computational cost and limit their use in control and learning. As a result, the non-linear transmission is often approximated by a constant reduction ratio, preventing exploitation of the mechanism's full capabilities. This paper introduces a compact analytical formulation for the two standard knee and ankle mechanisms that captures the exact non-linear transmission while remaining computationally efficient. The model is fully differentiable up to second order with a minimal formulation, enabling low-cost evaluation of dynamic derivatives for trajectory optimization and of the apparent transmission impedance for reinforcement learning. We integrate this formulation into trajectory optimization and locomotion policy learning, and compare it against simplified constant-ratio approaches. Hardware experiments demonstrate improved accuracy and robustness, showing that the proposed method provides a practical means to incorporate parallel actuation into modern control algorithms.

Abstract:
Robust monocular visual Simultaneous Localization and Mapping (SLAM) serves as a cornerstone for various applications. However, its performance frequently degrades in challenging scenarios including fast motion, dynamic objects, and scale ambiguity. This paper proposes CDV-SLAM, a compact deep visual SLAM framework that unifies geometric and semantic perception through a shared visual foundation model. A tight semantic-geometric fusion network is devised to predict optical flow in fast motion. Semantic features are efficiently reused to obtain segmentation and monocular depth for dynamic object exclusion and scale acquisition. To further address scale drift, we introduce local scale correction in bundle adjustment. Experimental results demonstrate a 42% decrease in average Absolute Trajectory Error (ATE) on the KITTI dataset over the state-of-the-art. Furthermore, our flow-only visual odometry surpasses geometric-only methods on the TartanAir and EuRoC datasets, with a marginal speed reduction of 6%. Our code is publicly available at https://github.com/FrankYard/CDV-SLAM.

Abstract:
Robotic manipulation of deformable linear objects (DLOs) presents significant challenges due to complex dynamics and frequent self-occlusions. Existing robotic knot tying methods typically rely on precise topological state tracking with ordered keypoints and explicit edge connectivity. This reliance makes them prone to failures due to tracking drift and topology mismatch caused by repeated bending and crossings during knot formation. To address these limitations, we introduce RoboHitch, a novel framework that learns to perform hitch knot tying from human demonstrations using only disordered 3D keypoints and RGB images. This eliminates the need for explicit topological order, allowing for more flexible manipulation. Our method employs a dynamic Graph Autoencoder to extract geometric features from untracked keypoints, complemented by a Convolutional Autoencoder that captures essential visual context. A bidirectional cross-attention mechanism then fuses these modalities to jointly predict pick and place affordances, facilitating implicit reasoning about the rope's state and enabling knot tying under occlusion. Real-world experiments demonstrate the effectiveness and generalizability of our approach, successfully completing hitch knots in scenarios with self-occlusions.

Abstract:
Imitation learning is a popular paradigm to teach robots new tasks, but collecting robot demonstrations through teleoperation or kinesthetic teaching is tedious and time-consuming. In contrast, directly demonstrating a task using our human embodiment is much easier and data is available in abundance, yet transfer to the robot can be non-trivial. In this work, we propose Real2Gen to train a manipulation policy from a single human demonstration. Real2Gen extracts required information from the demonstration and transfers it to a simulation environment, where a programmable expert agent can demonstrate the task arbitrarily many times, generating an unlimited amount of data to train a flow matching policy. We evaluate Real2Gen on human demonstrations from three different real-world tasks and compare it to a recent baseline. Real2Gen shows an average increase in the success rate of 26.6% and better generalization of the trained policy due to the abundance and diversity of training data. We further deploy our purely simulation-trained policy zero-shot in the real world. We make the data, code, and trained models publicly available at real2gen.cs.uni-freiburg.de.

Abstract:
Unmanned aerial vehicles (UAVs) have been widely employed to achieve autonomous exploration of unknown 3D environments. However, most existing algorithms suffer from low exploration efficiency caused by inaccurate motion time cost evaluation, which typically leads to motion inconsistency during the UAV flight. In this work, we propose a learning-based motion time prediction method that accurately evaluates the motion time costs to candidate viewpoints in real time. Specifically, the prediction method takes the current state of the UAV and its surrounding environment features as input to predict the arrival time to each viewpoint. Based on the motion time cost prediction, the UAV can minimize the time wasted by unnecessary acceleration and deceleration during exploration. To further improve the efficiency, we also develop an optimal exploration target decision algorithm that benefits from the predicted motion time costs and the adaptive upper-bound constraints. Simulation and real-world experiments demonstrate that our method can significantly improve the exploration efficiency and increase the average flight speed of the UAV.

Abstract:
Multi-robot motion planning and crowd simulations are crucial in social navigation, enabling agents to avoid collisions with one another in dynamic environments. While existing methods typically use simple circular models for robot and pedestrian boundaries, superquadric models offer greater flexibility in accurately representing non-circular objects. This paper addresses the challenges of employing superquadric models to avoid dynamic obstacles and other moving robots. We tackle three primary challenges: (i) approximating the complex parametric boundary surface of the Minkowski sum for easier differentiation; (ii) computing the boundary of velocity obstacles; and (iii) rapidly calculating velocity changes. The approximation of the differentiable Minkowski sum boundary is formulated as a semidefinite programming problem using convex sum-of-squares polynomials. We then develop a tangency point-finding algorithm with superlinear convergence speed, and introduce a rule-based collision-avoidance approach, named SSCA (Superquadric-based Sum-of-Squares Collision Avoidance for Multi-Robot Systems), for efficient velocity change calculation. Our proposed method is evaluated through extensive experiments, demonstrating millisecond-level computational efficiency and scalability to dozens of robots. This work provides a more effective collision avoidance solution using superquadric models, enhancing the safety and performance of robots in dynamic shared environments.

Abstract:
In this paper, we combine first-order approximations of hybrid systems (i.e., the so-called saltation matrix) with previous works on parametric sensitivity for continuous systems to propose a general framework for robust trajectory optimization of hybrid systems subject to parametric uncertainties. A method for computing parametric sensitivities of both continuous dynamics and hybrid events is presented. The obtained "hybrid parametric sensitivity" is then combined with sensitivity-based tubes that encapsulate all possible perturbed states and control trajectories given a known bounded range for the uncertain parameters. The proposed method is then applied to the problem of planning robust trajectories for legged robot systems, which allows obtaining trajectories that remain feasible w.r.t. the contact constraints even in the presence of uncertainties in the dynamics, guard conditions, and reset maps. We also illustrate one of the fundamental limitations of first-order approximations, that is, the fact that the sensitivity reset time is fixed, and propose an extension to the sensitivity analysis that can form the basis for future developments.

Abstract:
We present a fast and reactive grasping framework that combines task-space velocity fields with a joint-space Quadratic Program (QP) in a hierarchical structure. Reactive, collision-free global motion planning is particularly challenging for high-DoF systems, as simultaneous increases in state dimensionality and planning horizon trigger a combinatorial explosion of the search space, making real-time planning intractable. To address this, we plan globally in a lower-dimensional task space, such as fingertip positions, and track locally in the full joint space while enforcing all constraints. This approach is realized by constructing velocity fields in multiple task-space coordinates (or, in some cases, a subset of joint coordinates) and solving a weighted joint-space QP to compute joint velocities that track these fields with appropriately assigned priorities. Through simulation experiments and real-world tests using the recent pose-tracking algorithm FoundationPose, we verify that our method enables high-DoF arm-hand systems to perform real-time, collision-free reaching motions while adapting to dynamic environments and external disturbances.

Abstract:
Underwater robots have significant potential for a wide range of applications, including deep-sea exploration, hydrocarbon extraction, marine biodiversity observation, and waste retrieval. A hybrid actuation system that combines electromagnets and permanent magnets preserves the main benefits of magnetic-driven robots, addressing the issues of bulky coil systems and limited mobility. However, most electromagnetically actuated underwater robots are limited to a fixed swimming mode due to their relatively simple designs, which restrict their adaptability to unpredictable and unstructured aquatic environments. In this work, we present an untethered multi-mode swimming robot driven by four 2-degrees-of-freedom (DoF) electromagnetic actuators, each with a rigid shell interconnected by flexible connectors and covered with silicone membranes. Initially, we conducted tests to determine the optimal hardness of the flexible connector by validating the module's range of motion across different activation times. Next, we demonstrated that the robot can swim forward and backward in a water tank, exhibiting snake-inspired motion, front- and rear-undulation, and wave-shaped motion, and reaching a maximum speed of 87.8 mm/s. Finally, we showed the lateral translation and steering motions achieved with different control signals, resulting in an average turning speed of 3°/s. This approach enables a novel robot design strategy based on compact multi-DoF electromagnetic modules, facilitating potential applications in search-and-rescue missions and environmental inspections.

Abstract:
Thruster failures in unmanned surface vehicles (USVs) can critically compromise mission completion, particularly when severe degradation eliminates controllability in essential degrees of freedom. While traditional fault-tolerant control treats environmental disturbances as impediments to be rejected, this paper presents a novel approach: strategically exploiting wind and wave forces as virtual actuators for emergency harbor return. The proposed environment-assisted model predictive control (EAMPC) framework adaptively modulates environmental force utilization factors based on fault severity and the environmental force prediction confidence, transforming natural disturbances into environmental assistance. The hierarchical architecture integrates state estimation and prediction with physics-informed learning for short-term environmental forces, and reachability-based trajectory planning that exploits environmental forces to expand feasible zones. Theoretical analysis establishes practical input-to-state stability with explicit bounds quantifying degradation. Extensive validation across 320 trials demonstrates 91.25% mission success under 95% thruster degradation compared to 0% for baseline methods. This work demonstrates that strategic environmental exploitation fundamentally transforms fault recovery capabilities in marine robotics.

Abstract:
Learning whole-body mobile manipulation via imitation is essential for generalizing robotic skills to diverse environments and complex tasks. However, this goal is hindered by significant challenges, particularly in effectively processing complex observation, achieving robust generalization, and generating coherent actions. To address these issues, we propose DSPv2, a novel policy architecture. DSPv2 introduces an effective encoding scheme that aligns 3D spatial features with multi-view 2D semantic features. This fusion enables the policy to achieve broad generalization while retaining the fine-grained perception necessary for precise control. Furthermore, we extend the Dense Policy paradigm to the whole-body mobile manipulation domain, demonstrating its effectiveness in generating coherent and precise actions for the whole-body robotic platform. Extensive experiments show that our method significantly outperforms existing approaches in both task performance and generalization ability. Project page is available at: https://selen-suyue.github.io/DSPv2Net/.

Abstract:
Cross-embodiment dexterous grasp synthesis refers to adaptively generating and optimizing grasps for various robotic hands with different morphologies. This capability is crucial for achieving versatile robotic manipulation in diverse environments and requires substantial amounts of reliable and diverse grasp data for effective model training and robust generalization. However, existing approaches either rely on physics-based optimization that lacks human-like kinematic understanding or require extensive manual data collection processes that are limited to anthropomorphic structures. In this paper, we propose CEDex, a novel cross-embodiment dexterous grasp synthesis method at scale that bridges human grasping kinematics and robot kinematics by aligning robot kinematic models with generated human-like contact representations. Given an object's point cloud and an arbitrary robotic hand model, CEDex first generates human-like contact representations using a Conditional Variational Auto-encoder pretrained on human contact data. It then performs kinematic human contact alignment through topological merging to consolidate multiple human hand parts into unified robot components, followed by a signed distance field-based grasp optimization with physics-aware constraints. Using CEDex, we construct the largest cross-embodiment grasp dataset to date, comprising 500K objects across four gripper types with 20M total grasps. Extensive experiments show that CEDex outperforms state-of-the-art approaches and our dataset benefits cross-embodiment grasp learning with high-quality diverse grasps.

Abstract:
Multi-robot cooperative control has been extensively studied using model-based distributed control methods. However, such methods rely on sensing and perception modules arranged in a sequential pipeline, and the separation of perception and control may cause processing latency and compounding errors that degrade control performance. End-to-end learning overcomes this limitation by learning directly from onboard sensing data and outputting control commands to the robots. Challenges remain in end-to-end learning for multi-robot cooperative control, and previous results are not scalable. In this paper, we propose a novel decentralized cooperative control method for multi-robot formation using deep neural networks, in which inter-robot communication is modeled by a graph neural network (GNN). Our method takes LIDAR sensor data as input, and the control policy is learned in a decentralized way from demonstrations provided by an expert controller. Although trained with a fixed number of robots, the learned control policy is scalable. Evaluation in a robot simulator demonstrates the triangulation formation behavior of multi-robot teams of varying sizes using the learned control policy.

Abstract:
Touch is a fundamental modality for conveying emotions and intentions in Human-Robot Interaction. However, conventional approaches to touch pattern recognition often lack robustness to inter-user variability, whereas alternative solutions are frequently bulky or costly. This study proposes a novel feature extraction framework for touch pattern recognition, which adapts MFCC from speech processing to capacitive touch signals. The proposed method preserves the strengths of MFCC (dimensionality reduction and noise robustness) while addressing the physical differences between audio and touch signals by introducing a new frequency reference axis in place of the conventional Mel scale. To evaluate its effectiveness, a representative set of social touch patterns, including gestures traditionally difficult to classify, was defined and analyzed. The proposed framework ensures stable recognition across diverse users while reducing feature dimensionality for efficient operation in lightweight models. This efficiency highlights its suitability for real-time robotic interfaces.

Abstract:
Robotic pushing is a versatile non-prehensile manipulation skill that enables robots to handle ungraspable objects without specialized tools. This paper introduces a contact-aware, goal-oriented pushing framework that achieves dexterous and robust manipulation by explicitly allowing free motion of the end-effector. Central to our approach is the contact-aware generalized velocity-motion model (C-GVMM), which captures the relationship between pusher velocity and slider motion across all contact modes, including separation. Unlike prior methods that rely on predefined trajectories or fixed contact-mode sequences, our framework enables seamless transitions among sticking, sliding, and separating modes. Building upon C-GVMM, we employ Model Predictive Path Integral (MPPI) control to generate goal-directed actions, and UKF-based online estimation to handle uncertain object properties in real-world settings. We validate our approach through both numerical simulations and real-robot experiments, demonstrating that the framework accomplishes diverse pushing tasks with high success rates and more optimal pusher and slider motion. These results demonstrate the practical viability of the proposed approach for real-world robotic pushing tasks.
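
The MPPI step named in this abstract can be illustrated with a minimal sketch of the generic importance-weighted update. This is not the authors' controller: the scalar controls, the function names `mppi_weights` and `mppi_update`, and the temperature `lam` are assumptions made for illustration, and the rollout costs are taken as given rather than computed from C-GVMM.

```python
import math

def mppi_weights(costs, lam=1.0):
    """Softmin weights over sampled rollouts (lower cost gives higher weight)."""
    c_min = min(costs)  # subtract the minimum for numerical stability
    unnorm = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(unnorm)
    return [w / total for w in unnorm]

def mppi_update(u_nom, noises, costs, lam=1.0):
    """Shift the nominal control sequence by the cost-weighted average noise.

    u_nom:  nominal scalar controls, one per horizon step (hypothetical layout)
    noises: one perturbation sequence per sampled rollout
    costs:  total cost of each perturbed rollout
    """
    w = mppi_weights(costs, lam)
    return [
        u_nom[t] + sum(w[k] * noises[k][t] for k in range(len(noises)))
        for t in range(len(u_nom))
    ]
```

In a full controller, each cost would come from rolling the perturbed controls through the motion model, and only the first updated control would be applied before replanning.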

Abstract:
Model Predictive Control (MPC) is widely used for torque-controlled robots, but classical formulations often neglect real-time force feedback and struggle with contact-rich industrial tasks under collision constraints. Deburring in particular requires precise tool insertion, stable force regulation, and collision-free circular motions in challenging configurations, which exceeds the capability of standard MPC pipelines. We propose a framework that integrates force-feedback MPC with diffusion-based motion priors to address these challenges. The diffusion model serves as a memory of motion strategies, providing robust initialization and adaptation across multiple task instances, while MPC ensures safe execution with explicit force tracking, torque feasibility, and collision avoidance. We validate our approach on a torque-controlled manipulator performing industrial deburring tasks. Experiments demonstrate reliable tool insertion, accurate normal force tracking, and circular deburring motions even in hard-to-reach configurations and under obstacle constraints. To our knowledge, this is the first integration of diffusion motion priors with force-feedback MPC for collision-aware, contact-rich industrial tasks.

Abstract:
This paper tackles decentralized continuous task allocation in heterogeneous multi-agent systems. We present a novel framework, HIPPO-MAT, that integrates graph neural networks (GNN) employing a GraphSAGE architecture to compute independent embeddings on each agent with an Independent Proximal Policy Optimization (IPPO) approach for multi-agent deep reinforcement learning. In our system, unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) share aggregated observation data via communication channels while independently processing these inputs to generate enriched state embeddings. This design enables dynamic, cost-optimal, conflict-aware task allocation in a 3D grid environment without the need for centralized coordination. A modified A* path planner is incorporated for efficient routing and collision avoidance. Simulation experiments demonstrate scalability with up to 30 agents, and preliminary real-world validation on JetBot ROS AI Robots, each running its model on a Jetson Nano and communicating through an ESP-NOW protocol using ESP32-S3, confirms the practical viability of the approach, which incorporates simultaneous localization and mapping (SLAM). Experimental results show that our method achieves a high 92.5% conflict-free success rate, with only a 16.49% performance gap compared to the centralized Hungarian method, while outperforming the heuristic decentralized baseline based on a greedy approach. Additionally, the framework exhibits scalability with up to 30 agents with allocation processing of 0.32 simulation step time and robustness in responding to dynamically generated tasks.
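
To make the comparison against a centralized optimum and a greedy baseline concrete, the following minimal sketch computes both on a tiny cost matrix. It is not HIPPO-MAT. The exhaustive search stands in for the Hungarian method (it reaches the same optimum but is only feasible for very small problems), and the gap formula, the cost matrix, and all function names are assumptions for illustration.

```python
from itertools import permutations

def greedy_assignment(cost):
    """Agents pick, in index order, their cheapest still-unassigned task."""
    free = set(range(len(cost)))
    assignment = []
    for agent in range(len(cost)):
        task = min(free, key=lambda t: (cost[agent][t], t))
        free.remove(task)
        assignment.append(task)
    return assignment

def optimal_assignment(cost):
    """Exhaustive search over all one-to-one assignments (tiny problems only)."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[a][p[a]] for a in range(n)))
    return list(best)

def total_cost(cost, assignment):
    """Sum of agent-to-task costs for a given assignment."""
    return sum(cost[a][t] for a, t in enumerate(assignment))

def gap_percent(cost, assignment, reference):
    """Relative extra cost of `assignment` over `reference`, in percent (assumed definition)."""
    ref = total_cost(cost, reference)
    return 100.0 * (total_cost(cost, assignment) - ref) / ref

# Hypothetical 3-agent, 3-task cost matrix (rows: agents, columns: tasks).
example_cost = [[1, 2, 9],
                [1, 5, 9],
                [9, 9, 3]]
```

On this matrix the greedy rule lets the first agent take the task that the second agent needed, which is the kind of conflict a coordinated allocation avoids.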

Abstract:
Autonomous robots must reason about the physical consequences of their actions to operate effectively in unstructured, real-world environments. We present Scan, Materialize, Simulate (SMS), a unified framework that combines 3D Gaussian Splatting for accurate scene reconstruction, visual foundation models for semantic segmentation, vision-language models for material property inference, and physics simulation for reliable prediction of action outcomes. By integrating these components, SMS enables generalizable physical reasoning and object-centric planning without the need to relearn foundational physical dynamics. We empirically validate SMS in a billiards-inspired manipulation task and a challenging quadrotor landing scenario, demonstrating robust performance on both simulated domain transfer and real-world experiments. Our results highlight the potential of bridging differentiable rendering for scene reconstruction, foundation models for semantic understanding, and physics-based simulation to achieve physically grounded robot planning across diverse settings. Our project page with additional materials can be found at https://sites.google.com/view/scan-materialize-simulate.

Abstract:
Gliding offers small fixed-wing UAVs extended endurance and silent operation but requires accurate energy management, especially under wind disturbances and obstacle constraints. Traditional controllers based on Total Energy Control Systems regulate the trade between potential and kinetic energy reactively, often requiring fine-tuning and knowledge of trim conditions. In this work, we shift the regulation to the planning level and present a nonlinear, multi-cost trajectory planner for small UAV gliders. The method generates $\mathcal{C}^3$-continuous trajectories based on Bernstein polynomials, mapped into control commands through differential flatness, and re-planned online to match experimentally derived sink polar curves. A simulated netto variometer is integrated into the optimization to estimate air mass motion, constraining the glide to energy-balanced states. Consecutive gliding trajectories are linked by cruising segments computed through trajectories initialized on Dubins path-based waypoints, enabling hybrid missions that combine powered and unpowered flight. The approach is validated in CFD simulations and real-world experiments with a fixed-wing platform, showing reliable stabilization of sink rate, airspeed, and glide ratio under wind gusts and in the presence of obstacles.
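
As a minimal sketch of the Bernstein-polynomial parameterization mentioned in this abstract (not the authors' planner), a trajectory segment can be evaluated from a list of control points, and its derivative is again a Bernstein curve of one degree lower. The function names and the example control points are hypothetical.

```python
from math import comb

def bernstein_basis(n, i, t):
    """i-th Bernstein basis polynomial of degree n, evaluated at t in [0, 1]."""
    return comb(n, i) * (t ** i) * ((1.0 - t) ** (n - i))

def bernstein_curve(control_points, t):
    """Evaluate a Bernstein (Bezier) curve at t from a list of coordinate tuples."""
    n = len(control_points) - 1
    point = [0.0] * len(control_points[0])
    for i, cp in enumerate(control_points):
        b = bernstein_basis(n, i, t)
        for d, value in enumerate(cp):
            point[d] += b * value
    return tuple(point)

def bernstein_derivative_points(control_points):
    """Control points of the derivative curve with respect to t (degree n - 1)."""
    n = len(control_points) - 1
    return [tuple(n * (q - p) for p, q in zip(a, b))
            for a, b in zip(control_points, control_points[1:])]

# Hypothetical glide segment: (x, y, altitude) control points in metres.
glide_points = [(0.0, 0.0, 100.0), (50.0, 0.0, 95.0),
                (100.0, 20.0, 90.0), (150.0, 40.0, 85.0)]
```

Because the curve starts at the first control point and ends at the last, and its derivatives at the endpoints depend only on the neighbouring control points, continuity between consecutive segments can be imposed as simple constraints on control points.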

Abstract:
We present a computational framework for simulating filaments interacting with rigid bodies through contact. Filaments are challenging to simulate due to their codimensionality, i.e., they are one-dimensional structures embedded in three-dimensional space. Existing methods often assume that filaments remain permanently attached to rigid bodies. Our framework unifies discrete elastic rod (DER) modeling, a pressure field patch contact model, and a convex contact formulation to accurately simulate frictional interactions between slender filaments and rigid bodies - capabilities not previously achievable. Owing to the convex formulation of contact, each time step can be solved to global optimality, guaranteeing complementarity between contact velocity and impulse. We validate the framework by assessing the accuracy of frictional forces and comparing its physical fidelity against baseline methods. Finally, we demonstrate its applicability in both soft robotics, such as a stochastic filament-based gripper, and deformable object manipulation, such as shoelace tying, providing a versatile simulator for systems involving complex filament-filament and filament-rigid body interactions.

Abstract:
We present an algorithm for planning trajectories that avoid obstacles and satisfy key-door precedence specifications expressed with a fragment of signal temporal logic. Our method includes a novel exact convex partitioning of the obstacle-free space that encodes connectivity among convex free space sets, key sets, and door sets. We then construct an augmented graph of convex sets that exactly encodes the key-door precedence specifications. By solving a shortest path problem in this augmented graph of convex sets, our pipeline provides an exact solution up to a finite parameterization of the trajectory. To illustrate the effectiveness of our approach, we present a method to generate key-door mazes that provide challenging problem instances, and we perform numerical experiments to evaluate the proposed pipeline. Our pipeline is faster by several orders of magnitude than recent state-of-the-art methods that use general purpose temporal logic tools.

Abstract:
Reliable insertion of industrial connectors remains a central challenge in robotics, requiring sub-millimeter precision under uncertainty and often without full visual access. Vision-based approaches struggle with occlusion and limited generalization, while learning-based policies frequently fail to transfer to unseen geometries. To address these limitations, we leverage tactile sensing, which captures local surface geometry at the point of contact and thus provides reliable information even under occlusion and across novel connector shapes. Building on this capability, we present Touch2Insert, a tactile-based framework for arbitrary peg insertion. Our method reconstructs cross-sectional geometry from high-resolution tactile images and estimates the relative pose of the hole with respect to the peg in a zero-shot manner. By aligning reconstructed shapes through registration, the framework enables insertion from a single contact without task-specific training. To evaluate its performance, we conducted experiments with three diverse connectors in both simulation and real-robot settings. The results indicate that Touch2Insert achieved sub-millimeter pose estimation accuracy for all connectors in simulation, and attained an average success rate of 86.8% on the real robot, thereby confirming the robustness and generalizability of tactile sensing for real-world robotic connector insertion.

Abstract:
Robotic manipulators excel in structured environments but face substantial challenges in unstructured and dynamic settings. This paper presents SplatCtrl, a unified framework for real-time scene reconstruction and reactive robot motion generation to enable collision-free robotic arm control in previously unseen and continuously changing environments. Building on 3D Gaussian Splatting (3D-GS), we introduce a hybrid voxel-based filtering and dynamic Gaussian relocation strategy that supports efficient scene reconstruction from RGBD streams while accommodating environmental changes. For safe and reactive control, we further propose a method for deriving continuous signed distance functions from isotropic Gaussians, providing stable and differentiable collision probability estimates that bridge classical distance fields with the modern implicit representation. These continuous distance metrics are incorporated into control barrier functions, resulting in a unified perception-action coupling framework that supports smooth and reliable real-time motion generation in response to scene changes. Experimental validation in simulation, on a physical robot, and within a shared human-robot workspace demonstrates the framework's effectiveness, achieving integrated scene reconstruction and reactive control in uncertain and dynamic environments.

Abstract:
The programmable assembly and actuation of micro- and nanostructures remain key challenges in the development of micro-robotics. This work presents a programmable assembly and cooperative actuation strategy for heterogeneous microspheres based on optoelectronic tweezers (OET). By employing Ag-PS microspheres as actuators and PS microspheres as payloads, we constructed stable actuator-payload units and investigated their frequency response and dynamic characteristics. The proposed method enables controlled assembly into core-satellite and satellite-core configurations with tunable coordination angles. Furthermore, the cooperative effect of the dual actuating units was revealed, enabling the composite system to maintain a continuous and precise circular trajectory following a ring-shaped light pattern. In addition, the modular assembly strategy was used to construct chain-like structures exceeding 172 μm in length, thereby confirming the approach's scalability. This work expands the application of OET from particle transport to modular microstructure construction and multi-actuator cooperative control, offering new opportunities for designing microbotic systems and their biomedical applications.

Abstract:
Vision-Language-Action models (VLAs) mark a major shift in robot learning. They replace specialized architectures and task-tailored components of expert policies with large-scale data collection and setup-specific fine-tuning. In this machine learning-focused workflow that is centered around models and scalable training, traditional robotics software frameworks become a bottleneck, while robot simulations offer only limited support for transitioning from and to real-world experiments. In this work, we close this gap by introducing Robot Control Stack (RCS), a lean ecosystem designed from the ground up to support research in robot learning with large-scale generalist policies. At its core, RCS features a modular and easily extensible layered architecture with a unified interface for simulated and physical robots, facilitating sim-to-real transfer. Despite its minimal footprint and dependencies, it offers a complete feature set, enabling both real-world experiments and large-scale training in simulation. Our contribution is twofold: First, we introduce the architecture of RCS and explain its design principles. Second, we evaluate its usability and performance along the development cycle of VLA and RL policies. Our experiments also provide an extensive evaluation of Octo, OpenVLA, and Pi-Zero on multiple robots and shed light on how simulation data can improve real-world policy performance. Our code, datasets and videos are available at https://robotcontrolstack.github.io/

Abstract:
Arm end-effector stabilization is essential for humanoid loco-manipulation tasks, yet it remains challenging due to the high degrees of freedom and inherent dynamic instability of bipedal robot structures. Previous model-based controllers achieve precise end-effector control but rely on precise dynamics modeling and estimation, which often struggle to capture real-world factors (e.g., friction and backlash) and thus degrade in practice. On the other hand, learning-based methods can better mitigate these factors via exploration and domain randomization, and have shown potential in real-world use. However, they often overfit to training conditions, requiring retraining with the entire body, and still struggle to adapt to unseen scenarios. To address these challenges, we propose a novel stable end-effector control (SEEC) framework with model-enhanced residual learning that learns to achieve precise and robust end-effector compensation for lower-body induced disturbances through model-guided reinforcement learning (RL) with a perturbation generator. This design allows the upper-body policy to achieve accurate end-effector stabilization as well as adapt to unseen locomotion controllers with no additional training. We validate our framework in different simulators and transfer trained policies to the Booster T1 humanoid robot. Experiments demonstrate that our method consistently outperforms baselines and robustly handles diverse and demanding loco-manipulation tasks.

Abstract:
Driving without considering the preferred separation distance from surrounding vehicles may cause discomfort for users. To address this limitation, we propose a planning framework that explicitly incorporates user preferences regarding the desired level of safe clearance from surrounding vehicles. We design a questionnaire purposefully tailored to capture user preferences relevant to our framework, while minimizing unnecessary questions. Specifically, the questionnaire considers various interaction-relevant factors, including the size, speed, position, and maneuvers of surrounding vehicles, as well as the maneuvers of the ego vehicle. The response indicates the user-preferred clearance for the scenario defined by the question and is incorporated as constraints in the optimal control problem. However, it is impractical to account for all possible scenarios that may arise in a driving environment within a single optimal control problem, as the resulting computational complexity renders real-time implementation infeasible. To overcome this limitation, we approximate the original problem by decomposing it into multiple subproblems, each dealing with one fixed scenario. We then solve these subproblems in parallel and select one using the cost function from the original problem. To validate our work, we conduct simulations using different user responses to the questionnaire. We assess how effectively our planner reflects user preferences compared to preference-agnostic baseline planners by measuring preference alignment.

Abstract:
Shared autonomy blends operator intent with autonomous assistance. In cluttered environments, linear blending can produce unsafe commands even when each source is individually collision-free. Many existing approaches model obstacle avoidance through potentials or cost terms, which only enforce safety as a soft constraint. In contrast, safety-critical control requires hard guarantees. We investigate the use of control barrier functions (CBFs) at the inverse kinematics (IK) layer of shared autonomy, targeting post-blend safety while preserving task performance. Our approach is evaluated in simulation on representative cluttered environments and in a VR teleoperation study comparing pure teleoperation with shared autonomy. Across conditions, employing CBFs at the IK layer reduces violation time and increases minimum clearance while maintaining task performance. In the user study, participants reported higher perceived safety and trust, lower interference, and an overall preference for shared autonomy with our safety filter. Additional materials are available at https://berkguler.github.io/barrierik.
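
The idea of a hard safety filter applied after blending can be sketched for a much simpler setting than the IK layer studied in this abstract. The sketch assumes a point with single-integrator dynamics and one spherical obstacle, for which the CBF quadratic program has a closed-form solution. The barrier h(x) = ||x - c||^2 - r^2, the gain `alpha`, and the function names are assumptions for illustration.

```python
def blend(u_user, u_auto, weight=0.5):
    """Linear blending of operator and autonomy velocity commands."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(u_user, u_auto)]

def cbf_filter(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that dh/dt + alpha * h >= 0 holds.

    h(x) = ||x - center||^2 - radius^2 is non-negative outside the obstacle.
    With x_dot = u the constraint is a . u + alpha * h >= 0, where a = 2 (x - center).
    """
    a = [2.0 * (xi - ci) for xi, ci in zip(x, center)]
    h = sum((xi - ci) ** 2 for xi, ci in zip(x, center)) - radius ** 2
    margin = sum(ai * ui for ai, ui in zip(a, u_nom)) + alpha * h
    if margin >= 0.0:
        return list(u_nom)          # nominal command already satisfies the constraint
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        return [0.0] * len(u_nom)   # degenerate case at the obstacle centre: stop
    # Project onto the constraint boundary (closed-form solution of the one-constraint QP).
    return [ui - (margin / norm_sq) * ai for ui, ai in zip(u_nom, a)]
```

For example, with the point at (2, 0), a unit-radius obstacle at the origin, and a blended command of (-5, 0), the filter slows the approach to (-0.75, 0), which sits exactly on the constraint boundary, while commands pointing away from the obstacle pass through unchanged.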

Abstract:
Topological mapping offers a compact and robust representation for navigation, but progress in the field is hindered by the lack of standardized evaluation metrics, datasets, and protocols. Existing systems are evaluated in different environments under different criteria, preventing fair and reproducible comparison. Moreover, a key challenge, perceptual aliasing, remains under-quantified despite its strong influence on system performance. We address these gaps by (i) formalizing topological consistency as the fundamental property of topological maps and showing that, under mild assumptions, localization accuracy provides an efficient and interpretable surrogate metric, and (ii) introducing, to our knowledge, the first quantitative measure of dataset ambiguity for fair comparison across environments. To support this protocol, we curate a diverse benchmark dataset with calibrated ambiguity levels, implement and release deep learning-based baseline systems, and evaluate them alongside classical methods. Our experiments provide new insights into the limitations of current approaches under perceptual aliasing. All datasets, baselines, and evaluation tools are publicly released to foster consistent and reproducible research in topological mapping.

Abstract:
Autonomous vehicle navigation in complex environments such as dense and fast-moving highways and merging scenarios remains an active area of research. In the past decade, many planning and control approaches have used reinforcement learning (RL) with notable success. However, a key limitation of RL is its reliance on well-specified reward functions, which often fail to capture the full semantic and social complexity of diverse, out-of-distribution situations. As a result, a rapidly growing line of research explores using Large Language Models (LLMs) to replace or supplement RL for direct planning and control, on account of their ability to reason about rich semantic context. However, LLMs present significant drawbacks: they can be unstable in zero-shot safety-critical settings, produce inconsistent outputs, and often depend on expensive API calls with network latency. This motivates our investigation into whether small, locally deployed LLMs (≤4B parameters) can meaningfully support autonomous highway driving through reward shaping rather than direct control. These models are attractive for practical deployment as they can run on a single GPU and avoid external API dependencies. We present a case study comparing RL-only, LLM-only, and hybrid approaches, where LLMs augment RL rewards by scoring state-action transitions during training, while standard RL policies execute at test time. Our findings reveal that RL-only agents achieve moderate success rates (73-89%) with reasonable efficiency, LLM-only agents can reach higher success rates (up to 94%) but with severely degraded speed performance, and hybrid approaches consistently fall between these extremes. Critically, despite explicit efficiency instructions, LLM-influenced approaches exhibit systematic conservative bias with substantial model-dependent variability, highlighting important limitations of current small LLMs for safety-critical control tasks.

Abstract:
Ensuring authenticity of hyperspectral imagery (HSI) at the moment of acquisition is critical: subtle spectral attacks can mislead downstream analysis before digital defenses take effect. By injecting the optical key before digitization, the effect is created in hardware and cannot be replicated in software, resulting in stronger protection. We present the Spectral Security Imaging System (SSIS), an acquisition-stage approach that injects a data-aware additive spectral key in the optical path of a pushbroom imager, binding integrity to the measurement while preserving the class-informative structure. We describe the complete system forward model, detector-key joint optimization, and a laboratory prototype together with a thorough calibration process. A laboratory dataset (unsigned and optically signed cubes) supports two evaluations. For manipulation detection, SSIS achieves detection accuracy of 92% with visual distortion of PSNR = 41.5 dB and SSIM = 0.981. For downstream classification on clean data, macro-F1 remains close to the unsigned ceiling, about 93% of the monochromator baseline (0.915 vs 0.981) and about 99% of the pushbroom baseline (0.892 vs 0.903), while outperforming multiplicative and watermarking baselines by up to 16.1 points in macro-F1 and 19.4 points in accuracy.

Abstract:
Magnetic actuation enables surgical robots to navigate complex anatomical pathways while reducing tissue trauma and improving surgical precision. However, clinical deployment is limited by the challenges of controlling such systems under fluoroscopic imaging, which provides low framerate and noisy pose feedback. This paper presents a control framework that remains accurate and stable under such conditions by combining a nonlinear model predictive control (NMPC) framework that directly outputs coil currents, an analytically differentiable magnetic field model based on Zernike polynomials, and a Kalman filter to estimate the robot state. Experimental validation is conducted with two magnetic robots in a 3D-printed fluid workspace and a spine phantom replicating drug delivery in the epidural space. Results show the proposed control method remains highly accurate when feedback is downsampled to 3 Hz with added Gaussian noise (σ = 2), mimicking clinical fluoroscopy. In the spine phantom experiments, the proposed method successfully executed a drug delivery trajectory with a root mean square (RMS) position error of 1.18 mm while maintaining safe clearance from critical anatomical boundaries.

Abstract:
Targeted coordination of swarm robotic systems is an emerging robot control task arising from numerous applications across diverse domains, ranging from medicine and agriculture to cyber-physical systems. However, state-of-the-art control techniques for robot swarms often require comprehensive measurement data for each robot and are not scalable with the growth of the swarm size. To address these issues, in this work, we develop a latent space control architecture for robust manipulation of patterns in arbitrarily large, potentially infinite, robot swarms using only partial measurements. In particular, we model such a swarm as a parameterized control system and formulate its patterns in terms of probability distributions. We then develop a moment kernel transform, which generates a reduced latent space representation for the pattern dynamics of the robot swarm over a reproducing kernel Hilbert space. The moment representation of the robot swarm can be learned using partial measurements of the swarm. Building on this, we propose a reinforcement learning (RL)-based pattern control framework operating on the moment latent space. In this framework, the data is organized to flow between the workspace and moment latent space episodically to achieve both robust control performance and high training efficiency. The proposed moment latent RL framework is validated by various pattern control tasks involving wheeled robot swarms, using both numerical simulations and TurtleBot3 swarms in the Gazebo simulator.

Abstract:
This paper tackles the problem of generating representations of underwater 3D terrain. Off-the-shelf generative models, trained on Internet-scale data but not on specialized underwater images, exhibit degraded realism, as images of the seafloor are relatively uncommon. To this end, we introduce DreamSea, a generative model to generate hyper-realistic underwater scenes. DreamSea is trained on real-world image databases collected from underwater robot surveys. Images from these surveys contain massive real seafloor observations and cover large areas. We extract 3D geometry and latent embeddings from the data with visual foundation models, and train a diffusion model that generates realistic seafloor images in RGBD channels, conditioned on novel fractal-distribution-based latent embeddings. We then fuse the generated images into a 3D map, building a 3DGS model supervised by 2D diffusion priors, which allows photorealistic novel view rendering. DreamSea is rigorously evaluated, demonstrating the ability to robustly generate large-scale underwater scenes that are consistent, diverse, and photorealistic. Our work drives impact in underwater robotics, and in particular, underwater robot simulation.

Abstract:
Vision-Language Models (VLMs) have demonstrated significant potential for end-to-end autonomous driving, yet a substantial gap remains between their current capabilities and the reliability necessary for real-world deployment. A critical challenge is their fragility, characterized by hallucinations and poor generalization in out-of-distribution (OOD) scenarios. To bridge this gap, we introduce MTRDrive, a novel framework that integrates procedural driving experiences with a dynamic toolkit to enhance generalization and proactive decision-making. MTRDrive addresses these limitations through a closed-loop system that combines a memory-based experience retrieval mechanism with dynamic toolkits. This synergy enables the model to interact more effectively with its environment, improving both reasoning and decision-making capabilities with the help of our memory-tool synergistic reasoning. Additionally, we introduce a new benchmark based on complex Roadwork construction scenarios to rigorously evaluate zero-shot generalization. Extensive experiments demonstrate the effectiveness of our approach. On the public NavSim benchmark, MTRDrive achieves state-of-the-art performance with a driving metric score of 79.8% and a planning accuracy of 82.6%. To rigorously test generalization, we evaluate our model in a zero-shot setting on our new Roadwork-VLM benchmark. In this challenging out-of-distribution test, it attains a driving metric score of 80.2% and a planning accuracy of 33.5%, showcasing its strong ability to reason robustly in unseen scenarios. These results highlight the potential of MTRDrive to advance the field of autonomous driving towards safer and more reliable systems.

Abstract:
Learning from Demonstrations (LfD) has emerged as a prominent paradigm for imparting motion skills to robotic systems. Dynamical systems (DS) offer a potent mathematical framework for representing point-to-point motions, a critical requirement for numerous practical applications in robotics. While existing approaches typically construct DS models by employing diffeomorphic mappings to morph stable reference systems toward observed demonstrations, the requirement to preserve strict diffeomorphic properties introduces architectural constraints on neural network design, thereby constraining their expressiveness. To address this limitation, we present a DS-based LfD formulation that relaxes traditional diffeomorphism constraints. Our framework employs bidirectional temporal integration of ordinary differential equations (ODEs) to simultaneously satisfy stability guarantees and trajectory alignment objectives. A key innovation lies in a variational calculus framework for Jacobian estimation, enabling efficient computation of DS vector fields while maintaining numerical stability. Comprehensive evaluations demonstrate that our method achieves 33.7% improvement in trajectory reproduction accuracy compared to state-of-the-art baselines while preserving Lyapunov stability. The proposed methodology significantly expands the representational capacity of DS-based learning systems, enabling robust reproduction of complex motion patterns.

Abstract:
Learning from demonstration (LfD) enables robots to acquire new skills from human examples without explicit programming. Dynamical system (DS)-based approaches, in particular, have shown robustness to disturbances and adaptability in unstructured environments. However, existing methods often fail to incorporate task-specific constraints, such as grasp locations, execution starting points, or motion restrictions, that are critical for reliable execution. This limitation becomes especially problematic in tool-use scenarios, where both the environment and the grasped tool impose strict restrictions on feasible motions. To address this challenge, we propose a novel constraint-aware DS framework that automatically extracts and encodes task-specific constraints directly from demonstration data. The key idea is that task-critical configurations, repeatedly observed across successful demonstrations, can be identified and modeled as essential regions for task success using Gaussian Process Regression. By embedding these constraints, the proposed method generates motions that remain robust to environmental variations and tool-induced limitations. Experiments with a 7-DoF robotic manipulator demonstrate that our framework significantly improves task success rates over state-of-the-art methods. Real-world evaluations on daily-life tasks, such as dishware collection, further confirm its practicality and potential for real-world robotic applications.

Abstract:
We present LLM Trainer, a fully automated pipeline that leverages the world knowledge of Large Language Models (LLMs) to transform a small number of human demonstrations (as few as one) into a large robot dataset for imitation learning. Our approach decomposes demonstration generation into two steps: (1) offline demonstration annotation that extracts keyframes, salient objects, and pose-object relations; and (2) online keypose retargeting that adapts those keyframes to a new scene, given an initial observation. Using these modified keypoints, our system warps the original demonstration to generate a new trajectory, which is then executed, and the resulting demo, if successful, is saved. Because the annotation is reusable across scenes, we use Thompson sampling to optimize the annotation, significantly improving generation success rate. We evaluate our method on a range of tasks, and find that our data annotation method consistently outperforms expert-engineered baselines. We further show an ensemble policy that combines the optimized LLM feed-forward plan with a learned feedback imitation learning controller. Finally, we demonstrate hardware feasibility on a Franka Emika Panda robot.
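
The Thompson-sampling step mentioned above can be illustrated with a generic Beta-Bernoulli bandit. This is only a sketch under stated assumptions, not the authors' implementation: each "arm" stands for one candidate annotation, a "success" stands for one successfully generated demonstration, and the names `thompson_select`, `run_bandit`, and `true_rates` are hypothetical.

```python
import random


def thompson_select(successes, failures, rng):
    """Pick an arm by sampling each arm's Beta posterior and taking the argmax."""
    # Beta(s + 1, f + 1) is the posterior under a uniform Beta(1, 1) prior.
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_rates, n_rounds, seed=0):
    """Simulate Thompson sampling over arms with hidden success rates.

    `true_rates` is a stand-in for the unknown generation success rate of
    each candidate annotation (an assumption made for this illustration).
    """
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_rounds):
        arm = thompson_select(successes, failures, rng)
        # Simulated rollout: succeeds with the arm's hidden probability.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


if __name__ == "__main__":
    s, f = run_bandit([0.2, 0.8, 0.5], n_rounds=2000)
    print("pulls per arm:", [a + b for a, b in zip(s, f)])
```

Sampling from the posterior, rather than always picking the arm with the best observed rate, keeps some exploration of uncertain candidates while concentrating trials on those that appear to work.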

Abstract:
Visual seafloor imaging using autonomous underwater vehicles (AUVs) has become an established method for seafloor mapping and monitoring. With AUVs now achieving multiweek endurance and several hundred kilometers of range on a single charge, image quality assessment (IQA) on-board vehicles in the field is necessary for robust data acquisition given the sensitivity of underwater imaging surveys to environmental conditions. This research develops a metric to assess seafloor image quality in situ, and demonstrates its use for quality assurance during a 21-day, shore-launched AUV campaign that visited three sites up to 170 km from shore. The metric was transmitted via satellite communication along with vehicle telemetry to shore-based AUV operators during regular surfacing intervals without relying on vehicle recovery. The method was implemented on the seafloor laser scan and strobed imaging system BioCam, deployed on the Autosub Long Range (ALR) AUV (also known as Boaty McBoatface) in the North Sea. Several tens of hectares of seafloor imagery were collected, and image quality scores were transmitted. This information was used to retask the AUV and maximize the quality of acquired images within operational constraints. Data products generated from the collected imagery show the improvements achieved that would otherwise have been missed. This highlights the importance of remote awareness of data quality to facilitate longer consecutive mapping missions without vehicle recovery.

Abstract:
Typical LiDAR SLAM architectures feature a front-end for odometry estimation and a back-end for refining and optimizing the trajectory and map, commonly through loop closures. However, loop closure detection in large-scale missions presents significant computational challenges due to the need to identify, verify, and process numerous candidate pairs for pose graph optimization. Keyframe sampling bridges the front-end and back-end by selecting frames for storing and processing during global optimization. This article proposes an online keyframe sampling approach that constructs the pose graph using the most impactful keyframes for loop closure. We introduce the Minimal Subset Approach (MSA), which optimizes two key objectives: redundancy minimization and information preservation, implemented within a sliding window framework. By operating in the feature space rather than 3-D space, MSA efficiently reduces redundant keyframes while retaining essential information. Evaluations on diverse public datasets show that the proposed approach outperforms naive methods in reducing false positive rates in place recognition, while delivering superior ATE and RPE in metric localization, without the need for manual parameter tuning. Additionally, MSA demonstrates efficiency and scalability by reducing memory usage and computational overhead during loop closure detection and pose graph optimization.

Abstract:
The computation of time-optimal velocity profiles along prescribed paths, subject to generic acceleration constraints, is a crucial problem in robot trajectory planning, with particular relevance to autonomous racing. However, the existing methods either support arbitrary acceleration constraints at high computational cost or use conservative box constraints for computational efficiency. We propose FBGA, a new Forward-Backward algorithm with Generic Acceleration constraints, which achieves both high accuracy and low computation time. FBGA operates forward and backward passes to maximize the velocity profile in short, discretized path segments, while satisfying user-defined performance limits. Tested on five racetracks and two vehicle classes, FBGA handles complex, non-convex acceleration constraints with custom formulations. Its maneuvers and lap times closely match optimal control baselines (within 0.11% to 0.36%), while being up to three orders of magnitude faster. FBGA maintains high accuracy even with coarse discretization, making it well suited for online multi-query trajectory planning.
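
For orientation, the classic forward-backward idea can be sketched with the conservative box constraints that the abstract contrasts against. This is not FBGA itself: the independent limits `a_lat_max`, `a_acc_max`, and `a_brake_max`, the function name, and the uniform step `ds` are assumptions made purely for illustration.

```python
import math


def forward_backward_profile(curvatures, ds, a_lat_max, a_acc_max, a_brake_max,
                             v_start=0.0, v_cap=100.0):
    """Speed profile along a discretized path under box acceleration limits.

    curvatures: path curvature [1/m] at each sample; ds: sample spacing [m].
    Returns one speed [m/s] per sample.
    """
    # Pointwise ceiling from lateral acceleration: v^2 * |kappa| <= a_lat_max.
    v = [
        min(v_cap, math.sqrt(a_lat_max / abs(k))) if abs(k) > 1e-9 else v_cap
        for k in curvatures
    ]
    v[0] = min(v[0], v_start)

    # Forward pass: speed may rise no faster than v_next^2 = v^2 + 2 * a_acc_max * ds.
    for i in range(1, len(v)):
        v[i] = min(v[i], math.sqrt(v[i - 1] ** 2 + 2.0 * a_acc_max * ds))

    # Backward pass: speed must be low enough to brake for what lies ahead.
    for i in range(len(v) - 2, -1, -1):
        v[i] = min(v[i], math.sqrt(v[i + 1] ** 2 + 2.0 * a_brake_max * ds))

    return v


if __name__ == "__main__":
    profile = forward_backward_profile(
        curvatures=[0.0, 0.0, 0.1, 0.0, 0.0], ds=10.0,
        a_lat_max=10.0, a_acc_max=5.0, a_brake_max=5.0, v_cap=50.0,
    )
    print([round(x, 2) for x in profile])
```

Because the lateral and longitudinal limits are treated independently here, the sketch cannot use combined acceleration capacity (for example, trail braking into a corner); handling such generic, possibly non-convex constraint sets is what the abstract attributes to FBGA.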

Abstract:
Mapping and understanding complex 3D environments is fundamental to how autonomous systems perceive and interact with the physical world, requiring both precise geometric reconstruction and rich semantic comprehension. While existing 3D semantic mapping systems excel at reconstructing and identifying predefined object instances, they lack the flexibility to efficiently build semantic maps with open-vocabulary during online operation. Although recent vision-language models have enabled open-vocabulary object recognition in 2D images, they haven't yet bridged the gap to 3D spatial understanding. The critical challenge lies in developing a training-free unified system that can simultaneously construct accurate 3D maps while maintaining semantic consistency and supporting natural language interactions in real time. In this paper, we develop a zero-shot framework that seamlessly integrates GPU-accelerated geometric reconstruction with open-vocabulary vision-language models through online instance-level semantic embedding fusion, guided by hierarchical object association with spatial indexing.

Abstract:
Drone technology is proliferating in many industries, including agriculture, logistics, defense, infrastructure, and environmental monitoring. Vision-based autonomy is one of its key enablers, particularly for real-world applications. This is essential for operating in novel, unstructured environments where traditional navigation methods may be unavailable. Autonomous drone racing has become the de facto benchmark for such systems. State-of-the-art research has shown that autonomous systems can surpass human-level performance in racing arenas. However, direct applicability to commercial and field operations is still limited as current systems are often trained and evaluated in highly controlled environments. In our contribution, the system's capabilities are analyzed within a controlled environment, where external tracking is available for ground-truth comparison, but also demonstrated in a challenging, uninstrumented environment, where ground-truth measurements were never available. We show that our approach can match the performance of professional human pilots in both scenarios. We also publicly release the data from the flights carried out by our approach and a world-class human pilot.

Abstract:
Preference-based Reinforcement Learning (PbRL) enables policy learning through simple queries comparing trajectories from a single policy, yet suffers from low query efficiency as policy bias limits trajectory diversity and reduces discriminable queries for learning human preferences. This paper identifies preference discriminability, which quantifies how easily a human can judge which trajectory is closer to their ideal behavior, as a key metric for improving query efficiency. To address this, we move beyond single-policy sampling and generate queries by comparing trajectories from different policies, as learning multiple policies from scratch promotes trajectory diversity without policy bias. We propose Discriminability-Aware Policy-to-Policy Preference-Based Efficient Reinforcement Learning (DAPPER), which integrates preference discriminability with trajectory diversification achieved by multiple policies. DAPPER trains new policies from scratch after each reward update and employs a discriminator that learns to estimate preference discriminability, enabling the prioritized sampling of more discriminable queries. During training, it jointly maximizes the preference reward and pr

Abstract:
Autonomous robotic handling requires accurate 3-D scene measurement followed by grasp planning. Conventional systems struggle with transparent or specular objects. Additionally, in hand-eye setups, moving through multiple viewpoints increases handling execution time. In this paper, we propose HEAPGrasp: Hand-Eye Active Perception to Grasp objects with diverse optical properties. To measure such objects, we focus on the ability to segment objects regardless of their optical properties in RGB images. We employ Shape from Silhouette based on the segmented images for 3-D measurement. To shorten the time required for multi-view capture with a hand-eye camera, we plan its trajectory using a cost function that balances 3-D measurement accuracy against its trajectory length. Real-robot experiments achieve a 94.3% grasp success rate on transparent, specular, and opaque objects, while reducing the hand-eye camera's trajectory length by 52% and handling execution time by 18% relative to baselines that circle around the scene for 3-D measurement.

Abstract:
Dynamic hand manipulation requires both precise motion control and rapid actuation capability, yet most existing robotic hands are primarily optimized for dexterity and accuracy, often sacrificing speed performance. To address this limitation, this work presents a mode-switchable wire-driven robotic hand incorporating a planetary transmission mechanism capable of continuously varying output speed and torque characteristics. The proposed system operates in two distinct modes: a torque-enhancing mode for stable and precise grasping, and a speed-amplifying mode for agile dynamic motions such as flicking and throwing. A dedicated mechanical switching mechanism enables real-time transition between the two transmission modes according to task requirements. A full prototype of the five-finger robotic hand is currently under fabrication and system integration, and preliminary analytical results demonstrate the feasibility of both precise grasping and enhanced high-speed manipulation capability. These results validate the proposed transmission architecture as a promising solution for robotic hands requiring both dexterous and dynamic manipulation.

Abstract:
Dexterous control of multi-joint manipulators such as humanoid robot hands remains challenging when relying solely on visual feedback. Cameras are fundamentally limited in measuring contact forces, slip, and surface deformation during physical interaction, and are susceptible to occlusion in contact-rich scenarios. Tactile sensing is therefore widely considered essential for robust dexterous manipulation. However, existing approaches predominantly rely on high-cost sensors, imposing non-trivial burdens on data collection and robot deployment, which constrains the scalability of tactile sensing in practical robotic systems. To address these limitations, we present a low-dimensional wearable tactile glove as a scalable platform for visuo-tactile robot hand control, and propose a two-level learning framework built upon it. The glove incorporates 20 FSR400 sensors and achieves stable 300 Hz acquisition through dual multiplexing and WiFi TCP communication with clock synchronization. Hardware validation confirms low inter-frame jitter and noise-free signal acquisition across all channels. The proposed framework first investigates whether binary tactile signals are sufficient to recover meaningful force distributions through learning, and subsequently extends to visuo-tactile representation learning, where tactile and visual modalities are jointly leveraged to learn shared cross-modal representations for downstream manipulation tasks.

Abstract:
As mobile service robots expand into human living environments, their ability to negotiate structured obstacles, such as thresholds, curbs, and stairs, has become increasingly important. Transformable wheels offer a compelling alternative to high-DoF locomotion by preserving the efficiency and maneuverability of conventional wheels on flat ground while selectively reconfiguring their geometry only when obstacle negotiation is required. Among such systems, the 1-DoF RPRP transformable wheel achieves step climbing with minimal actuation by mechanically coupling radial transformation and spoke tilting through an internal linkage. This reduced-actuation architecture, however, also creates a distinctive design challenge: because the mechanism lacks kinematic redundancy, a very small number of trajectory parameters exert a disproportionately large influence on climbing behavior. As a result, performance is governed less by control flexibility and more by how transformation timing and posture are coordinated throughout the climbing cycle. Despite this, prior studies on transformable wheels have largely focused on mechanism design and kinematic feasibility, leaving insufficient understanding of how trajectory design shapes the trade-offs among motion smoothness, actuator load, and power demand [1][3]. To address this gap, this study presents a trajectory-level design-space exploration framework for a 1-DoF transformable wheel, in which the obstacle-climbing motion is parameterized us

Abstract:
Robotic food scooping is a critical manipulation skill for food preparation and service robots. However, existing robot learning algorithms, especially learn-from-demonstration methods, still struggle to handle diverse and dynamic food states, which often results in spillage and reduced reliability. In this work, we introduce GRITS: A Spillage-Aware Guided Diffusion Policy for Robot Food Scooping Tasks. This framework leverages guided diffusion policy to minimize food spillage during scooping and to ensure reliable transfer of food items from the initial to the target location. Specifically, we design a spillage predictor that estimates the probability of spillage given current observation and action rollout. The predictor is trained on a large-scale simulated dataset with food spillage scenarios, constructed from four primitive shapes (spheres, cubes, cones, and cylinders) with varied physical properties such as mass, friction, and particle size. At inference time, the predictor serves as a differentiable guidance signal, steering the diffusion sampling process toward safer trajectories while preserving task success. We validate GRITS on a real-world robotic food scooping platform. GRITS is trained on six food categories and evaluated on ten unseen categories with different shapes and quantities. GRITS achieves an 82% task success rate and a 4% spillage rate, reducing spillage by over 40% compared to baselines without guidance, thereby demonstrating its effectiveness. More details are available on our project website: https://hcis-lab.github.io/GRITS/.

Abstract:
Geometrically accurate and semantically expressive map representations have proven invaluable for robot deployment and task planning in unknown environments. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments still presents open challenges, mainly due to computational requirements. In this paper we present FindAnything, an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything combines pure geometric and open-vocabulary semantic information for a higher level of understanding. It proposes an efficient storage of open-vocabulary information through the aggregation of features at the object level. Pixel-wise vision-language features are aggregated based on eSAM segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. We demonstrate that FindAnything performs on par with the state-of-the-art in terms of semantic accuracy while being substantially faster and more memory-efficient, allowing its deployment in large-scale environments and on resource-constrained devices, such as MAVs. We show that the real-time capabilities of FindAnything make it useful for downstream tasks, such as autonomous MAV exploration in a simulated Search and Rescue scenario. Project Page: https://ethz-mrl.github.io/findanything/.

Abstract:
Aerial outdoor semantic navigation requires robots to explore large, unstructured environments to locate target objects. Recent advances in semantic navigation have demonstrated open-set object-goal navigation in indoor settings, but these methods remain limited by constrained spatial ranges and structured layouts, making them unsuitable for long-range outdoor search. While outdoor semantic navigation approaches exist, they either rely on reactive policies based on current observations, which tend to produce short-sighted behaviors, or precompute scene graphs offline for navigation, limiting adaptability to online deployment. We present RAVEN, a 3D memory-based, behavior tree framework for aerial semantic navigation in unstructured outdoor environments. It (1) uses a spatially consistent semantic voxel-ray map as persistent memory, enabling long-horizon planning and avoiding purely reactive behaviors, (2) combines short-range voxel search and long-range ray search to scale to large environments, and (3) leverages a large vision-language model to suggest auxiliary cues, mitigating the sparsity of outdoor targets. These components are coordinated by a behavior tree, which adaptively switches behaviors for robust operation. We evaluate RAVEN in 10 photorealistic outdoor simulation environments over 100 semantic tasks, encompassing single-object search, multi-class, multi-instance navigation and sequential task changes. Results show that RAVEN outperforms baselines by 85.25% in simulation, and we demonstrate its real-world applicability through deployment on an aerial robot in outdoor field tests.

Abstract:
Accurate and real-time 6-DoF localization is mission-critical for autonomous lunar landing, yet existing approaches remain limited: visual odometry (VO) drifts unboundedly, while map-based absolute localization fails in texture-sparse or low-light terrain. We introduce KANLoc, a monocular localization framework that tightly couples VO with a lightweight but robust absolute pose regressor. At its core is a Kolmogorov-Arnold Network (KAN) that learns the complex mapping from image features to map coordinates, producing sparse but highly reliable global pose anchors. These anchors are fused into a bundle adjustment framework, effectively canceling drift while retaining local motion precision. KANLoc delivers three key advances: (i) a KAN-based pose regressor that achieves high accuracy with remarkable parameter efficiency, (ii) a hybrid VO-absolute localization scheme that yields globally consistent real-time trajectories (>=15 FPS), and (iii) a tailored data augmentation strategy that improves robustness to sensor occlusion. On both realistic synthetic and real lunar landing datasets, KANLoc reduces average translation and rotation error by 32% and 45%, respectively, with per-trajectory gains of up to 45%/48%, outperforming strong baselines.

Abstract:
Robotic cloth folding is a challenging task, particularly when considering dynamic folding tasks, which aim at folding cloth by fast motions that leverage its dynamics. When subject to such fast motions, the complexity of cloth dynamics hinders both system identification and planning of folding trajectories, resulting in a difficult simulation-to-reality transfer when using physical models of cloth. Compared to the dexterity that humans exhibit when performing folding tasks, robotic approaches usually employ small garments with quite rigid dynamics, and are either too slow, or fast but imprecise, requiring several attempts to achieve a reasonably good fold. In this paper, we tackle these challenges by generating fast folding trajectories with a novel model predictive controller, integrating physics-based simulation of cloth dynamics and efficient, kernel-based Koopman operator regression. Koopman operator regression, an increasingly popular machine learning technique for nonlinear system identification, is used to obtain a linear model for the cloth being folded. Such a surrogate model, trained with data from a high-fidelity, physics-based cloth simulator, can then be employed within a suitable model predictive control algorithm, in place of the costly, nonlinear one, to efficiently generate folding trajectories to be executed by a robotic manipulator. Both in simulated and real-robot experiments, we show how the linearization supplied by the Koopman operator-based model can be employed to efficiently generate fast folding trajectories to unseen poses, without sacrificing folding accuracy.

Abstract:
Soft robotics offers the opportunity to create dexterous machines that can safely handle delicate objects. Grippers made from deformable actuators and compliant materials can deform around the objects with which they come in contact. The continuum mechanics of flexible manipulators can be leveraged for safe manipulation tasks such as twisting and grasping during manufacturing. However, to achieve this goal, contact sensing and controls for manipulators in these soft systems still remain a challenge in the field. This letter demonstrates a shape-memory alloy actuated soft gripper, with each finger able to bend about multiple axes. This enables the soft gripper to perform twisting tasks and handle various and fragile objects. Using capacitive bend sensors, we also demonstrate that the measured impedance of motion can be used as a proxy for contact, greatly increasing performance in a delicate manipulation task.

Abstract:
We consider the problem of generating mobile manipulation instructions based on a target object image and a receptacle image. Conventional image captioning models are not able to generate appropriate instructions because their architectures are typically optimized for single-image inputs. In this study, we propose a model that identifies both the target object and the receptacle to generate free-form instruction sentences for mobile manipulation tasks. Furthermore, we introduce a novel training method, the human-centric calibration phase, that combines learning-based automatic evaluation metrics with n-gram-based automatic evaluation metrics. This method enables the model to learn the co-occurrence relationships between words and appropriate paraphrases. The results demonstrate that our proposed method outperforms baseline methods, including representative multimodal large language models, on all automatic evaluation metrics. Moreover, physical experiments reveal that using our method to augment data on language instructions improves the performance of an existing multimodal language understanding model for mobile manipulation.

Abstract:
This paper presents a novel hierarchical, safety-critical control framework that integrates distributed nonlinear model predictive controllers (DNMPCs) with control barrier functions (CBFs) to enable cooperative locomotion of multi-agent quadrupedal robots in complex environments. While NMPC-based methods are widely used to enforce safety constraints and navigate multi-robot systems (MRSs) through complex environments, trajectory optimization frameworks based on invariant sets offer formal safety guarantees for MRSs. CBFs, typically implemented via quadratic programs (QPs) at the planning layer, provide formal safety guarantees. However, their zero-control horizon limits their effectiveness for extended trajectory planning in inherently unstable, underactuated, and nonlinear legged robot models. Furthermore, the integration of CBFs into real-time NMPC for sophisticated MRSs, such as quadrupedal robot teams, remains underexplored. This paper develops computationally efficient, distributed NMPC algorithms that incorporate CBF-based collision safety guarantees within a consensus protocol, enabling longer planning horizons for safe cooperative locomotion under disturbances and rough terrain conditions. The optimal trajectories generated by the DNMPCs are tracked using full-order, nonlinear whole-body controllers at the low level. The proposed approach is validated through extensive numerical simulations with up to four Unitree A1 robots and hardware experiments involving two A1 robots subjected to external pushes, rough terrain, and uncertain obstacle information. Comparative results demonstrate that the proposed CBF-integrated DNMPC achieves a higher success rate than baseline NMPCs employing CBFs at the high or low-level layers.
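
As background for the CBF terminology, a single-constraint CBF quadratic program has a closed-form solution, sketched below. The setup is deliberately a toy and is not the paper's formulation: it assumes a single-integrator robot (velocity command `u`), one circular obstacle, the barrier h(x) = ||x - x_obs||^2 - r^2, and a linear class-K gain `alpha`; all names are hypothetical.

```python
def cbf_filter(x, u_des, obstacle, radius, alpha=1.0):
    """Minimally modify u_des so that grad_h(x) . u + alpha * h(x) >= 0.

    Solves min ||u - u_des||^2 subject to one affine constraint, which is a
    projection onto a half-space and therefore needs no QP solver.
    """
    dx = [xi - oi for xi, oi in zip(x, obstacle)]
    h = sum(d * d for d in dx) - radius ** 2   # h >= 0 means outside the obstacle
    grad_h = [2.0 * d for d in dx]

    # Constraint value at the desired command; non-negative means already safe.
    margin = sum(g * u for g, u in zip(grad_h, u_des)) + alpha * h
    if margin >= 0.0:
        return list(u_des)

    norm_sq = sum(g * g for g in grad_h)
    if norm_sq == 0.0:
        # At the obstacle centre the gradient vanishes; no correction is defined.
        return list(u_des)

    # Shift u_des along grad_h just far enough to reach the constraint boundary.
    scale = -margin / norm_sq
    return [u + scale * g for u, g in zip(u_des, grad_h)]


if __name__ == "__main__":
    # Robot at (2, 0) commanded straight toward an obstacle of radius 1 at the origin.
    print(cbf_filter([2.0, 0.0], [-5.0, 0.0], [0.0, 0.0], 1.0))
```

The filter only looks at the current state, which illustrates the zero-horizon limitation the abstract points out and the motivation for embedding such constraints inside a predictive controller instead.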

Abstract:
Human feet are crucial for supporting body weight and adapting to complex terrains. Adult-acquired flatfoot deformity (AAFD) arises from congenital or acquired causes, impairing the foot's ability to transition between flexible and rigid states, known as the lock-unlock mechanism during the stance and swing phases. In this study, we propose a plantar dynamic support system that utilizes pneumatic airbags, regulated through a model predictive control (MPC) strategy to minimize tracking errors. Experiments were conducted to measure kinetic parameters and electromyography signals, validating the system's efficacy. The results showed improvements in the normalized navicular height truncated (NNHt) index and reductions in muscle activity of the fibularis longus (FL), soleus (SOL), and gastrocnemius (GAST) by 4.42%, 16.65%, and 23.84%, respectively, during the stance phase.

Abstract:
Completely capturing the three-dimensional (3D) data of an object is essential in industrial and robotic applications. The task of next-best-view (NBV) planning is to calculate the next optimal viewpoint based on the current data, gradually achieving a complete 3D reconstruction of the object. However, many existing NBV planning algorithms incur heavy computational costs due to the extensive use of ray-casting. Specifically, this framework refits different types of voxel clusters into ellipsoids based on the voxel structure. Then, the next optimal viewpoint is selected from the candidate views using a projection-based viewpoint quality evaluation function in conjunction with a global partitioning strategy. This process replaces extensive ray-casting, significantly improving the computational efficiency. Comparison experiments in the simulation environment show that our framework achieves the highest point cloud coverage with low computational time compared to other frameworks. The real-world experiments also confirm the efficiency and feasibility of the framework. Our method will be made open source to benefit the community.

Abstract:
This paper introduces the BOW Planner, a scalable motion planning algorithm designed to navigate robots through complex environments using constrained Bayesian optimization (CBO). Unlike traditional methods, which often struggle with kinodynamic constraints such as velocity and acceleration limits, the BOW Planner excels by concentrating on a planning window of reachable velocities and employing CBO to sample control inputs efficiently. This approach enables the planner to manage high-dimensional objective functions and stringent safety constraints with minimal sampling, ensuring rapid and secure trajectory generation. Theoretical analysis confirms the algorithm's asymptotic convergence to near-optimal solutions, while extensive evaluations in cluttered and constrained settings reveal substantial improvements in computation times, trajectory lengths, and solution times compared to existing techniques. Successfully deployed across various real-world robotic systems, the BOW Planner demonstrates its practical significance through exceptional sample efficiency, safety-aware optimization, and rapid planning capabilities, making it a valuable tool for advancing robotic applications. The BOW Planner is released as an open-source package and videos of real-world and simulated experiments are available at https://bow-web.github.io/.

Abstract:
This paper proposes an integrated approach for the safe and efficient control of mobile robots in dynamic and uncertain environments. The approach consists of two key steps: one-shot multimodal motion prediction to anticipate motions of dynamic obstacles and model predictive control to incorporate these predictions into the motion planning process. Motion prediction is driven by an energy-based neural network that generates high-resolution, multi-step predictions in a single operation. The prediction outcomes are further utilized to create geometric shapes formulated as mathematical constraints. Instead of treating each dynamic obstacle individually, predicted obstacles are grouped by proximity in an unsupervised way to improve performance and efficiency. The overall collision-free navigation is handled by model predictive control with a specific design for proactive dynamic obstacle avoidance. The proposed approach allows mobile robots to navigate effectively in dynamic environments. Its performance is assessed across various scenarios that represent typical warehouse settings. The results demonstrate that the proposed approach outperforms other existing dynamic obstacle avoidance methods.

Abstract:
Research indicates that single-agent reinforcement learning is vulnerable to adversarial attacks, which can lead to decision-making errors. Similarly, multi-agent deep reinforcement learning (MADRL) systems face analogous adversarial threats. However, existing attack methods require substantial investment in agent design and computational resources, limiting the feasibility of such attacks. To address this issue, we reformulate adversarial attacks as an optimization problem and propose the MREFDW-GA algorithm, which integrates dimension-weighted perturbations and a multi-stage robustness evaluation function. This combination enhances the efficiency of evolutionary algorithms while dynamically adjusting search strategies to escape local optima. Experimental results demonstrate that this method can effectively execute black-box attacks by iteratively generating adversarial perturbations, significantly degrading the performance of MADRL systems and opening new research avenues for efficient black-box attacks.

Abstract:
Rigid-soft hybrid grippers show good protection and high-payload capacity for fragile and heavy objects. However, because of inadequate actuation speed, it is still challenging for hybrid grippers to grasp moving objects in unstructured environments. To address this limitation, this article presents a rigid-soft hybrid gripper with four grasping modes that can not only grasp deformable and heavy objects like tofu and a dumbbell but also capture moving objects with low response time. Inspired by the structure of human fingers, a rigid-soft hybrid finger with a soft outer body and a rigid inner skeleton is designed. The finger consists of a soft pneumatic actuator (SPA), an endoskeleton linkage, a self-locking mechanism, a fast-responding mechanism, a pneumatic artificial muscle actuator (PAMA), a power transition bolt, and two split pins. The fast response speed of the PAMA and the amplification of the endoskeleton linkage enable the gripper to capture moving objects. A kinematic model is established to verify the endoskeleton linkage's angular velocity amplification ability and describe its bending angle. Experiments demonstrate that the rigid-soft finger can bend to 145.14° within 71 ms. Finally, the gripper is mounted on a robotic arm to demonstrate that it can grasp fragile and deformable objects, hold heavy objects, and capture moving objects. The grasping strategies and structure of the gripper provide a new idea for designing a high-performance rigid-soft hybrid gripper.

Abstract:
Recent 3D Gaussian Splatting (3DGS) techniques for visual Simultaneous Localization and Mapping (SLAM) have significantly progressed in tracking and high-fidelity mapping. However, their sequential optimization framework and sensitivity to dynamic objects limit real-time performance and robustness in real-world scenarios. We present UP-SLAM, a real-time RGB-D SLAM system for dynamic environments that decouples tracking and mapping through a parallelized framework. A probabilistic anchor is employed to manage Gaussian primitives adaptively, enabling efficient initialization and pruning without hand-crafted thresholds. To robustly filter dynamic regions during tracking, we propose a training-free uncertainty estimator that fuses multi-modal residuals to estimate per-pixel motion uncertainty, achieving open-set dynamic object handling without reliance on semantic labels. Furthermore, a temporal encoder is designed to enhance rendering quality, while a shallow multilayer perceptron transforms low-dimensional features into DINO features, enriching the Gaussian field and enhancing uncertainty prediction robustness. Extensive experiments on multiple challenging datasets suggest that UP-SLAM outperforms state-of-the-art methods in both localization accuracy (by 59.8%) and rendering quality (by 4.72 dB PSNR), while maintaining real-time performance and producing reusable, artifact-free static maps in dynamic environments. Project page: https://aczheng-cai.github.io/up_slam.github.io/

Abstract:
This paper proposes a perceptual no-reference (blind) haptic quality assessment framework for predicting the Quality of Experience (QoE) in teleoperation systems with force feedback. The proposed approach employs a deep neural network that combines semantic and distortion-based channels. The semantic network generates a semantic vector that characterizes the interaction between the robot and its environment. Meanwhile, the distortion network decomposes complex noise introduced by control algorithms and communication artifacts into artificial noise of known types. To train the proposed network, we also construct an augmented dataset for perceptual quality assessment in teleoperation based on subjective experiments. The dataset augmentation and the model are validated with real-world teleoperation tasks. Our experimental results demonstrate that the performance of our No-Reference (NR) haptic quality assessment model is comparable to or surpasses that of commonly used Full-Reference (FR) methods, achieving Spearman's Rank-Order Correlation scores above 0.85 for QoE prediction.
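
For readers unfamiliar with the reported metric, the following sketch shows how Spearman's rank-order correlation between predicted and subjective quality scores can be computed: rank both lists (averaging ranks over ties) and take the Pearson correlation of the ranks. The score values are invented for illustration and are not data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical model predictions and subjective ratings (illustrative only).
predicted_qoe = [0.2, 0.5, 0.4, 0.9, 0.7]
subjective_qoe = [1, 3, 2, 4, 5]
rho = spearman(predicted_qoe, subjective_qoe)
```

Because the metric depends only on rank order, a model can score highly even when its outputs are on a different scale from the subjective ratings.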

Abstract:
Collision detection between robotic hands and manipulated objects is crucial to model predictive control (MPC) for contact-rich dexterous manipulation. Based on the Gilbert-Johnson-Keerthi (GJK) algorithm and the expanding polytope algorithm (EPA), the GJK-EPA method has achieved success while requiring iterative optimizations. Recently, a signed distance function (SDF) based collision detection (C-SDF) method has been used to estimate the contact information, which avoids iterations at the cost of matrix derivative operations. Inspired by this, in this paper, a simplified nonnegative least squares (NNLS) based quadratic programming (QP) algorithm is used to construct an approximated solution to the QP formulation of collision detection, for estimating collision points. Then, contact distances and Jacobians are calculated via physics computations and differentiable kinematics. Consequently, a C-NNLS method is proposed, which uses the NNLS formulation to approximate the collision detection routine in the MPC while avoiding iterative optimizations and matrix derivatives. The C-NNLS method is applied to extensive simulated tasks, achieving lower average error while consuming 45.59% less time on average compared with the C-SDF method. Furthermore, the C-NNLS method is deployed on a real Allegro hand for on-palm reorientation. Results show that the C-NNLS method reduces average task time by 30.33% compared with the C-SDF method while maintaining high-quality dexterous manipulation.

Abstract:
Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot interaction data collection. Despite substantial efforts, progress toward this goal has been bottlenecked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework Aina, we are now one significant step closer to achieving this goal. Aina enables learning multi-fingered policies from in-the-wild data using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the environment. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across five everyday manipulation tasks. Robot rollouts can be best viewed on our website: aina-robot.github.io.

Abstract:
We introduce AWENet (Attention-guided Wavelet Enhancement Network), an efficient self-supervised network for joint interest point detection and description that balances computational speed with feature accuracy. The network preserves fine structural details while employing multi-scale attention to enhance the discriminability of descriptors, leading to more precise and reliable interest point correspondences. Evaluations on the HPatches dataset demonstrate that AWENet achieves competitive performance in repeatability, localization accuracy, and matching robustness. Its lightweight design ensures fast processing and low computational cost, making it well-suited for applications where efficiency is critical. Qualitative results show that the network generates dense and accurate correspondences under diverse transformations, including changes in viewpoint and illumination. Overall, AWENet provides a practical and effective solution for learning local features, achieving strong matching performance without relying on heavy computation.

Abstract:
Generative video modeling has emerged as a compelling tool to zero-shot reason about plausible physical interactions for open-world manipulation. Yet, it remains a challenge to translate such human-led motions into the low-level actions demanded by robotic systems. We observe that given an initial image and task instruction, these models excel at synthesizing sensible object motions. Thus, we introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. By separating the state changes from the actuators that realize those changes, Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories, including rigid, articulated, deformable, and granular. Through trajectory optimization or reinforcement learning, Dream2Flow converts reconstructed 3D object flow into executable low-level commands without task-specific demonstrations. Simulation and real-world experiments highlight 3D object flow as a general and scalable interface for adapting video generation models to open-world robotic manipulation. Videos, visualizations, and appendix are available at https://dream2flow.github.io/.

Abstract:
The application of monocular dense Simultaneous Localization and Mapping (SLAM) is often hindered by high latency, large GPU memory consumption, and reliance on camera calibration. To relax these constraints, we propose EC3R-SLAM, a novel calibration-free monocular dense SLAM framework that jointly achieves high localization and mapping accuracy, low latency, and low GPU memory consumption. The framework achieves this efficiency by coupling a tracking module, which maintains a sparse map of feature points, with a mapping module based on a feed-forward 3D reconstruction model that simultaneously estimates camera intrinsics. In addition, both local and global loop closures are incorporated to ensure mid-term and long-term data association, enforcing multi-view consistency and thereby enhancing the overall accuracy and robustness of the system. Experiments across multiple benchmarks show that EC3R-SLAM achieves competitive performance compared to state-of-the-art methods, while being faster and more memory-efficient. Moreover, it runs effectively even on resource-constrained platforms such as laptops and Jetson Orin NX, highlighting its potential for real-world robotics applications.

Abstract:
Modern coverage path planning (CPP) for holonomic UAVs in emergency response must contend with diverse environments where regions of interest (ROIs) often take the form of highly irregular polygons, characterized by asymmetric shapes, dense clusters of concavities, and multiple internal holes. Modern CPP pipelines typically rely on decomposition strategies that overfragment such polygons into numerous subregions. This increases the number of sweep segments and connectors, which in turn adds inter-region travel and forces more frequent reorientation. These effects ultimately result in longer completion times and degraded trajectory quality. We address this with a decomposition strategy that applies a recursive dual-axis monotonicity criterion, with cuts guided by a cumulative gap severity metric. This approach distributes clusters of concavities more evenly across subregions and produces a minimal set of partitions that remain sweepable under a parallel-track maneuver. We pair this with a global optimizer that jointly selects sweep paths and inter-partition transitions to minimize total path length, transition overhead, and turn count. We demonstrate that our proposed approach achieves the lowest mean path-length and completion-time overhead among 15 other CPP pipelines.

Abstract:
End-to-end robot manipulation policies offer significant potential for enabling embodied agents to understand and interact with the world. Unlike traditional modular pipelines, end-to-end learning mitigates key limitations such as information loss between modules and feature misalignment caused by isolated optimization targets. Despite these advantages, existing end-to-end neural networks for robotic manipulation--including those based on large VLM/VLA models--remain insufficiently performant for large-scale practical deployment. In this paper, we take a step towards an end-to-end manipulation policy that is generalizable, accurate and reliable. To achieve this goal, we propose a novel Chain of Moving Oriented Keypoints (CoMOK) formulation for robotic manipulation. Our formulation is used as the action representation of a neural policy, which can be trained in an end-to-end fashion. Such an action representation is general, as it extends the standard end-effector pose action representation and supports a diverse set of manipulation tasks in a unified manner. The oriented keypoint in our method enables natural generalization to objects with different shapes and sizes, while achieving sub-centimeter accuracy. Moreover, our formulation can easily handle multi-stage tasks, multi-modal robot behaviors, and deformable objects. Extensive simulated and hardware experiments demonstrate the effectiveness of our method.

Abstract:
Automatic response generation of video comments (RGVC) aims to generate a target reply to the content of the target comment based on the video context. Existing works for RGVC normally rely on large language models (LLMs), and mostly neglect the importance of extracting key information from both linguistic and visual perspectives, thereby limiting the potential to generate fluent and targeted responses in real applications. In this work, we introduce a lightweight response agent with a novel multimodal informative seeking approach (Mis), which includes a Comment Context Retrieval (CCR) module and a Key Vision Selection (KVS) module to simultaneously seek essential information from both textual and visual modalities. Specifically, the CCR module enriches the dialogue context by retrieving relevant comments from other comment blocks, while the KVS module utilizes a spatial-temporal Transformer with cross-modal attention to highlight the most crucial information in the video. Moreover, we also build a large-scale user-level multimodal chitchat (UMC) dataset with exact comment-response interactions to better investigate RGVC. Extensive experiments demonstrate that our model effectively captures human points of interest and generates more fluent and diverse responses than state-of-the-art methods in both open and closed resources.

Abstract:
Outdoor loop closure detection is essential for mitigating accumulated drift in SLAM and generating a globally consistent map. Semantic graph matching methods utilize object-level topology for distinctive scene representation but rely on environments with rich and distinguishable objects. Moreover, accurately matching nodes remains difficult due to ambiguities among same-class semantic nodes. These challenges limit their effectiveness in varied road environments, highlighting the need for representations that are both robust and adaptable. To address this, we introduce SD-SGM, a novel loop closure detection framework combining the powerful context-adaptation capabilities of structural descriptors with the high-level semantic reasoning abilities of semantic graphs. Initially, we extract semantic graphs alongside global structural descriptors from point clouds. Distinctive local graph features are then used to generate candidate node pairs, and the maximal clique algorithm identifies correspondences that are globally consistent. The similarity scores of both methods are then evaluated, and a cross-validation mechanism assesses their reliability and adaptively weights them. Extensive loop closure detection experiments on various datasets demonstrate that SD-SGM achieves state-of-the-art (SOTA) performance compared to strong baselines. Additionally, we verify its effectiveness in improving SLAM trajectory accuracy. We provide the code at: https://github.com/BIT-TYJ/SD-SGM.
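
The clique-based selection of globally consistent correspondences can be sketched as follows. This is an illustrative example, not the SD-SGM implementation: it assumes that two candidate node pairs are compatible when they preserve the pairwise distance between nodes (a common rigid-consistency test), and it uses a basic Bron-Kerbosch search. The point coordinates and function names are hypothetical.

```python
from itertools import combinations
from math import dist


def consistent(c1, c2, query_pts, target_pts, tol):
    """Two candidate pairs agree if they preserve the inter-node distance."""
    (a, b), (c, d) = c1, c2
    if a == c or b == d:  # a node may be matched at most once
        return False
    return abs(dist(query_pts[a], query_pts[c]) - dist(target_pts[b], target_pts[d])) <= tol


def maximum_clique(adj):
    """Basic Bron-Kerbosch search; returns the largest clique as a vertex list."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:  # r is a maximal clique
            if len(r) > len(best):
                best = list(r)
            return
        for v in list(p):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return best


def consistent_correspondences(candidates, query_pts, target_pts, tol=0.1):
    """Keep the largest mutually consistent subset of candidate node pairs."""
    adj = {i: set() for i in range(len(candidates))}
    for i, j in combinations(range(len(candidates)), 2):
        if consistent(candidates[i], candidates[j], query_pts, target_pts, tol):
            adj[i].add(j)
            adj[j].add(i)
    return sorted(candidates[i] for i in maximum_clique(adj))


# Hypothetical node centroids: the target is the query shifted by (5, 5) plus one extra node.
query_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
target_pts = [(5.0, 5.0), (6.0, 5.0), (5.0, 7.0), (9.0, 9.0)]
candidates = [(0, 0), (1, 1), (2, 2), (2, 3), (1, 3)]  # last two are wrong matches
inliers = consistent_correspondences(candidates, query_pts, target_pts)
```

The exhaustive search is exponential in the worst case, so it only suits small candidate sets like this one; larger problems typically need pruning or an approximate solver.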

Abstract:
Mobile manipulation requires the coordinated control of a mobile base and a robotic arm while simultaneously perceiving both global scene context and fine-grained object details. Existing single-view approaches often fail in unstructured environments due to limited fields of view, exploration, and generalization abilities. Moreover, classical controllers, although stable, struggle with efficiency and manipulability near singularities. To address these challenges, we propose M4Diffuser, a hybrid framework that integrates a Multi-View Diffusion Policy with a novel Reduced and Manipulability-aware QP (ReM-QP) controller for mobile manipulation. The diffusion policy leverages proprioceptive states and complementary camera perspectives with both close-range object details and global scene context to generate task-relevant end-effector goals in the world frame. These high-level goals are then executed by the ReM-QP controller, which eliminates slack variables for computational efficiency and incorporates manipulability-aware preferences for robustness near singularities. Comprehensive experiments in simulation and real-world environments show that M4Diffuser achieves 8% to 56% higher success rates and reduces collisions by 3% to 32% over baselines. Our approach demonstrates robust performance for smooth whole-body coordination and strong generalization to unseen tasks, paving the way for reliable mobile manipulation in unstructured environments. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/m4diffuser-anonymous.

Abstract:
We present an experimental validation framework for space robotics that leverages underwater environments to approximate microgravity dynamics. While neutral buoyancy conditions make underwater robotics an excellent platform for space robotics validation, there are still dynamical and environmental differences that need to be overcome. Given a high-level space mission specification, expressed in terms of a Signal Temporal Logic specification, we overcome these differences via the notion of maximal disturbance robustness of the mission. We formulate the motion planning problem such that the original space mission and the validation mission achieve the same disturbance robustness degree. The validation platform then executes its mission plan using a near-identical control strategy to the space mission where the closed-loop controller considers the spacecraft dynamics. Evaluating our validation framework relies on estimating disturbances during execution and comparing them to the disturbance robustness degree, providing practical evidence of operation in the space environment. Our evaluation features a dual-experiment setup: an underwater robot operating under near-neutral buoyancy conditions to validate the planning and control strategy of either an experimental planar spacecraft platform or a CubeSat in a high-fidelity space dynamics simulator.

Abstract:
Recent robot learning methods commonly rely on imitation learning from massive robotic datasets collected with teleoperation. When facing a new task, such methods generally require collecting a set of new teleoperation data and finetuning the policy. Furthermore, the teleoperation data collection pipeline is tedious and expensive. In contrast, humans are able to efficiently learn new tasks just by watching others perform them. In this paper, we introduce a novel two-stage framework that utilizes human demonstrations to learn a generalizable robot policy. Such a policy can directly take a human demonstration video as a prompt and perform new tasks without any new teleoperation data or model finetuning. In the first stage, we train a video generation model that captures a joint representation for both the human and robot demonstration video data using cross-prediction. In the second stage, we fuse the learned representation with a shared action space between human and robot using a novel prototypical contrastive loss. Empirical evaluations on real-world dexterous manipulation tasks show the effectiveness and generalization capabilities of our proposed method.

Abstract:
Robots benefit from sensory information to coordinate body movement, gain robustness against perturbations, and transition between different modes to adapt to various terrains. However, few amphibious robots can sense interactions with both terrestrial and aquatic environments. In this paper, we present a solution that uses Hall-effect sensors to sense foot contact forces and lateral hydrodynamic forces on a salamander-inspired amphibious robot. With two bus lines, the robot can simultaneously acquire this exteroceptive information at more than 500 Hz and proprioceptive information, such as joint positions and loads, at 100 Hz. The Hall-effect sensors used are compact, making them suitable for embedding in multiple positions within a robot, and exhibit high sensitivity to small forces. Moreover, because the sensor can be positioned separately from the measured object, waterproofing can be implemented with relative ease. Our tests demonstrate the robot's capabilities in traversing amphibious environments and its potential in using feedback control for more complex locomotion tasks.

Abstract:
Grasping is a fundamental capability for robots to interact with the physical world. Humans, equipped with two hands, autonomously select appropriate grasp strategies based on the shape, size, and weight of objects, enabling robust grasping and subsequent manipulation. In contrast, current robotic grasping remains limited, particularly in multi-strategy settings. Although substantial efforts have targeted parallel-gripper and single-hand grasping, dexterous grasping for bimanual robots remains underexplored, with data being a primary bottleneck. Achieving physically plausible and geometrically conforming grasps that can withstand external wrenches poses significant challenges. To address these issues, we introduce UltraDexGrasp, a framework for universal dexterous grasping with bimanual robots. The proposed data-generation pipeline integrates optimization-based grasp synthesis with planning-based demonstration generation, yielding high-quality and diverse trajectories across multiple grasp strategies. With this framework, we curate UltraDexGrasp-20M, a large-scale, multi-strategy grasp dataset comprising 20 million frames across 1,000 objects. Based on UltraDexGrasp-20M, we further develop a simple yet effective grasp policy that takes point clouds as input, aggregates scene features via unidirectional attention, and predicts control commands. Trained exclusively on synthetic data, the policy achieves robust zero-shot sim-to-real transfer and consistently succeeds on novel objects with varied shapes, sizes, and weights, attaining an average success rate of 81.2% in real-world universal dexterous grasping. To facilitate future research on grasping with bimanual robots, we open-source the data generation pipeline at https://github.com/InternRobotics/UltraDexGrasp.

Abstract:
Many industrial and commercial manipulators provide only position and velocity control interfaces, making direct regulation of contact forces challenging. In dual-arm manipulation, this limitation prevents stable force closure and consistent control of the object wrench. We present a control framework that combines contact-level admittance and object-level impedance to compute velocity commands for both arms. The contact admittance law maps force errors into velocity corrections, while the object impedance relation regulates the net wrench on the object. Together, these laws generate joint velocities through the stacked Jacobian, ensuring consistent integration of force and motion objectives. Contact compliance is explicitly modeled using linear spring-damper elements. The analysis of closed-loop error dynamics shows how the stiffness and damping parameters of the contact compliance influence the frequency response of the error dynamics and explains the origin of high-frequency oscillations in the presence of sensor noise. Experiments with a dual-arm setup with two heterogeneous velocity-controlled manipulators validate the framework. Results confirm accurate force regulation, disturbance rejection, and stable cooperative lifting under different contact padding conditions. The proposed approach establishes a velocity-based method for dual-arm force closure with contact compliance.
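
To make the contact-level admittance idea concrete, the sketch below simulates a single velocity-controlled axis pressing against a linear spring-damper contact. It is a simplified scalar illustration under assumed parameters, not the paper's dual-arm controller: the admittance gain, stiffness, damping, and time step are all hypothetical, and the object-level impedance and stacked Jacobian are omitted.

```python
def simulate_contact_admittance(f_desired, k_contact, d_contact,
                                admittance_gain, dt, steps):
    """Simulate force regulation through a velocity command.

    Contact model (assumed):  f = k_contact * x + d_contact * v,
    where x is the compression of the contact padding.
    Admittance law (assumed): v = admittance_gain * (f_desired - f_measured),
    using the force measured on the previous control cycle.
    Returns the list of contact forces over time.
    """
    x = 0.0           # compression of the contact [m]
    f_measured = 0.0  # last force-sensor reading [N]
    forces = []
    for _ in range(steps):
        # Map the force error into a velocity correction.
        v = admittance_gain * (f_desired - f_measured)
        # The spring-damper contact responds to position and velocity.
        f_measured = k_contact * x + d_contact * v
        forces.append(f_measured)
        # The velocity-controlled axis integrates the commanded velocity.
        x += v * dt
    return forces


# Illustrative parameters only (not from the paper).
forces = simulate_contact_admittance(
    f_desired=10.0,        # target contact force [N]
    k_contact=1000.0,      # contact stiffness [N/m]
    d_contact=10.0,        # contact damping [N s/m]
    admittance_gain=0.01,  # [m/(s N)]
    dt=0.001,              # control period [s]
    steps=2000,
)
```

In this toy setting, raising `admittance_gain` or `k_contact` speeds up convergence but pushes the discrete loop toward oscillation, which loosely mirrors the abstract's point that contact stiffness and damping shape the error dynamics.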

Abstract:
This work revisits two classical closed-loop inverse kinematics (CLIK) formulations for hierarchical control and investigates their differences in the context of articulated intervention-autonomous underwater vehicles (AIAUVs). The class of AIAUVs consists of free-floating, slender, multi-link vehicles with distributed thrusters and no distinct base, allowing the entire vehicle to be modeled and controlled as a manipulator. The concept of body-velocity sharing, a phenomenon where different tasks depend on overlapping body-frame motions, is introduced and formalized through the notion of body-sensitivity subspaces. Changing the location of the system's body frame is shown to directly affect both controllers' closed-loop performance, and it is shown that, due to body-velocity sharing, tasks for AIAUVs most often fall into an intermediate regime between orthogonal and strictly incompatible tasks, causing the two task-priority formulations to differ. The theory is validated through open-water field trials with the Eelume-M, a 6-meter-long AIAUV, comparing the two control laws. The experiments confirm the theoretical predictions: the projected-residual law improves secondary-task tracking but is more sensitive to algorithmic singularities, whereas the post-projection law remains robust to such singularities at the cost of reduced secondary-task performance. These results provide practical guidelines for selecting kinematic task-priority control laws and body-frame placement for AIAUVs.

Abstract:
Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available at https://github.com/MKong17/DGLSS-NL.git

Abstract:
Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Coordinate Encoding method (MACE), which enables efficient localization and high-quality rendering in large-scale scenes. Inspired by the remarkable capabilities of MoE in large model domains, we introduce a gating network to implicitly classify and select sub-networks, ensuring that only a single sub-network is activated during each inference. Furthermore, we present an Auxiliary-Loss-Free Load Balancing (ALF-LB) strategy to enhance the localization accuracy on large-scale scenes. Our framework provides a significant reduction in costs while maintaining higher precision, offering an efficient solution for large-scale scene applications. Additional experiments on the Cambridge test set demonstrate that our method achieves high-quality rendering results with merely 10 minutes of training.
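
The single-active-expert idea can be illustrated with a minimal top-1 routing sketch. It assumes a linear gating function and plain Python callables as stand-ins for sub-networks; none of the names or weights come from MACE, and the load-balancing strategy is not modeled.

```python
def dot(weights, features):
    """Inner product of two equal-length sequences."""
    return sum(w * f for w, f in zip(weights, features))


def select_expert(gate_weights, features):
    """Return the index of the expert with the highest gating score (top-1)."""
    scores = [dot(w, features) for w in gate_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def mixture_forward(gate_weights, experts, features):
    """Route the input to exactly one expert and evaluate only that expert."""
    index = select_expert(gate_weights, features)
    return index, experts[index](features)


# Hypothetical experts that record when they run, to show only one is activated.
call_log = []


def make_expert(name, scale):
    def expert(features):
        call_log.append(name)
        return [scale * f for f in features]
    return expert


experts = [make_expert("expert_0", 1.0), make_expert("expert_1", 10.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]  # one weight row per expert (illustrative)

chosen, output = mixture_forward(gate_weights, experts, [0.2, 0.9])
```

Since the unselected experts are never evaluated, inference cost stays close to that of a single sub-network regardless of how many experts exist.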

Abstract:
We present a new approach to the stair-climbing problem for robots that rely on actively articulated tracked arms. The robot in question is considered to have a main locomotion system, such as wheels or tracks, and arms that can be controlled to extend the robot's mobility when needed. Further, we also assume the robot is equipped with a depth sensor (for stair perception) and an IMU (solely for orientation estimates). The key feature of the proposed approach is to analyze the robot's differential kinematics as a planar manipulator with a position constraint. For the proposed model, we then present a state feedback control law with stability and convergence properties to move the arms, guiding the robot towards the stairs autonomously. The controller fits the tracks to the floor, making the robot perform appropriate maneuvers, like a snake climbing an inclined plane, preventing sudden movements, improving traction, and avoiding collisions with the floor. The presented method seems to be a novel way to interpret the problem. The proposed control scheme is validated on a real robot, and experimental results are presented.

Abstract:
Non-prehensile object manipulation skills are important for real-world robot interactions, enabling highly dynamic tasks such as balancing a glass on a tray or the controlled sliding of items on a table. Among such tasks, those characterised by high-speed manipulation requirements and general sensitivity of the resulting hybrid dynamics are particularly hard to accomplish. Within these, juggling can be seen as a highly challenging maneuver to be solved. The key to robotic juggling is achieving dynamic stabilisation of an underactuated object. Since the object does not possess the ability of self-correction, its stability is entirely dependent on the forces applied to it. This creates a system that is sensitive to control inputs, where timing is critical to continuously counteract deviations and maintain the desired behavior. We develop a systematic method to control a 7-degree-of-freedom manipulator performing non-prehensile ball juggling with a tool. Our primary contribution is a model-based framework for generating juggling trajectories and stabilizing a periodic juggling motion for this hybrid system. The framework incorporates a two-stage optimal control approach to compute the underlying feasible motion patterns required for stable juggling. Offline-computed trajectories are then organised to enable real-time error correction without solving optimal control problems online. We demonstrate the effectiveness of the resulting controller by first evaluating its performance in a simulation environment and performing an experiment using a Franka Emika Panda robot.

Abstract:
Autonomous robots deployed in mass casualty incidents (MCI) face the challenge of making critical decisions based on incomplete and noisy perceptual data. We present an autonomous robotic system for casualty assessment that fuses outputs from multiple vision-based algorithms, estimating signs of severe hemorrhage, visible trauma, or physical alertness, into a coherent triage assessment. At the core of our system is a Bayesian network, constructed from expert-defined rules, which enables probabilistic reasoning about a casualty's condition even with missing or conflicting sensory inputs. The system, evaluated during the DARPA Triage Challenge (DTC) in realistic MCI scenarios involving 11 and 9 casualties, demonstrated a nearly three-fold improvement in physiological assessment accuracy (from 15% to 42% and 19% to 46%) compared to a vision-only baseline. More importantly, overall triage accuracy increased from 14% to 53%, while the diagnostic coverage of the system expanded from 31% to 95% of cases. These results demonstrate that integrating expert-guided probabilistic reasoning with advanced vision-based sensing can significantly enhance the reliability and decision-making capabilities of autonomous systems in critical real-world applications.
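
As a minimal sketch of why a Bayesian network can still reason when some sensory inputs are missing or conflicting, the toy example below uses one hidden condition and two conditionally independent detectors. The detector names, the prior, and all probabilities are hypothetical and are not taken from the paper; the actual network is built from expert-defined rules that the abstract does not detail.

```python
def posterior(prior, likelihoods, observations):
    """Return P(condition = true | observed detectors).

    likelihoods[name] = (P(detector positive | condition true),
                         P(detector positive | condition false)).
    An observation of None means the detector produced no output.
    """
    p_true, p_false = prior, 1.0 - prior
    for name, obs in observations.items():
        if obs is None:
            # Missing input: summing over both detector outcomes gives a factor of 1.
            continue
        p_pos_true, p_pos_false = likelihoods[name]
        p_true *= p_pos_true if obs else 1.0 - p_pos_true
        p_false *= p_pos_false if obs else 1.0 - p_pos_false
    return p_true / (p_true + p_false)


# Hypothetical numbers, for illustration only.
PRIOR = 0.2
LIKELIHOODS = {
    "blood_detector": (0.8, 0.1),
    "motion_detector": (0.3, 0.7),
}

only_blood = posterior(PRIOR, LIKELIHOODS, {"blood_detector": True, "motion_detector": None})
conflicting = posterior(PRIOR, LIKELIHOODS, {"blood_detector": True, "motion_detector": True})
```

With no observations the function returns the prior, and each additional detector output shifts the estimate instead of being required for an answer.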

Abstract:
Aerial robots interacting with objects must perform precise, contact-rich maneuvers under uncertainty. In this paper, we study the problem of aerial ball juggling using a quadrotor equipped with a racket, a task that demands accurate timing, stable control, and continuous adaptation. We propose JuggleRL, the first reinforcement learning-based system for aerial juggling. It learns closed-loop policies in large-scale simulation using systematic calibration of quadrotor and ball dynamics to reduce the sim-to-real gap. The training incorporates reward shaping to encourage racket-centered hits and sustained juggling, as well as domain randomization over ball position and coefficient of restitution to enhance robustness and transferability. The learned policy outputs mid-level commands executed by a low-level controller and is deployed zero-shot on real hardware, where an enhanced perception module with a lightweight communication protocol reduces delays in high-frequency state estimation and ensures real-time control. Experiments show that JuggleRL achieves an average of 311 hits over 10 consecutive trials in the real world, with a maximum of 462 hits observed, far exceeding a model-based baseline that reaches at most 14 hits with an average of 3.1. Moreover, the policy generalizes to unseen conditions, successfully juggling a lighter 5 g ball with an average of 145.9 hits. This work demonstrates that reinforcement learning can empower aerial robots with robust and stable control in dynamic interaction tasks.
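
The domain randomization over ball position and coefficient of restitution mentioned above could look roughly like the sketch below. The sampling ranges, the function names, and the one-dimensional impact model (a racket assumed much heavier than the ball, so its velocity is unchanged by the hit) are illustrative assumptions, not the authors' simulator settings.

```python
import random


def sample_episode_params(rng, e_range=(0.75, 0.90), pos_noise_m=0.05):
    """Draw one randomized episode configuration (ranges are hypothetical)."""
    return {
        "restitution": rng.uniform(*e_range),
        "ball_xy_offset_m": (
            rng.uniform(-pos_noise_m, pos_noise_m),
            rng.uniform(-pos_noise_m, pos_noise_m),
        ),
    }


def rebound_velocity(v_ball_in, v_racket, restitution):
    """Ball velocity along the racket normal after impact.

    The relative velocity reverses sign and is scaled by the coefficient
    of restitution; the racket velocity is assumed unaffected.
    """
    return v_racket - restitution * (v_ball_in - v_racket)


rng = random.Random(0)
params = sample_episode_params(rng)
v_out = rebound_velocity(v_ball_in=-3.0, v_racket=0.0, restitution=params["restitution"])
```

Sampling a new configuration per episode exposes the policy to a spread of bounce behaviors, which is the usual motivation for this kind of randomization.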

Abstract:
Opening sterile medical packaging is routine for healthcare workers but remains challenging for robots. Learning from demonstration enables robots to acquire manipulation skills directly from humans, and handheld gripper tools such as the Universal Manipulation Interface (UMI) offer a pathway for efficient data collection. However, the effectiveness of these tools depends heavily on their usability. We evaluated UMI in demonstrating a bandage opening task, a common manipulation task in hospital settings, by testing three conditions: distributed load grippers, concentrated load grippers, and bare hands. Eight participants performed timed trials, with task performance assessed by success rate, completion time, and damage, alongside perceived workload using the NASA-TLX questionnaire. Concentrated load grippers improved performance relative to distributed load grippers but remained substantially slower and less effective than hands. These results underscore the importance of ergonomic and mechanical refinements in handheld grippers to reduce user burden and improve demonstration quality, especially for applications in healthcare robotics.

Abstract:
We introduce a novel framework for automatic behavior tree (BT) construction in heterogeneous multi-robot systems, designed to address the challenges of adaptability and robustness in dynamic environments. Traditional robots are limited by fixed functional attributes and cannot efficiently reconfigure their strategies in response to task failures or environmental changes. To overcome this limitation, we leverage large language models (LLMs) to generate and extend BTs dynamically, combining the reasoning and generalization power of LLMs with the modularity and recovery capability of BTs. The proposed framework consists of four interconnected modules (task initialization, task assignment, BT update, and failure node detection), which operate in a closed loop. Robots tick their BTs during execution, and upon encountering a failure node, they can either extend the tree locally or invoke a centralized virtual coordinator (Alex) to reassign subtasks and synchronize BTs across peers. This design enables long-term cooperative execution in heterogeneous teams. We validate the framework on 60 tasks across three simulated scenarios and in a real-world café environment with a robotic arm and a wheeled-legged robot. Results show that our method consistently outperforms baseline approaches in task success rate, robustness, and scalability, demonstrating its effectiveness for multi-robot collaboration in complex scenarios.

Abstract:
Imitation learning from human demonstrations offers a promising approach for robot skill acquisition, but egocentric human data introduces fundamental challenges due to the embodiment gap. During manipulation, humans actively coordinate head and hand movements, continuously reposition their viewpoint, and use pre-action visual search strategies to locate task-relevant objects. These behaviors create dynamic, task-driven head motions that static robot sensing systems cannot replicate, leading to a significant distribution shift that degrades policy performance. We present EgoMI (Egocentric Manipulation Interface), a framework that captures synchronized end-effector and active head trajectories during manipulation tasks, resulting in data that can be retargeted to compatible semi-humanoid robot embodiments. To handle rapid and wide-spanning head viewpoint changes, we introduce a memory-augmented policy that selectively incorporates context from historical observations. We evaluate our approach on a bimanual robot equipped with an actuated camera head and find that policies with explicit head-motion modeling consistently outperform baseline methods. Results suggest that coordinated hand-eye learning with EgoMI effectively bridges the human-robot embodiment gap for robust imitation learning on semi-humanoid embodiments. Project page: https://egocentric-manipulation-interface.github.io

Abstract:
Whole-body manipulation (WBM) for humanoid robots presents a promising approach for executing embracing tasks involving bulky objects, where traditional grasping that relies only on end-effectors remains limited by inherent stability and payload constraints. This paper introduces a reinforcement learning framework that integrates a pre-trained human motion prior with a neural signed distance field (NSDF) representation to achieve robust whole-body embracing. Our method leverages a teacher-student architecture to distill large-scale human motion data, generating kinematically natural and physically feasible whole-body motion patterns. This facilitates coordinated control across the arms and torso, enabling stable multi-contact interactions that enhance both manipulation robustness and load capacity. The embedded NSDF further provides accurate and continuous geometric perception, improving contact awareness throughout long-horizon tasks. We thoroughly evaluate the approach through comprehensive simulations and real-world experiments. The results demonstrate improved adaptability to objects of diverse shapes and sizes, as well as successful sim-to-real transfer. These results indicate that the proposed framework offers an effective and practical solution for multi-contact and long-horizon WBM tasks of humanoid robots. The open-source project can be found at https://github.com/Chunx1nZHENG/Embracing-Bulky-Objects-with-Humanoid-Robots.

Abstract:
Small-scale terrestrial robots have a number of applications where operation in confined spaces is required. Because of their low mass (less than five grams) and small size (less than five centimeters), their mechanical design requires careful analysis of multiple subsystems (e.g., actuation, power, fabrication, and assembly). Planar electromagnetic actuators show linear force-displacement behavior, large displacements, and low-voltage operation. Here, we integrate these actuators into the Cornell Micro Terrestrial Robot (COMT), a 1.9 g quadrupedal robot that uses a simplified fabrication strategy for the transmissions that takes advantage of the large displacement. Each leg is fabricated using laminate-based techniques, but requires only a single manual fold-and-lock step. The robot (BL = 3 cm) achieves speeds up to 4.36 BL/s and consumes approximately 300 mA during operation. These results provide a path towards untethered terrestrial robots that can navigate in confined spaces and enable future collectives through simplified manufacturing strategies.

Abstract:
Ultrasound (US)-guided needle insertion is a critical yet challenging procedure due to dynamic imaging conditions and difficulties in needle visualization. Many methods have been proposed for automated needle insertion, but they often rely on hand-crafted pipelines with modular controllers, whose performance degrades in challenging cases. In this paper, a Vision-Language-Action (VLA) model is proposed for adaptive and automated US-guided needle insertion and tracking on a robotic ultrasound (RUS) system. This framework provides a unified approach to needle tracking and needle insertion control, enabling real-time, dynamically adaptive adjustment of insertion based on the obtained needle position and environment awareness. To achieve real-time and end-to-end tracking, a Cross-Depth Fusion (CDF) tracking head is proposed, integrating shallow positional and deep semantic features from the large-scale vision backbone. To adapt the pretrained vision backbone for tracking tasks, a Tracking-Conditioning (TraCon) register is introduced for parameter-efficient feature conditioning. After needle tracking, an uncertainty-aware control policy and an asynchronous VLA pipeline are presented for adaptive needle insertion control, ensuring timely decision-making for improved safety and outcomes. Extensive experiments on both needle tracking and insertion show that our method consistently outperforms state-of-the-art trackers and manual operation, achieving higher tracking accuracy, improved insertion success rates, and reduced procedure time, highlighting promising directions for RUS-based intelligent intervention.

Abstract:
Humanoid robots have demonstrated impressive motor skills in a wide range of tasks, yet whole-body control for human-like, long-duration dynamic fighting remains particularly challenging due to the stringent requirements on agility and stability. While imitation learning enables robots to execute human-like fighting skills, existing approaches often rely on switching among multiple single-skill policies or employing a general policy to imitate input reference motions. These strategies suffer from instability when transitioning between skills, as the mismatch of initial and terminal states across skills or reference motions introduces out-of-domain disturbances, resulting in unsmooth or unstable behaviors. In this work, we propose RPG, a hybrid expert policy framework for smooth and stable humanoid multi-skill transitions. Our approach incorporates motion transition randomization and temporal randomization to train a unified policy that generates agile fighting actions with stability and smoothness during skill transitions. Furthermore, we design a control pipeline that integrates walking/running locomotion with fighting skills, allowing human-like combat of arbitrary duration in which the robot can be seamlessly interrupted or switch action policies at any time. Extensive experiments in simulation demonstrate the effectiveness of the proposed framework, and real-world deployment on the Unitree G1 humanoid robot further validates its robustness and applicability.

Abstract:
Visual Place Recognition (VPR) is a key technology in autonomous driving, robotics, and augmented reality, requiring efficient and robust localization in large-scale environments. However, most existing methods rely on heavy deep models that are computationally expensive and difficult to deploy on edge devices, limiting their practical use. While model compression techniques such as compact model fine-tuning and traditional knowledge distillation have shown some promise, they often fall short in visual retrieval tasks. Inspired by the teaching principle that emphasizes both reinforcing correct knowledge and correcting errors, we propose an online positive-negative sample contrastive distillation framework. This approach enables the student model to learn more discriminative features by simultaneously distilling the relationships among positive and negative samples. We also design a cross-attention based feature alignment operator to better align intermediate feature representations between teacher and student models after feature extraction, improving feature consistency and distillation efficiency. Experimental results demonstrate that our method achieves a favorable trade-off between accuracy and efficiency on multiple visual localization benchmarks, significantly outperforming existing lightweight approaches in several scenarios. These advantages make it well-suited for deployment on resource-constrained edge devices.

Abstract:
Unmanned aerial vehicles (UAVs) require accurate odometry, i.e., estimating the position and velocity of the vehicle over time, as well as high-resolution sensing to safely and effectively operate in complex environments. Traditionally, GPS, cameras, and/or lidar sensors have been used to perform these functions. However, GPS can be jammed in contested environments while cameras and lidars fail in visually degraded conditions, limiting UAV operations in these scenarios. In this work, we present UAV-SAR, a unified architecture that utilizes mmWave radars to simultaneously achieve precise odometry measurements and perform high-resolution synthetic-array sensing. Here, UAV-SAR measures a UAV's altitude and velocity from downward- and outward-facing radars and fuses these measurements within a commercially available flight controller to produce accurate odometry estimates. These odometry estimates are then used to dynamically construct synthetic arrays by coherently integrating multiple radar frames together over a duration of 0.5 s, improving the angular resolution by an order of magnitude compared to the physical array alone. Finally, a lightweight deep learning model is utilized to convert high-resolution range-angle responses into 2D point clouds suitable for downstream perception tasks. UAV-SAR is validated on a custom UAV prototype where it is integrated with ROS2 and the PX4 autopilot to demonstrate stable flight, reliable odometry, and high-resolution radar sensing in indoor environments.

Abstract:
Cable-driven serial manipulators (CDSMs) offer a lightweight structure, high flexibility, and inherent safety, making them suitable for operations in constrained spaces. However, interaction with the environment is inevitable. To address this challenge, we propose CableSense, a novel force-sensing approach that leverages actuation cable tension information exclusively, thereby eliminating the requirement for additional contact sensors. We first develop a high-fidelity MuJoCo simulation model based on the physical system, reducing the sim-to-real gap through careful calibration of physical and mechanical parameters. Leveraging this simulation model, we generate a comprehensive dataset encompassing diverse external force scenarios. We then implement a multi-task deep learning framework, CableSense, for both single-point and multi-point force identification. Experiments demonstrate that CableSense achieves over 98% accuracy in contact location estimation while maintaining a mean absolute direction error of 5.96°.

Abstract:
Recent works have shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results. Decision Transformers (DTs), in particular, have performed well in offline RL by leveraging sequence modeling. However, standard DT methods rely on return-to-go (RTG) tokens, which are heuristically defined and often suboptimal for goal-conditioned tasks. In this work, we introduce Quasimetric Decision Transformer (QuaD), a novel approach that replaces RTG with learned quasimetric distances, providing a more structured and theoretically grounded guidance signal for long-horizon decision-making. We explore two quasimetric formulations, interval quasimetric embeddings (IQE) and metric residual networks (MRN), and integrate them into DTs. Extensive evaluations on the AntMaze benchmark demonstrate that QuaD outperforms standard Decision Transformers, achieving state-of-the-art success rates and improved generalization to unseen goals. Our results suggest that quasimetric guidance is a viable alternative to RTG, opening new directions for learning structured distance representations in offline RL.

Abstract:
Post-training algorithms based on deep reinforcement learning can push the limits of robotic models for specific objectives, such as generalizability, accuracy, and robustness. However, Intervention-requiring Failures (IR Failures) (e.g., a robot spilling water or breaking fragile glass) during real-world exploration happen inevitably, hindering the practical deployment of such a paradigm. To tackle this, we introduce Failure-Aware Offline-to-Online Reinforcement Learning (FARL), a framework for minimizing failures during real-world reinforcement learning. We create FailureBench, a benchmark that incorporates common failure scenarios requiring human intervention, and propose an algorithm that integrates a world-model-based safety critic and a recovery policy trained offline to prevent failures during online exploration. Extensive simulation and real-world experiments demonstrate the effectiveness of FARL in significantly reducing IR Failures while improving performance and generalization during online reinforcement learning post-training. FARL reduces IR Failures by 73.1% while elevating performance by 11.3% on average during real-world RL post-training.

Abstract:
Planning long-duration robotic manipulation sequences is challenging because of the complexity of exploring feasible trajectories through nonlinear contact dynamics and many contact modes. Moreover, this complexity grows with the problem's horizon length. We propose a search tree method that generates trajectories using the spectral decomposition of the inverse dynamics equation. This equation maps actuator displacement to object displacement, and its spectrum is efficient for exploration because its components are orthogonal and approximate the reachable set of the object while remaining dynamically feasible. These trajectories can be combined with any search-based method, such as Rapidly-Exploring Random Trees (RRT), for long-horizon planning. Our method performs similarly to recent work in model-based planning for short-horizon tasks, and differentiates itself with its ability to solve long-horizon tasks: whereas existing methods fail, ours can generate plans of 45 seconds in duration with 10+ contact modes using 15 seconds of computation, demonstrating real-time capability in highly complex domains.

Abstract:
For multi-robot systems operating in dynamic environments, collision-free segregation into a desired set of groups in finite time is an essential task requirement in many applications. This work presents a control framework for such systems, utilizing Finite-time Model Predictive Control. The objective is to guide the robots toward a segregated formation while adhering to leader-follower dynamics and effectively avoiding collisions. To ensure finite-time convergence, the concept of a control invariant set is incorporated. Furthermore, the paper derives an upper bound on the required time steps for the robots to achieve the segregated formation. In order to maintain a smooth motion profile in the face of external state perturbations, this work proposes a data-driven Chernoff bound-based triggering method that enables Asynchronous Motion Smoothing for the robots. To validate the effectiveness of the proposed control framework, both simulations and hardware experiments are conducted, focusing on the segregation of five robots into two distinct groups.

Abstract:
Occupancy prediction provides critical geometric and semantic understanding for robotics but faces efficiency-accuracy trade-offs. Current dense methods waste computation on empty voxels, while sparse query-based approaches lack robustness in diverse and complex indoor scenes. In this paper, we propose DiScene, a novel sparse query-based framework that leverages multi-level distillation to achieve efficient and robust occupancy prediction. In particular, our method incorporates two key innovations: (1) a Multi-level Consistent Knowledge Distillation strategy, which transfers hierarchical representations from large teacher models to lightweight students through coordinated alignment across four levels, including encoder-level feature alignment, query-level feature matching, prior-level spatial guidance, and anchor-level high-confidence knowledge transfer; and (2) a Teacher-Guided Initialization policy, employing optimized parameter warm-up to accelerate model convergence. Validated on the Occ-Scannet benchmark, DiScene achieves 23.2 FPS without depth priors while outperforming our baseline method, OPUS, by 36.1% and even surpassing the depth-enhanced version, OPUS†. With depth integration, DiScene† attains new SOTA performance, surpassing EmbodiedOcc by 3.7% with 1.62× faster inference speed. Furthermore, experiments on the Occ3D-nuScenes benchmark and in-the-wild scenarios demonstrate the versatility of our approach in various environments. Code and models can be accessed at https://github.com/getterupper/DiScene.

Abstract:
Deep Reinforcement Learning (DRL) controllers for quadrupedal locomotion have demonstrated impressive performance on challenging terrains, allowing robots to execute complex skills such as climbing, running, and jumping. However, existing blind locomotion controllers often struggle to ensure safe and efficient traversal through risky gap terrains, which are typically highly complex and require robots to accurately perceive terrain information and select appropriate footholds during locomotion. Meanwhile, existing perception-based controllers still present several practical limitations, including a complex multi-sensor deployment system and expensive computing resource requirements. This paper proposes a DRL controller named MAstering Risky Gap Terrains (MARG), which integrates terrain maps and proprioception to dynamically adjust the action and enhance the robot's stability in these tasks. During the training phase, our controller accelerates policy optimization by selectively incorporating privileged information (e.g., center of mass, friction coefficients) that is available in simulation but cannot be measured directly in real-world deployments due to sensor limitations. We also design three foot-related rewards to encourage the robot to explore safe footholds. More importantly, a terrain map generation (TMG) model is proposed to reduce mapping drift and provide accurate terrain maps using only one LiDAR, providing a foundation for zero-shot transfer of the learned policy. The experimental results indicate that MARG maintains stability in various risky terrain tasks.

Abstract:
Remembering where object segments were predicted in the past is useful for improving the accuracy and consistency of class-agnostic video segmentation algorithms. Existing video segmentation algorithms typically use either no object-level memory (e.g. FastSAM) or they use implicit memories in the form of recurrent neural network features (e.g. SAM2). In this paper, we augment both types of segmentation models using an explicit 3D memory and show that the resulting models have more accurate and consistent predictions. For this, we develop an online 3D Gaussian Splatting (3DGS) technique to store predicted object-level segments generated throughout the duration of a video. Based on this 3DGS representation, a set of fusion techniques are developed, named FastSAM-Splat and SAM2-Splat, that use the explicit 3DGS memory to improve their respective foundation models' predictions. Ablation experiments are used to validate the proposed techniques' design and hyperparameter settings. Results from both real-world and simulated benchmarking experiments show that models which use explicit 3D memories result in more accurate and consistent predictions than those which use no memory or only implicit neural network memories.

Abstract:
Magnetically actuated soft continuum robots (MSCRs), which offer remote and wireless control via external magnetic fields along with high flexibility, have recently emerged as a promising technology for minimally invasive surgery (MIS). However, the magnetic actuation forces of MSCRs are generally limited, resulting in inherent workspace constraints. To overcome these limitations, various design strategies have been explored, including the development of an asymmetric magnetized soft continuum robot (AMSCR). Although AMSCRs have demonstrated a significantly larger workspace than conventional MSCRs, a quantitative relationship between the magnetization patterns of embedded magnetic particles and the resulting workspace has not yet been fully clarified. In this study, an energy-based kinematic analysis of AMSCR was conducted to address this issue. Specifically, the equilibrium posture of the AMSCR was determined by minimizing the total potential energy, considering different combinations of external magnetic field directions and internal magnetization patterns. Based on the resulting potential energy graph, the workspace of the AMSCR was quantitatively analyzed, and an optimal linear asymmetric magnetization pattern was identified. Furthermore, the proposed energy-based kinematic model was validated through finite element analysis (FEA) conducted using COMSOL Multiphysics, as well as through experiments performed on a fabricated AMSCR prototype. As a result, an optimal magnetization design method for linearly asymmetric AMSCRs was proposed and experimentally confirmed. The proposed approach is expected to be further applicable to the kinematic performance evaluation and design optimization of AMSCRs with various other magnetization patterns.

Abstract:
In this paper, we propose a method to align and place a fabric piece on top of another using a dual-arm manipulator and a grayscale camera, so that their surface textures are accurately matched. We propose a novel control scheme that combines Transformer-driven visual servoing with dual-arm impedance control. This approach enables the system to simultaneously control the pose of the fabric piece and place it onto the underlying one while applying tension to keep the fabric piece flat. Our transformer-based network incorporates pre-trained backbones and a newly introduced Difference Extraction Attention Module (DEAM), which significantly enhances pose difference prediction accuracy. Trained entirely on synthetic images generated using rendering software, the network enables zero-shot deployment in real-world scenarios without requiring prior training on specific fabric textures. Real-world experiments demonstrate that the proposed system accurately aligns fabric pieces with different textures.

Abstract:
In factory distribution processes, autonomous mobile robots must dock precisely at base stations. However, this task is challenging due to the dynamic and unstructured nature of factory environments, as well as the sparse point clouds caused by sensor occlusions and distance limitations. To address these challenges, we propose a geometric registration approach designed to handle sparse point clouds in changing, unstructured settings. Our method utilizes the Hough transform to detect lines, describes the point cloud based on the relationships between these lines, filters out lines that do not correspond to the geometric features of the target base station, and estimates the pose of both the station and the robot using global registration techniques. We evaluated our system in four typical factory scenarios across 72 trials. Results show the robot achieved docking accuracy within ±5.06 mm and ±1.11°, with a 100% success rate in docking and correctly identifying the target cart from surrounding objects. This represents a 70% reduction in errors and an 86% increase in success rate compared to existing methods.
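
The line-detection step can be illustrated with a minimal Hough transform over a sparse 2D point set. This is a generic sketch, not the authors' pipeline: the function name, the bin resolutions, and the sample points are assumptions, and the later stages (line-relationship descriptors, filtering, global registration) are not shown.

```python
import numpy as np


def hough_lines(points, n_theta=180, rho_res=0.05, top_k=1):
    """Vote in (theta, rho) space for lines x*cos(theta) + y*sin(theta) = rho.

    Returns the top_k accumulator cells as (theta, rho, votes) tuples.
    """
    points = np.asarray(points, dtype=float)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    # Signed distance rho for every (point, theta) pair, shape (N, n_theta).
    rhos = points[:, 0:1] * np.cos(thetas) + points[:, 1:2] * np.sin(thetas)
    rho_max = np.abs(rhos).max() + rho_res
    n_rho = int(np.ceil(2 * rho_max / rho_res)) + 1
    rho_idx = np.round((rhos + rho_max) / rho_res).astype(int)
    theta_idx = np.broadcast_to(np.arange(n_theta), rho_idx.shape)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    # Unbuffered accumulation so repeated cells are counted correctly.
    np.add.at(acc, (theta_idx.ravel(), rho_idx.ravel()), 1)
    best = np.argsort(acc.ravel())[::-1][:top_k]
    t_i, r_i = np.unravel_index(best, acc.shape)
    return [(thetas[t], r * rho_res - rho_max, int(acc[t, r])) for t, r in zip(t_i, r_i)]


# Hypothetical scan: 21 points on the line y = 1 plus two outliers.
scan = [(x, 1.0) for x in np.linspace(0.0, 2.0, 21)] + [(0.3, 2.5), (1.7, -0.4)]
theta, rho, votes = hough_lines(scan)[0]
```

Because each point votes independently, the dominant line still collects the most votes when the cloud is sparse or contains outliers, which is one reason the transform suits this setting.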

Abstract:
Continuum robots are widely employed in spatially constrained environments with narrow passages. However, achieving general curves with non-constant curvature remains challenging, as existing systems typically rely on multiple flexible segments arranged in series, coupled with complex drive systems requiring numerous actuators. This paper proposes a novel continuum robot design that features a programmable wire layout capable of simulating desired curves. The system integrates modular joints and a single-actuator drive unit, enabling the generation of spatial curves with non-constant curvature. By strategically designing the arrangement of modular joints to control rotational direction and angular deflection at each joint, the system achieves a substantially expanded design workspace compared to conventional continuum robots. Simulation and prototype experiments validate the proposed design method. The relative mean distance between the simulated and desired curves remains below 3.12%, while the prototype demonstrates a relative mean distance of 6.67% from the desired curve. This approach offers a promising pathway to advance continuum robots by improving configurational adaptability while simultaneously achieving complex curve generation and reduced drive system complexity.

Abstract:
End-to-end navigation via deep reinforcement learning has become a key approach for vision-based tasks. However, the sim-to-real gap remains a challenge, especially for aerial robots, where policies trained in simulation often fail in real-world environments. In this work, we propose a novel navigation paradigm, volumetric depth-based safe navigation (VDS-Nav), which trains a policy to infer linear velocities and yaw rate directly from a sequence of depth images, bypassing the need for a pre-trained latent space encoder. We enhance safety with a depth-based reward design, enabling the seamless incorporation of system constraints via logarithmic barrier function methods. Most importantly, using explicit sensor information in our reward design leads to seamless sim-to-real transfer by strengthening the correlation between state-action pairs and received rewards. To evaluate the effectiveness of VDS-Nav, we compare it to a baseline that first trains a variational autoencoder to encode depth images into a latent space for policy training. The simulation results show that VDS-Nav outperforms the baseline in terms of success rate. Furthermore, real-world experiments validate the policy, with real-time performance closely matching simulation results, suggesting an effective sim-to-real transfer.
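
A depth-based reward term built on a logarithmic barrier might take a form like the sketch below. The abstract does not give the actual reward, so the function name, the safety distance, the sensing range, the weight, and the clipping floor are all hypothetical choices made for illustration.

```python
import math


def barrier_reward(min_depth_m, d_safe_m=0.5, d_max_m=5.0, weight=1.0, floor=-10.0):
    """Log-barrier-style safety term computed from the closest depth reading.

    The value is 0 at or beyond d_max_m, decreases as the closest obstacle
    approaches d_safe_m, and is clipped to `floor` at or inside d_safe_m.
    """
    margin = min_depth_m - d_safe_m
    if margin <= 0.0:
        return floor  # constraint violated: use a finite penalty instead of -inf
    scale = d_max_m - d_safe_m
    return max(floor, weight * math.log(min(margin / scale, 1.0)))


# Example: closest obstacle at 1 m in a hypothetical depth frame.
r_near = barrier_reward(1.0)
r_far = barrier_reward(5.0)
```

Since the term depends only on measured depth, the same expression can be evaluated in simulation and on the real sensor, which matches the abstract's point about using explicit sensor information in the reward.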

Abstract:
Complex embodied systems, whether biological or robotic, must continuously generate goal-directed behaviors while preserving coherence between motor intention and physical feasibility. In parallel robots, this link between intention and mechanics becomes particularly challenging due to their nonlinear, over-constrained kinematics and the absence of intuitive motor primitives. This letter introduces a passive motion paradigm for parallel robots using self-supervised physics-informed neural networks, which reformulates motion generation as the dynamic unfolding of motor primitives driven by attractor fields in actuator space. Unlike traditional forward or optimization-based formulations, the framework integrates analytical kinematics with neural fields to ensure both physical consistency and adaptive motion generation. The method estimates the Jacobian matrix as a physically constrained neural field, merging analytical structure with data-driven learning to achieve robust and interpretable behavior without relying on iterative numerical solvers. Theoretical analysis, simulations, and physical experiments demonstrate the framework's accuracy, stability, and adaptability across different parallel mechanisms.

Abstract:
Oral and maxillofacial surgery (OMS) imposes an increasing workload on even the most experienced surgeons due to long operation time, high skill requirements, limited observation field, constrained workspace, and fast-growing patient population. Robot-assisted OMS is particularly challenging, requiring technological advancements to replicate complex surgical workflows executed by human surgeons and novel working concepts to properly address human-machine relationships. We introduce a Surgeon Supervised Autonomous Surgical System (SSASS) aiming to solve emerging bottlenecks in OMS. SSASS comprises a custom-developed deep-learning-assisted virtual planning module, a teeth-based monocular camera navigation module, and a six-degree-of-freedom compact robot module, which function as the surgeon's auxiliary brain, eye, and hand, respectively. These three modules are seamlessly integrated to autonomously complete the most labor-intensive procedures, while leaving surgeons to supervise and remain responsible for the overall procedure. Le Fort I experiments on five human head models demonstrated that the surgical results of SSASS closely matched the preoperative plan, with high drilling accuracy and acceptable cutting accuracy under a significantly simplified surgical workflow. SSASS integrates deep learning, medical 3D printing, markerless navigation, virtual reality, and collaborative robotics, providing a comprehensive surgical solution encompassing the entire OMS loop.

Abstract:
Electric arc noise around energized power lines corrupts drone LiDAR measurements, accumulating in occupancy grids and producing spurious obstacles that degrade navigation reliability. Existing filters designed for environmental clutter such as snow, dust, and rain fail to consistently reject these short-lived arc transients and remain difficult to deploy on resource-limited platforms. We propose a dual-structure filtering framework that dynamically separates transient arc noise from persistent environmental features. Instead of filtering scan-by-scan, the proposed filter leverages spatio-temporal neighborhood consistency across consecutive LiDAR frames to suppress short-duration particles. A transient k-d tree accelerates neighborhood queries and removes arc noise around valid structures, while a persistent octree integrates only enduring features into the global map. Experiments show up to 10 times faster filtering and mapping precision of 92.27% with F1-scores up to 95%. Real-world inspection flights over energized power lines confirm that the approach maintains accurate, up-to-date maps and robust performance in the presence of electric arc noise.

Abstract:
Visuomotor policies trained on human expert demonstrations have recently shown strong performance across a wide range of robotic manipulation tasks. However, these policies remain highly sensitive to domain shifts stemming from background or robot embodiment changes, which limits their generalization capabilities. In this paper, we present ARRO, a novel visual representation that leverages zero-shot open-vocabulary segmentation and object detection models to efficiently mask out task-irrelevant regions of the scene in real time without requiring additional training, modeling of the setup, or camera calibration. By filtering visual distractors and overlaying virtual guides during both training and inference, ARRO improves robustness to scene variations and reduces the need for additional data collection. We extensively evaluate ARRO with Diffusion Policy on a range of tabletop manipulation tasks in both simulation and real-world environments, and further demonstrate its compatibility and effectiveness with generalist robot policies, such as Octo, OpenVLA, and Pi0. Across all settings in our evaluation, ARRO yields consistent performance gains, allows for selective masking to choose between different objects, and shows robustness even to challenging segmentation conditions. Videos showcasing our results are available at: https://augmented-reality-for-robots.github.io/

Abstract:
Drawing inspiration from living nature for soft robots enables scientists to develop bioinspired systems with more efficient motion and control schemes than classical robotic systems. Because of the inherent compliance of their bodies, which are made from flexible materials, soft robots are ideal for human-machine interaction. With novel electronics-free pneumatic logic control systems, these robots can be built entirely soft and cope with changing or extreme environments in which classical electronic robots would fail. Such electronics-free control systems allow the control to be integrated directly into the body of the soft robot. What is still widely lacking are feedback systems that enable the robot to change its behavior in response to environmental cues. In this study, we highlight an advanced integrated pneumatic control system that, coupled with a soft pneumatic sensor, is able to change the walking behavior of our turtle walker. Our novel bioinspired soft robot, with an integrated pneumatic logic control system capable of computing sensory inputs, marks the first step towards integrating autonomy into electronics-free soft robots.

Abstract:
Side-scan sonar (SSS) is a particularly attractive sensing modality for underwater simultaneous localization and mapping (SLAM), offering wide-area seabed coverage and reliable acoustic measurements over long ranges. Many existing SSS-based SLAM approaches rely on image-domain processing, which depends on sufficiently rich image features and can struggle in feature-poor or homogeneous seabed environments. Furthermore, even when image-domain features are present, range-dependent intensity variations and speckle noise inherent in SSS measurements can degrade the reliability of feature extraction and data association. This study proposes a ping-level SSS SLAM framework that directly exploits raw backscatter intensity profiles without relying on image formation. By characterizing the nominal seafloor response and identifying structurally salient deviations in the acoustic intensity profiles, reliable landmark measurements are extracted at the ping level and incorporated into a landmark-based SLAM framework. This formulation preserves the native sensing geometry of SSS measurements and enables robust landmark extraction even in feature-sparse environments. The proposed approach is validated through real-world field experiments, demonstrating improved robustness and localization accuracy in challenging seabed conditions.

Abstract:
The automation of Deformable Linear Object (DLO) manipulation remains a key challenge in industrial production. While prior works demonstrated reliable wire terminal insertion using vision and tactile sensing, they typically assume a fixed connector pose. This paper presents a dual-arm robotic system for fully autonomous connector assembly. A stereo vision system enables robust 6D pose estimation of the wire terminal, while a custom mechatronic gripper with integrated tactile sensing supports accurate insertion monitoring. In parallel, the second arm performs connector grasping. By combining complementary visual and tactile feedback across both manipulators, the system achieves the precision required for tight-tolerance insertion without fixed fixtures.

Abstract:
The escalating demand for precision in maritime missions has led to the development of collaborative heterogeneous multi-robot systems, specifically pairing Unmanned Surface Vehicles (USVs) with Autonomous Underwater Vehicles (AUVs). Autonomous docking is essential for mission persistence, allowing AUVs to use USVs for recharging and data offloading, yet achieving reliable docking is difficult because these underactuated platforms are highly susceptible to wind and current disturbances. This paper introduces a specialized simulation framework utilizing a MATLAB-based Graphical User Interface (GUI) and 6-DOF equations of motion to evaluate docking success rates in real time by analyzing measured environmental vectors. Through a scoring framework incorporating the Continuous Ranked Probability Score (CRPS), the system identifies optimal docking headings where environmental forces are minimized or exhibit a force-offsetting effect. To ensure kinematic feasibility, the trajectory planning logic integrates the minimum turning radii of the USV and AUV, while temporal synchronization is maintained via Estimated Time of Arrival (ETA) calculations at each waypoint. The proposed algorithm was implemented in C++ within the ROS2 framework and validated through stationary and collaborative docking scenarios under stochastic loads. Experimental results confirm that aligning the docking axis with optimized directions allows for stable docking performance.
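
For readers unfamiliar with the CRPS mentioned above, the following is a minimal sketch of the standard empirical (ensemble) form of the score, CRPS = E|X - y| - 0.5 E|X - X'|. The function name `crps_ensemble` and its arguments are illustrative and not taken from the paper, and how the score enters the authors' docking-heading scoring framework is not specified in the abstract.

```python
from typing import Sequence


def crps_ensemble(samples: Sequence[float], observation: float) -> float:
    """Empirical CRPS of an ensemble forecast against a single observation.

    Uses the standard identity CRPS = E|X - y| - 0.5 * E|X - X'|, where X and X'
    are independent draws from the forecast ensemble. Lower values are better.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one forecast sample")
    # Mean absolute error between ensemble members and the observation.
    term_obs = sum(abs(x - observation) for x in samples) / n
    # Mean absolute difference between all pairs of ensemble members.
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread
```

With a single-member ensemble the score reduces to the absolute error, which is a convenient sanity check.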

Abstract:
This paper presents the experimental validation of a digital twin model for the Cyber Physics Operation System (CPOS) ROVER, a tracked underwater robotic platform designed for seabed operations. Based on a high-fidelity multibody dynamics model, the digital twin incorporates track-ground interactions and external underwater forces to simulate locomotion under ground-contact conditions. To evaluate the accuracy of the model, a series of water-tank experiments was conducted. The robot performed a round-trip trajectory while its motion was recorded using an external vision-based tracking system. The simulation results were then quantitatively compared with the experimental measurements to assess position and trajectory tracking performance. The comparison demonstrates a high degree of agreement between the digital twin and the experimental data. These findings validate the effectiveness of the proposed digital twin framework for representing tracked underwater locomotion, highlighting its potential for controller development and performance evaluation prior to field deployment.

Abstract:
This paper presents K-URSim (KRISO Underwater Robot Simulator), a ROS2-based modular simulation platform that serves as the backbone of the Saemangeum Digital Marine Testbed for unmanned marine systems (UMS). Unlike conventional simulators, K-URSim is tightly coupled with the real inshore test site, integrating in-situ ocean data such as currents, waves, and bathymetry into the simulation loop for realistic environment reproduction and data-driven validation. The platform adopts a modular architecture (KRISO Extensions) supporting vehicle modeling, physics-based dynamics, sensing, planning, control, and external interfaces within a unified ROS2 framework. By bridging simulation and real-world experiments, K-URSim enables pre-validation of control algorithms and mission scenarios prior to deployment, reducing cost and risk. It also supports reinforcement learning-based autonomy and synthetic data generation for sim-to-real transfer. The system can also integrate with NVIDIA Omniverse for digital-physical hybrid testing and sim-to-real validation.

Abstract:
Model predictive control (MPC) faces significant limitations when applied to systems evolving on nonlinear manifolds, such as robotic attitude dynamics and constrained motion planning, where traditional Euclidean formulations struggle with singularities, over-parameterization, and poor convergence. To overcome these challenges, this paper introduces FactorMPC, a factor-graph based MPC toolkit that unifies system dynamics, constraints, and objectives into a modular, user-friendly, and efficient optimization structure. Our approach natively supports manifold-valued states with Gaussian uncertainties modeled in tangent spaces. By exploiting the sparsity and probabilistic structure of factor graphs, the toolkit achieves real-time performance even for high-dimensional systems with complex constraints. Velocity-extended on-manifold control barrier function (CBF)-based obstacle avoidance factors are derived for safety-critical applications. By bridging graphical models with safety-critical MPC, our work offers a scalable and geometrically consistent framework for integrated planning and control. Simulations and experimental results on a quadrotor platform demonstrate superior trajectory tracking and obstacle avoidance performance compared to baseline methods. To foster research reproducibility, we provide an open-source implementation offering plug-and-play factors.

Abstract:
Performance in automated driving tasks improves significantly with the incorporation of location-specific prior knowledge. This is because agent behavior usually strongly correlates with location features. A common example is the strong tendency of vehicles to follow their lane, but less obvious interactions exist as well. To this end, high definition (HD) map information is typically collected and made available during both training and inference to act as a location prior. In this paper, we propose to aggregate location-specific information in a data-driven way. Specifically, we learn a global latent grid that acts as a behavior prior to a learned occupancy prediction model. Since the prediction loss function is directly backpropagated into the latent grid, no additional labels are required beyond the already available future agent locations. We use the large real-world Lyft Level 5 motion prediction dataset to empirically demonstrate the merit of our learned location-specific latent behavior prior. Applied to two different prediction models, our approach achieves performance comparable to or exceeding baseline models that rely on HD maps, without requiring an HD map. Additional experiments reveal that the latent behavior prior is able to distill geometric and semantic information purely from agent behavior. These results indicate that directly learning location-specific priors is a promising direction towards automated driving without costly HD maps.

Abstract:
This paper addresses the challenge of allocating heterogeneous resources among multiple agents in a decentralized manner. Our proposed method, Liquid-Graph-Time Clustering-IPPO (LGTC-IPPO), builds upon Independent Proximal Policy Optimization (IPPO) by integrating dynamic cluster consensus, a mechanism that allows agents to form and adapt local sub-teams based on resource demands. This decentralized coordination strategy reduces reliance on global information and enhances scalability. We evaluate LGTC-IPPO against standard multi-agent reinforcement learning baselines and a centralized expert solution across a range of team sizes and resource distributions. Experimental results demonstrate that LGTC-IPPO achieves more stable rewards, better coordination, and robust performance even as the number of agents or resource types increases. Additionally, we illustrate how dynamic clustering enables agents to reallocate resources efficiently, even in scenarios with discharging resources.

Abstract:
Executing pre-planned paths in multi-agent systems is challenging, as a lack of synchronization can lead to collisions or live-/deadlocks, while enforcing strict synchronization may cause a widespread team delay in reaching goals. An Action Dependency Graph (ADG) solves this problem by identifying and synchronizing only the necessary robots during execution by post-processing all agents' paths into a static directed graph with actions as nodes and edges representing the execution precedence order between actions. However, ADG's creation phase currently requires an exhaustive search of the action space that inflates both computation and communication (O(R^2 T^2), where R is the number of robots and T is the maximum path length). This makes ADG the bottleneck for current state-of-the-art MAPF planners, which can solve for up to 10000 agents, and lifelong MAPF, which needs constant replanning during execution. Moreover, this biquadratic scaling also limits the extension of ADG to continuous space scenarios, where high-frequency updates effectively result in long path lengths. Addressing these limitations, in this work, we propose three improved execution graphs to cater to different needs and scenarios: SAGE, which adds edges based on the sequence in which robots visit a position; MAGE, which takes the execution graph of SAGE as input and prunes its redundant edges, reducing communication burden during execution; and FORTED, which combines both approaches with a reduced complexity of O(RT), making it the overall best in discrete scenarios, trading off scalability to continuous space scenarios. All three methods achieve speedups of 300- to 8000-fold over ADG, with MAGE and FORTED reducing communication overhead by more than 14 times. By integr

Abstract:
We present a method for the unattended gray-box identification of sensor models commonly used by localization algorithms in the field of robotics. The objective is to determine the most likely sensor model for a time series of unknown measurement data, given an extendable catalog of predefined sensor models. Sensor model definitions may require states for rigid-body calibrations and dedicated reference frames to replicate a measurement based on the robot's localization state. A health metric is introduced, which verifies the outcome of the selection process in order to detect false positives and facilitate reliable decision-making. In a second stage, an initial guess for identified calibration states is generated, and the necessity of sensor world reference frames is evaluated. The identified sensor model with its parameter information is then used to parameterize and initialize a state estimation application, thus ensuring a more accurate and robust integration of new sensor elements. This method is helpful for inexperienced users who want to identify the source and type of a measurement, sensor calibrations, or sensor reference frames. It will also be important in the field of modular multi-agent scenarios and modularized robotic platforms that are augmented by sensor modalities during runtime. Overall, this work aims to provide a simplified integration of sensor modalities to downstream applications and circumvent common pitfalls in the usage and development of localization approaches.

Abstract:
This work presents FusionGS-SLAM, a robust framework for simultaneous localization and real-time photorealistic mapping leveraging the power of sensor fusion techniques. To achieve this, the proposed method employs a tightly-coupled technique to effectively combine multiple factors from improved subsystems, thereby generating a robust odometry for the downstream tasks. Moreover, a dense 3D Gaussian map is constructed by leveraging geometric information across sensor modalities, with real-time mapping strategies designed to enhance robustness and rendering quality in large-scale and challenging environments. Experimental evaluation on various challenging scenes, including public and self-collected datasets, showcases superior performance compared to the current state-of-the-art 3DGS SLAM.

Abstract:
This paper presents a novel hydraulic-driven two-finger robotic gripper designed to handle objects of various shapes and sizes. To meet the demands of field robotics and heavy industrial environments, a self-adaptive finger mechanism was integrated with hydraulic actuation. However, this integration leads to increased structural volume, as hydraulics produce linear motion and require additional hydraulic components. Additionally, precise force control becomes challenging, as harsh environments limit the use of other sensing devices for fine control. These issues are addressed by employing an offset slider-crank mechanism, which efficiently converts linear motion into rotational motion. Additionally, a newly designed double-acting bi-piston cylinder allows both fingers to operate using a single cylinder, reducing the number of hydraulic components. To enable pressure-based force control, kinematic and static analyses of the mechanism were conducted. A prototype was developed and experimentally validated for its grasping performance and analysis. It demonstrated high performance in lifting heavy objects, such as an 18 kg tire, and delicately handling fragile items like eggs and paper cups.

Abstract:
Dynamic Movement Primitives (DMPs) form a robust framework for trajectory generation based on imitation learning, aiming to replicate the shape of reference trajectories from demonstrations closely. DMPs have been extensively employed for trajectory planning in robotic systems. However, they cannot safely guarantee complex nonlinear constraints, which is essential at the control level. On the other hand, Control Barrier Functions (CBFs) are used to modulate the input of control-affine dynamic systems subject to state-dependent constraints, guaranteeing that the system remains within predefined safe sets while converging towards target states. This letter proposes Constrained Movement Primitives (CMPs), a novel framework that integrates DMPs with CBFs to generate safe-by-construction trajectories subject to nonlinear constraints. We represent DMPs in control-affine form and combine them with the closed-form input provided by CBFs, overcoming the limitations of existing iterative optimisation methods for constrained DMPs. We demonstrate that CBFs preserve the goal convergence guarantees of DMPs. Moreover, we validate our approach in simulation and on a real mobile robot subject to nonlinear kinodynamic constraints concerning maximum Cartesian velocity, obstacle avoidance, and maximum centrifugal acceleration to avoid slipping on curved trajectories.
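
As an illustration of what a closed-form CBF input can look like, the sketch below implements the textbook minimal-norm safety filter for a single affine constraint a + b · u >= 0, where for dynamics x' = f(x) + g(x) u and barrier h(x) one typically takes a = L_f h(x) + alpha * h(x) and b = L_g h(x). This is a generic filter under those assumptions, not the CMP formulation from the paper; the name `cbf_filter` and its parameters are hypothetical.

```python
from typing import List, Sequence


def cbf_filter(u_nom: Sequence[float], a: float, b: Sequence[float]) -> List[float]:
    """Return the input closest to u_nom (Euclidean norm) satisfying a + b . u >= 0.

    This is the closed-form solution of min ||u - u_nom||^2 subject to one
    affine constraint, so no iterative optimizer is needed.
    """
    slack = a + sum(bi * ui for bi, ui in zip(b, u_nom))
    if slack >= 0.0:
        # The nominal input already satisfies the barrier condition.
        return list(u_nom)
    b_sq = sum(bi * bi for bi in b)
    if b_sq == 0.0:
        raise ValueError("the input cannot influence the constraint (b is zero)")
    # Project u_nom onto the boundary of the half-space a + b . u >= 0.
    scale = -slack / b_sq
    return [ui + scale * bi for ui, bi in zip(u_nom, b)]
```

For example, for a 1D single integrator x' = u with h(x) = x_max - x, alpha = 1, x = 0.9, and x_max = 1, calling `cbf_filter([1.0], 0.1, [-1.0])` reduces the commanded velocity so that the barrier condition holds with equality.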

Abstract:
Enabling robots to robustly follow leaders supports tasks such as carrying supplies or guiding customers. While existing methods often fail to generalize to arbitrary leaders, and struggle when the leader temporarily leaves the robot's field of view, this work presents a unified framework to address both challenges. First, a segmentation model replaces traditional category-specific detection models, allowing the leader to be of any shape or type. To improve robustness, a distance frame buffer is designed to store high-confidence leader embeddings across distance intervals, accounting for the unique characteristics of leader-following tasks. Second, a goal-aware adaptation mechanism is designed to govern robot planning states based on the leader's visibility and motion, complemented by a graph-based planner that generates candidate trajectories for each state, ensuring efficient following with obstacle avoidance. Simulations and real-world experiments with a legged robot follower and diverse leaders in indoor and outdoor settings demonstrate the highest follow success rate of 96.9%, the lowest visual loss of 10.7%, the lowest collision rate of 1.8%, and the shortest leader-follower distance of 2.0 m. Visit follow-everything.github.io for more details.

Abstract:
Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critical applications in autonomous vehicles, mobile devices, and unmanned aerial vehicles (UAVs). However, existing solutions often suffer from high computational complexity or rely on an excessive number of point correspondences, limiting their real-world applicability. To address these limitations, we propose two efficient minimal solvers for estimating the relative poses of multi-camera systems using a novel parameterization. The first solver leverages the vertical direction prior provided by Inertial Measurement Units (IMUs), while the second utilizes the rotation axis direction prior from IMUs. Our methods require only four point correspondences and reduce the problem of multi-camera relative pose estimation to solving a univariate 6th-degree polynomial, a significant improvement over existing approaches, which typically involve 8th-degree polynomials. This reduction in computational complexity and correspondence requirements makes our solvers particularly effective when integrated into RANSAC frameworks, demonstrating strong potential for visual odometry applications. Through rigorous evaluations on synthetic data and the KITTI benchmark, our methods achieved superior computational efficiency and competitive accuracy compared to state-of-the-art algorithms.

Abstract:
Despite significant advancements in the research of aquatic-aerial robots, existing configurations struggle to efficiently perform underwater, surface, and aerial movement. In this paper, we propose a novel multimodal surfing aquatic-aerial vehicle, SurfAAV, which efficiently integrates underwater navigation, surface gliding, and aerial flying capabilities. Thanks to the design of the novel differential thrust vectoring hydrofoil, SurfAAV can achieve efficient surface gliding and underwater navigation without the need for a buoyancy adjustment system. This design provides flexible operational capabilities for both surface and underwater tasks, enabling the robot to quickly carry out underwater monitoring activities. Additionally, when it is necessary to reach another water body, SurfAAV can switch to aerial mode through a gliding takeoff, flying to the target water area to perform corresponding tasks. The main contribution of this letter lies in proposing a new solution for underwater, surface, and aerial movement, designing a novel hybrid prototype concept, developing the required control laws, and validating the robot's ability to successfully perform surface gliding and gliding takeoff. SurfAAV achieves a maximum surface gliding speed of 7.96 m/s and a maximum underwater speed of 3.1 m/s. The prototype's maximum surface gliding speed and maximum underwater cruising speed both exceed those of existing aquatic-aerial vehicles.

Abstract:
Motion planning involves determining a sequence of robot configurations to reach a desired pose, subject to movement and safety constraints. Traditional motion planning finds collision-free paths, but this is overly restrictive in clutter, where it may not be possible for a robot to accomplish a task without contact. In addition, contacts range from relatively benign (e.g. brushing a soft pillow) to more dangerous (e.g. toppling a glass vase), making it difficult to characterize which may be acceptable. In this paper, we propose IMPACT, a novel motion planning framework that uses Vision-Language Models (VLMs) to infer environment semantics, identifying which parts of the environment can best tolerate contact based on object properties and locations. Our approach uses Stochastic Belief Propagation to generate an anisotropic cost map that encodes directional push safety. We pair this map with a novel contact-aware A* planner to find stable, contact-rich paths. We perform experiments using 20 simulation and 10 real-world scenes and assess using task success rate, object displacements, and feedback from human evaluators. Our results over 3200 simulation and 200 real-world trials suggest that IMPACT enables efficient contact-rich motion planning in cluttered settings while outperforming alternative methods and ablations. Supplementary material is available at https://impact-planning.github.io/.

Abstract:
Deep reinforcement learning (DRL) has been widely applied to various applications, but improving exploration remains a key challenge. Recently, multi-actor DRL has emerged as a promising approach that enhances exploration by simultaneously deploying multiple actors for learning. Among these methods, actor diversity helps actors discover better policies. However, existing multi-actor DRL methods still lack effective techniques to promote actor diversity, leading to homogeneous, redundant actors and suboptimal policies. To address this, this work proposes a generic solution that can be seamlessly integrated into existing multi-actor DRL methods to promote actor diversity, thereby enabling better policy learning. Specifically, we decompose each actor into a representation module and a decision-making module, where the representation module receives the environment state and outputs a representation vector for the decision-making module to generate actions. We then compute the difference between each actor's representation vector and those of all other actors as an additional loss, referred to as representation distinguishability regularization, and train the actor with this loss alongside its original loss to promote actor diversity. We demonstrate that our method effectively improves the performance of nine state-of-the-art (SOTA) multi-actor DRL methods across eight benchmark tasks, in terms of return.
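
One way to picture the regularization term is sketched below. The abstract does not state which distance or sign convention is used, so the Euclidean distance, the averaging over other actors, and the negation (so that minimizing the term increases diversity) are all assumptions made for illustration; `distinguishability_losses` is a hypothetical name.

```python
import math
from typing import List, Sequence


def distinguishability_losses(reps: Sequence[Sequence[float]]) -> List[float]:
    """Per-actor term: negative mean Euclidean distance to the other actors' representations.

    reps[i] is the representation vector of actor i for the same environment state.
    Minimizing the returned value pushes an actor's representation away from the rest.
    """
    n = len(reps)
    if n < 2:
        raise ValueError("need at least two actors")
    losses = []
    for i in range(n):
        total = sum(math.dist(reps[i], reps[j]) for j in range(n) if j != i)
        losses.append(-total / (n - 1))
    return losses
```

In training, each value would be scaled by a weighting coefficient and added to the corresponding actor's original loss; the coefficient is likewise an assumption here.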

Abstract:
Robots can throw objects to distant targets using gravity, with applications ranging from material transport to firefighting. Existing approaches typically adopt a singleton throw formulation, where the carrier must reach a specific position-velocity configuration at the moment of throw. This reliance on a single throw point makes target-hitting highly sensitive to release delays. To address this limitation, we introduce the throw maneuver: a carrier trajectory that guarantees target hitting for objects released at any time along the trajectory. By differentiating the governing projectile equations, we derive the throw maneuver in its exact representation as ordinary differential equations, with analytical solutions available in special cases. Simulation results verify its invariant target-hit property and show that throw maneuvers achieve longer available throw time and ranges without target miss compared with a strong baseline throw method. Outdoor quadrotor experiments further demonstrate throw maneuver's improved accuracy and precision under realistic flight conditions compared with several baseline throw methods.
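
The sensitivity of a singleton throw to release delay can be illustrated with the drag-free projectile model p(t) = p0 + v0 t + 0.5 g t^2. The sketch below is not the authors' throw maneuver; it only computes the single position-velocity pair for a chosen flight time and the horizontal miss when release happens late. The function names, the constant-velocity carrier during the delay, and gravity along -z are assumptions.

```python
import math
from typing import Tuple

G = 9.81  # gravitational acceleration [m/s^2], assumed to act along -z
Vec3 = Tuple[float, float, float]


def release_velocity(release_pos: Vec3, target: Vec3, flight_time: float) -> Vec3:
    """Velocity required at release_pos to reach target after flight_time (no drag)."""
    if flight_time <= 0.0:
        raise ValueError("flight_time must be positive")
    (px, py, pz), (tx, ty, tz), t = release_pos, target, flight_time
    return ((tx - px) / t, (ty - py) / t, (tz - pz) / t + 0.5 * G * t)


def landing_point(release_pos: Vec3, velocity: Vec3, ground_z: float) -> Vec3:
    """Point where the released object crosses height ground_z while descending."""
    px, py, pz = release_pos
    vx, vy, vz = velocity
    disc = vz * vz + 2.0 * G * (pz - ground_z)
    if disc < 0.0:
        raise ValueError("object never reaches ground_z")
    tau = (vz + math.sqrt(disc)) / G  # later root of the height equation
    return (px + vx * tau, py + vy * tau, ground_z)


def miss_after_delay(release_pos: Vec3, target: Vec3,
                     flight_time: float, delay: float) -> float:
    """Horizontal miss if the carrier keeps its velocity and releases `delay` seconds late."""
    v = release_velocity(release_pos, target, flight_time)
    late_pos = (release_pos[0] + v[0] * delay,
                release_pos[1] + v[1] * delay,
                release_pos[2] + v[2] * delay)
    hit = landing_point(late_pos, v, target[2])
    return math.hypot(hit[0] - target[0], hit[1] - target[1])
```

With a release point 10 m above and 10 m short of the target and a 2 s flight time, a release delay of 0.1 s already shifts the landing point by more than half a meter in this simplified model.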

Abstract:
Autonomous exploration in complex environments is frequently hindered by inefficient back-and-forth movements and repetitive revisits to previously explored areas. To address these drawbacks, we propose a two-mode hybrid dynamic exploration strategy that detects isolated frontier clusters and adaptively switches between two modes: global exploration mode (GEM) and local clearance mode (LCM). The GEM generates sequences for frontier exploration access, while the LCM employs a flight-time greedy approach to select and clear isolated clusters, thereby avoiding redundant visits. In addition, to achieve adaptive yaw planning, the proposed exploration strategy generates a reference yaw sequence based on the frontiers near the path trajectory. The reference yaw sequence is then used to perform yaw optimization, with non-uniform B-spline time adjustments ensuring feasible yaw trajectories, fully leveraging the UAV's maneuverability and perception capabilities, and providing a plug-and-play solution for exploration research. Extensive simulations compared to state-of-the-art methods demonstrate that our approach significantly reduces both exploration time and distance, with a real-world experiment confirming its practical effectiveness.

Abstract:
Robot-Assisted Surgery (RAS) represents a major frontier in the robotics community, blending precision automation with human skill in high-stakes clinical environments. Evaluating surgeon performance in RAS is critical for training and certification, yet current methods rely heavily on video analysis or subjective manual scoring. This study presents a neuro-robotic interaction framework that uses Electroencephalography (EEG)-derived brain connectivity features to classify surgeons' skill levels during RAS tasks. The high dimensionality of EEG data imposes substantial computational cost. Therefore, we first apply Harris Hawks Optimization (HHO) to select an optimal EEG-channel subset, reducing computational cost. Then, functional connectivity feature metrics are extracted from the reduced EEG channel set and used to construct brain graphs, which serve as input to a Self-Supervised Graph Transformer (SSGT). The SSGT model is pre-trained via masked edge reconstruction to capture structural dependencies and fine-tuned for downstream skill-level classification. The proposed SSGT model achieves a classification accuracy of 96.60%, significantly outperforming both traditional machine learning and deep learning baselines. The label-efficient, structurally aware design of SSGT enables scalable and real-time assessment of surgical proficiency. This framework provides a foundation for intelligent robotic tutoring systems and generalizes to broader cognitive monitoring tasks in high-stakes human-robot interaction domains using EEG.

Abstract:
The ability to achieve and maintain inverted poses is essential for unlocking the full agility of miniature blimp robots (MBRs). However, developing reliable inverted control strategies for MBRs remains challenging due to their complex and underactuated dynamics. To address this challenge, we propose a novel framework that enables robust control policy learning for inverted pose on MBRs. The proposed framework consists of three core stages. First, a high-fidelity three-dimensional (3D) simulation environment is constructed and calibrated using real-world MBR motion data. Second, a robust inverted control policy is trained in simulation using a modified Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm combined with a domain randomization strategy. Third, a mapping layer is designed to bridge the sim-to-real gap and facilitate real-world deployment of the learned policy. Comprehensive evaluations in the simulation environment demonstrate that the learned policy achieves a higher success rate compared to the energy-shaping controller. Furthermore, experimental results confirm that the learned policy with a mapping layer enables an MBR to achieve and maintain a fully inverted pose in real-world settings.

Abstract:
We present a decentralized, agent-agnostic, and safety-aware control framework for human-robot collaboration based on Virtual Model Control (VMC). In our approach, both humans and robots are embedded in the same virtual-component-shaped workspace, where motion is the result of the interaction with virtual springs and dampers rather than explicit trajectory planning. A decentralized, force-based stall detector identifies deadlocks, which are resolved through negotiation. This reduces the probability of robots getting stuck in the block placement task from up to 61.2% to zero in our experiments. The framework scales without structural changes thanks to the distributed implementation: in experiments we demonstrate safe collaboration with up to two robots and two humans, and in simulation up to four robots, maintaining inter-agent separation at around 20 cm. Results show that the method shapes robot behavior intuitively by adjusting control parameters and achieves deadlock-free operation across team sizes in all tested scenarios.
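
The idea of motion emerging from virtual springs and dampers can be sketched as follows. This is a generic point-mass illustration and not the paper's controller: the function names, the single spring-damper pair, and the semi-implicit Euler integrator are all assumptions made for the example.

```python
def virtual_spring_damper_force(position, velocity, anchor, stiffness, damping):
    """Force of a virtual spring-damper pulling `position` toward `anchor`.

    All vector arguments are equal-length sequences of floats.
    """
    return [
        stiffness * (a - p) - damping * v
        for p, v, a in zip(position, velocity, anchor)
    ]


def step(position, velocity, force, mass, dt):
    """One semi-implicit Euler step for a point mass driven by `force`."""
    new_velocity = [v + (f / mass) * dt for v, f in zip(velocity, force)]
    new_position = [p + v * dt for p, v in zip(position, new_velocity)]
    return new_position, new_velocity
```

Repeatedly calling `virtual_spring_damper_force` and `step` moves the point toward the anchor without any explicit trajectory. Changing `stiffness` and `damping` changes how the motion looks, which loosely mirrors the parameter-based behavior shaping mentioned in the abstract.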

Abstract:
We present a hybrid robotic skin that combines electrical impedance tomography (EIT) with pneumatic tactile sensing to improve force reconstruction capability. The developed robotic skin is fabricated entirely by 3D printing and spray coating, making it affordable and easy to build. A Tikhonov-regularized inverse reconstruction, paired with per-pad pneumatic calibration, enables accurate large-area tactile sensing with a simple measurement scheme. For validation, we conducted load-cell indentation experiments; the results showed consistent force reconstruction across locations within a pad. Compared with an EIT-only baseline, sensitivity non-uniformity was also reduced, with the coefficient of variation decreasing from 0.31 to 0.14, indicating that the proposed approach addresses a longstanding limitation of EIT. We further demonstrated chest-mounted integration on a humanoid robot and found that the pneumatic signals remained reliable across diverse contact scenarios, including multiple simultaneous contacts on the same sensing pad. These results indicate a practical path toward accurate, scalable whole-body tactile sensing in real robotic systems.
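
A Tikhonov-regularized inverse can be sketched in a few lines. The example below solves min ||J x - y||^2 + lam ||x||^2 through the normal equations with an identity regularizer, which is the textbook form. The matrix `J`, the function name `tikhonov_solve`, and the choice of regularizer are assumptions for illustration, since the abstract does not specify the paper's sensitivity matrix or regularization operator.

```python
def tikhonov_solve(J, y, lam):
    """Solve min ||J x - y||^2 + lam ||x||^2 via the normal equations."""
    m, n = len(J), len(J[0])
    # A = J^T J + lam * I, b = J^T y
    A = [[sum(J[k][i] * J[k][j] for k in range(m)) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(J[k][i] * y[k] for k in range(m)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x
```

With a positive `lam` the system stays solvable even when there are fewer measurements than unknowns, which is the usual reason for regularizing EIT-style inverse problems.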

Abstract:
Agricultural labor shortages have increased the demand for automation in farming. In cabbage harvesting, automated harvesters rely on a side-mounted camera for detection to control harvesting height, but occlusion from outer leaves can cause errors and lead to failures. This paper presents a robust detection and control framework that integrates YOLO-based cabbage detection, trajectory tracking, LSTM-based motion classification, and LiDAR point cloud analysis. The system functions as a fail-safe while also providing redundancy, enabling recovery when side-mounted camera detection fails, and addresses two critical failure modes: cabbage jamming during extraction and low harvesting height. Temporal motion features are classified by an LSTM, while LiDAR-based trajectory analysis of the cabbage head point cloud centroid identifies low harvesting height. When both jamming and low harvesting height are detected, the system issues a raising command to the harvester. Experiments on real-world data demonstrated 95.3% accuracy in jamming detection and 95% in low harvesting height detection. Field experiments confirmed real-time operation at 10 Hz and effective prevention of severe blockages, achieving an overall control accuracy of 97.0%. These results demonstrate the feasibility of the proposed method for robust automated cabbage harvesting.

Abstract:
Indoor traversability segmentation aims to identify safe, navigable free space for autonomous agents, which is critical for robotic navigation. Pure vision-based models often fail to detect thin obstacles, such as chair legs, which can pose serious safety risks. We propose a multi-modal segmentation framework that leverages RGB images and sparse 1D laser depth information to capture geometric interactions and improve the detection of challenging obstacles. To reduce the reliance on large labeled datasets, we adopt the few-shot segmentation (FSS) paradigm, enabling the model to generalize from limited annotated examples. Traditional FSS methods focus solely on positive prototypes, often leading to overfitting to the support set and poor generalization. To address this, we introduce a negative contrastive learning (NCL) branch that leverages negative prototypes (obstacles) to refine free-space predictions. Additionally, we design a two-stage attention depth module to align 1D depth vectors with RGB images both horizontally and vertically. Extensive experiments on our custom-collected indoor RGB-D traversability dataset demonstrate that our method outperforms state-of-the-art FSS and RGB-D segmentation baselines, achieving up to 9% higher mIoU under both 1-shot and 5-shot settings. These results highlight the effectiveness of leveraging negative prototypes and sparse depth for robust and efficient traversability segmentation.

Abstract:
Exoskeletons, as wearable human-robot collaborative devices, can effectively reduce muscle fatigue caused by prolonged material handling and overhead tasks. However, most existing active exoskeletons adopt tightly coupled serial structures, which generally suffer from insufficient wearing comfort, limited muscle coverage, and restricted workspace. To address these issues, this paper presents a novel loosely coupled, parallel upper-body exoskeleton (6.9 kg). The proposed exoskeleton is connected only at the waist and elbow, providing assistance not only to the small muscle groups of the arms and shoulders but also to the larger muscle groups of the waist, back, and chest. Moreover, heavy components of the exoskeleton (approximately 78% of the total mass), such as actuators, are located near the wearer's waist, which places the center of mass close to the human center of mass, improving comfort and control reliability. To validate the feasibility of the design, kinematic models of both the exoskeleton and the human upper body were established. Analysis showed that the end-effector workspace of the exoskeleton exceeds that of the human elbow. Prototype experiments were conducted, allowing the wearer to perform arbitrary postures without constraining spinal motion. This indicates that the exoskeleton holds potential in work assistance scenarios such as long-term heavy lifting and overhead work.

Abstract:
Relational object rearrangement (ROR) tasks require a robot to manipulate objects with precise semantic and geometric reasoning. Existing approaches either rely on pre-collected demonstrations that struggle to capture complex geometric constraints, or generate goal-state observations to capture semantic and geometric knowledge but fail to explicitly couple object transformation with action prediction, leading to errors from generative noise. To address these limitations, we propose Imagine2Act, a 3D imitation-learning framework that incorporates semantic and geometric constraints of objects into policy learning to tackle high-precision manipulation tasks. We first generate imagined goal images conditioned on language instructions and reconstruct corresponding 3D point clouds to provide robust semantic and geometric priors. These imagined goal point clouds serve as additional inputs to the policy model, while an object-action consistency strategy with soft pose supervision explicitly aligns predicted action motion with object transformation. This design enables Imagine2Act to reason about object relational goals and achieve accurate, high-precision manipulation across diverse tasks. Experiments in both simulation and the real world demonstrate that Imagine2Act outperforms previous state-of-the-art policies.

Abstract:
Object goal navigation aims to guide an agent to find a specific target object in an unseen environment using only first-person visual observations. It requires the agent to enhance scene understanding and train a robust navigation policy. To address this, we propose two complementary techniques, commonsense-guided object graph reasoning (COGR) and policy regularization (PR). Specifically, COGR improves the agent's scene understanding by integrating object relationships, including category proximity and spatial correlation. It extracts co-occurrence embeddings of the target object from a large language model (LLM) as commonsense knowledge to guide object graph reasoning, enabling the agent to reason beyond visual co-occurrence observed in training environments. PR is a knowledge distillation-inspired regularization mechanism, where a commonsense-free model is used to regularize the navigation policy of the commonsense-guided model. We propose PR to mitigate potential performance degradation caused by knowledge bias from the LLM, enabling the training of a more robust navigation policy. Experiments in the AI2-Thor and RoboThor environments demonstrate the effectiveness and efficiency of our proposed method, and real-world deployment further validates its transferability.

Abstract:
Existing trajectory prediction methods exhibit significant performance degradation under distribution shifts during test time. Although test-time training techniques have been explored to enable adaptation, current approaches rely on an offline pre-trained predictor that lacks online learning flexibility. Moreover, they depend on fixed online model updating rules that do not accommodate the specific characteristics of test data. To address these limitations, we first propose a novel meta-learning framework to directly optimize the pre-trained predictor for fast and accurate online adaptation, which performs bi-level optimization on the performance of simulated test-time adaptation tasks during pre-training. Furthermore, at test time, we introduce a data-adaptive model updating mechanism that dynamically adjusts the predefined learning rates and updating frequencies based on online partial derivatives and hard sample selection. This mechanism makes the learning rate suit test data, and focuses on informative hard samples to enhance efficiency. Experiments are conducted on various challenging cross-dataset distribution shift scenarios, including nuScenes, Lyft, and Waymo. Results demonstrate that our method achieves superior forecasting accuracy, surpassing state-of-the-art test-time training methods for trajectory prediction. Additionally, our method excels under suboptimal learning rates and high FPS demands, showcasing its robustness and practicality.

Abstract:
We introduce an elastic-driven self-folding approach that fabricates robots directly from flat 3D-printed conductive PLA nets. Elastic bands routed through printed hooks store energy that folds the sheet into programmed 3D geometries, while the flat state allows accurate placement of electronics and magnets before deployment. The same substrate doubles as electrodes for capacitive touch and supports a reusable platform I/O palette with Hall sensors and eccentric rotating mass (ERM) motors for docking detection and vibration actuation. We also derive a closed-form folding model that balances hinge stiffness with elastic band moment to predict equilibrium fold angles; experiments validate the model and yield a design map linking hinge thickness, band size, and hook spacing to target angles. Using this workflow we realize multiple polyhedral modules and demonstrate three applications: a cube that highlights the potential of self-folding for scalable modular robot collectives, a deployable gripper, and a tendon-driven finger. The method is low cost, stimulus-free, and integrates actuation and sensing.

Abstract:
Brain-computer interfaces (BCIs) provide a hands-free control modality for mobile robotics, yet decoding user intent during real-world navigation remains challenging. This work presents a brain-robot control framework for offline decoding of driving commands during robotic rover operation. A 4WD Rover Pro platform was remotely operated by 12 participants who navigated a predefined route using a joystick, executing the following commands: forward, reverse, left, right, and stop. Electroencephalogram (EEG) signals were recorded with a 16-channel OpenBCI cap and aligned with motor actions at a horizon of 0 ms and at eight future prediction horizons (greater than 0 ms). After data preprocessing, eleven deep learning (DL) models were benchmarked for the task of intent classification, across the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Transformer architectural families. ShallowConvNet achieved the highest performance for both action prediction (F1-score 67% at a 0 ms horizon) and intent prediction (F1-score 66% at a 300 ms horizon), maintaining robust performance at future horizons. By combining real-world robotic control with multi-horizon EEG intention decoding, this study introduces a reproducible benchmark and reveals key design insights for predictive, DL-based BCI systems.

Abstract:
Vision-Language-Action (VLA) models have recently achieved significant advances in robotic manipulation. However, vision-only VLA models have fundamental limitations, particularly in perceiving interactive and dynamic manipulation processes. This paper proposes Audio-VLA, a multimodal manipulation policy that leverages contact audio to perceive contact events and dynamic process feedback. Audio-VLA overcomes the vision-only constraints of VLA models. Additionally, this paper introduces the Task Completion Rate (TCR) metric to systematically evaluate dynamic operational processes. Audio-VLA employs pre-trained DINOv2 and SigLIP as visual encoders, AudioCLIP as the audio encoder, and Llama2 as the large language model backbone. We apply LoRA fine-tuning to these pre-trained modules to achieve robust cross-modal understanding of both visual and acoustic inputs. A multimodal projection layer aligns features from different modalities into the same feature space. Moreover, the RLBench and LIBERO simulation environments are enhanced by adding collision-based audio generation to provide realistic sound feedback during object interactions. Since current robotic manipulation evaluations focus on final outcomes rather than providing systematic assessment of dynamic operational processes, the proposed TCR metric measures how well robots perceive dynamic processes during manipulation, creating a more comprehensive evaluation metric. Extensive experiments on LIBERO, RLBench, and two real-world tasks demonstrate Audio-VLA's superior performance over vision-only comparative methods, while the TCR metric effectively quantifies dynamic process perception capabilities. The source code and pre-trained models are publicly available at https://wxone.github.io/AudioVLA.

Abstract:
Autonomous underwater vehicles (AUVs) are increasingly used to survey coral reefs, yet efficiently locating specific coral species of interest remains difficult: target species are often sparsely distributed across the reef, and an AUV with limited battery life cannot afford to search everywhere. When detections of the target itself are too sparse to provide directional guidance, the robot benefits from an additional signal to decide where to look next. We propose using the visual environmental context -- the habitat features that tend to co-occur with a target species -- as that signal. Because context features are spatially denser and often vary more smoothly than target detections, we hypothesize that a reward function targeted at broader environmental context will enable adaptive planners to make better decisions on where to go next, even in regions where no target has yet been observed. Starting from a single labeled image, our method uses patch-level DINOv2 embeddings to perform one-shot detections of both the target species and its surrounding context online. We validate our approach using real imagery collected by an AUV at two reef sites in St. John, U.S. Virgin Islands, simulating the robot's motion offline. Our results demonstrate that one-shot detection combined with adaptive context modeling enables efficient autonomous surveying, sampling up to 75% of the target in roughly half the time required by exhaustive coverage when the target is sparsely distributed, and outperforming search strategies that only use target detections.

Abstract:
Reliable localization in robotics requires robust handling of sensor outliers, particularly in environments where acoustic or bearing measurements are noisy. We propose a replicator-dynamics-based approach for weighted group-k consistent set maximization (rGkCM) to identify the densest subsets of mutually consistent measurements in hypergraphs. To complement existing range-based consistency metrics, we introduce a k = 3 azimuth-elevation consistency check for bearing measurements to static landmarks. Our method efficiently identifies cliques in weighted k-uniform hypergraphs, leveraging the fitness of nodes to guide both pruning and recovery. We evaluate rGkCM on simulated trajectories with varying outlier levels and demonstrate significant computational speedup over the heuristic unweighted GkCM (uGkCM) method while maintaining comparable accuracy. Finally, we validate the approach on a WAM-V autonomous surface vessel equipped with an acoustic beacon and GNSS ground truth, showing effective outlier rejection in a shallow, multipath-prone marina. Results indicate that rGkCM enables robust and efficient outlier rejection for real-world bearing-based localization tasks.
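
To give a feel for how replicator dynamics concentrate weight on a mutually consistent subset, the sketch below runs the classical discrete replicator update on a pairwise consistency matrix. This is the ordinary graph (k = 2) case and not the paper's weighted k-uniform hypergraph formulation; the function name `replicator_dynamics` and the matrix `W` are assumptions for illustration.

```python
def replicator_dynamics(W, iterations=200):
    """Run discrete replicator dynamics on a symmetric non-negative matrix W.

    W[i][j] is the consistency weight between measurements i and j.
    Returns the final weight vector on the probability simplex.
    """
    n = len(W)
    x = [1.0 / n] * n  # start from the barycentre of the simplex
    for _ in range(iterations):
        fitness = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        mean_fitness = sum(x[i] * fitness[i] for i in range(n))
        if mean_fitness == 0.0:
            break  # no consistent pairs at all
        x = [x[i] * fitness[i] / mean_fitness for i in range(n)]
    return x
```

Entries whose fitness stays below the population mean shrink toward zero, so the surviving support can be read as the inlier set. In a graph with three mutually consistent measurements and one measurement consistent with only one of them, the weight settles evenly on the three.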

Abstract:
Text-to-motion generation, which converts motion language descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Though existing models have achieved significant fidelity, they still suffer from two core limitations: (i) They treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. (ii) They are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, propagating through the decoder and producing unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) proposing Periodicity-Saliency Aware Mamba, which utilizes novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation to capture coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) to enhance robust alignment of textual and motion embeddings. Extensive experiments on the HumanML3D and KIT-ML datasets confirm the effectiveness of our approach, which achieves an FID of 0.068 and consistent gains on all other metrics.
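
Periodicity estimation by autocorrelation can be illustrated with a short sketch. For brevity it uses a direct O(n^2) autocorrelation rather than the FFT-accelerated variant named in the abstract, and it works on a 1D signal. The function names and the peak-picking rule (highest peak after the first negative lag) are assumptions, not the paper's algorithm.

```python
def autocorrelation(signal):
    """Biased autocorrelation of the mean-removed signal for lags 0..n-1."""
    n = len(signal)
    mean = sum(signal) / n
    c = [s - mean for s in signal]
    return [sum(c[t] * c[t + lag] for t in range(n - lag)) / n for lag in range(n)]


def dominant_period(signal):
    """Lag of the highest autocorrelation peak after the first negative lag.

    Returns None if the autocorrelation never becomes negative,
    which is treated here as "no period found".
    """
    acf = autocorrelation(signal)
    first_negative = next((k for k, v in enumerate(acf) if v < 0.0), None)
    if first_negative is None:
        return None
    return max(range(first_negative, len(acf)), key=lambda k: acf[k])
```

Skipping lags before the first negative value avoids picking the trivial peak near lag zero. An FFT-based version would compute the same autocorrelation in O(n log n) time.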

Abstract:
We present FlightDiffusion, a diffusion-based framework for training autonomous drones from first-person-view (FPV) video. The model generates FPV video sequences from a single frame and a text prompt, and derives corresponding state-action trajectories for task-conditioned navigation. FlightDiffusion leverages generative modeling to synthesize diverse FPV trajectories and corresponding state-action pairs, enabling scalable dataset generation without the high cost of real-world data collection. These datasets support not only learning pipelines but also the training of autonomous navigation systems. Our evaluation shows that the generated trajectories are physically feasible and executable, with a mean positional error of 0.25 m (RMSE 0.28 m) and a mean orientation error of 0.19 rad (RMSE 0.24 rad). Results in simulated environments indicate stable trajectory planning and consistent behavior across varying conditions. An ANOVA revealed no statistically significant difference between performance in simulation and reality (F(1, 16) = 0.394, p = 0.541), with success rates of M = 0.628 (SD = 0.162) and M = 0.617 (SD = 0.177), respectively, indicating effective sim-to-real transfer. The generated datasets provide a useful resource for future UAV research. This work introduces diffusion-based video generation as a promising mechanism for coupling task-level reasoning with executable trajectory synthesis in aerial robotics.

Abstract:
Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using Pi0-FAST as the underlying model, we extract per-token entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic uncertainty, and train compact transformer classifiers to map these sequences to help triggers. We explore supervision regimes for strong or weak supervision, and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
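
Two of the token-level signals named above, per-token entropy and log-probability, are simple to compute from a model's next-token distributions. The sketch below assumes the distributions are already available as plain probability lists; the function names are hypothetical, and the Dirichlet-based aleatoric and epistemic estimates are not shown.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_features(step_probs, chosen_ids):
    """Per-token (entropy, log-probability) pairs for a generated sequence.

    step_probs[t] is the distribution at step t and chosen_ids[t] is the
    index of the token that was actually emitted at that step.
    """
    return [
        (token_entropy(probs), math.log(probs[idx]))
        for probs, idx in zip(step_probs, chosen_ids)
    ]
```

The resulting sequence of pairs is the kind of temporal signal that a small classifier could consume, as opposed to a single score aggregated over the whole sequence.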

Abstract:
Autonomous aircraft must safely operate in non-towered airspace, where coordination relies on voice-based communication among human pilots. Safe operation requires an aircraft to predict the intent, and corresponding goal location, of other aircraft. This paper introduces a multimodal framework for aircraft goal prediction that integrates natural language understanding with spatial reasoning to improve autonomous decision-making in such environments. We leverage automatic speech recognition and large language models to transcribe and interpret pilot radio calls, identify aircraft, and extract discrete intent labels. These intent labels are fused with observed trajectories to condition a temporal convolutional network and Gaussian mixture model for probabilistic goal prediction. Our method significantly reduces goal prediction error compared to baselines that rely solely on motion history, demonstrating that language-conditioned prediction increases prediction accuracy. Experiments on a real-world dataset from a non-towered airport validate the approach and highlight its potential to enable socially aware, language-conditioned robotic motion planning.

Abstract:
Most existing robotic surgery systems adopt a human-in-the-loop paradigm, often with the surgeon directly teleoperating the robotic system. Adding intelligence to these robots would enable higher-level control, such as supervised autonomy or even full autonomy. However, artificial intelligence (AI) requires large amounts of training data, which is currently lacking. This work proposes SurgSync, a multi-modal data collection framework with offline and online synchronization to support training and real-time inference, respectively. The framework is implemented on a da Vinci Research Kit (dVRK) and introduces (1) dual-mode (online/offline-matching) synchronized recorders, (2) a modern stereo endoscope to achieve image quality on par with clinical systems, and (3) additional sensors such as a side-view camera and a novel capacitive contact sensor to provide ground truth contact data. The framework also incorporates a post-processing toolbox for tasks such as depth estimation, optical flow, and a practical kinematic reprojection method using Gaussian heatmaps. User studies with participants of varying skill levels are performed with ex-vivo tissue to provide clinically realistic data, and a network for surgical skill assessment is employed to demonstrate utilization of the collected data. Through the user study experiments, we obtained a dataset of 214 validated instances across multiple canonical training tasks. All software and data are available at surgsync.github.io.

Abstract:
Generalization remains a fundamental challenge in robotic manipulation. To tackle this challenge, recent Vision-Language-Action (VLA) models build policies on top of Vision-Language Models (VLMs), seeking to transfer their open-world semantic knowledge. However, their zero-shot capability lags significantly behind the base VLMs, as the instruction-vision-action data is too limited to cover diverse scenarios, tasks, and robot embodiments. In this work, we present a zero-shot framework that leverages Image-Generative VLMs as world models to generate desired goal states, from which the target object pose is derived to enable generalizable manipulation. The key insight is that object state representation is the golden interface, naturally separating a manipulation system into high-level and low-level policies. This representation abstracts away explicit action annotations, allowing the use of highly generalizable VLMs while simultaneously providing spatial cues for training-free low-level control. To further improve robustness, we introduce a Reflection-through-Synthesis process that iteratively validates and refines the generated goal image before execution. Both simulated and real-world experiments demonstrate that our framework achieves strong performance and promising generalizability in manipulation tasks. Supplementary materials are available at https://nus-lins-lab.github.io/goalvlaweb/.

Abstract:
Gaussian Splatting SLAM methods have exhibited impressive high-fidelity rendering performance. Existing methods maintain high rendering quality around the current camera viewpoint, but the rendering quality degrades in previously observed regions as the camera moves away, particularly in real-world scenarios. We identify two core factors for high-quality rendering: keyframes should efficiently cover the entire scene while minimizing redundancy, and the mapping strategy should effectively select critical keyframes for full scene optimization. To address these issues, we propose SupGS-SLAM to improve rendering quality across the entire scene. For effective keyframe management, we propose an efficient keyframe strategy, which reduces redundant keyframe selection and prioritizes the optimization of critical keyframes by assigning high weights. For enhanced mapping, we propose a supplementary mapping strategy comprising three components: supplementary densification, supplementary global mapping, and supplementary depth mapping. In supplementary densification, we add supplementary Gaussian primitives to previous regions with insufficient representation. In supplementary global mapping, we select keyframes globally to optimize the full scene. In supplementary depth mapping, we use estimated depth to optimize regions without ground-truth depth. Extensive experiments demonstrate that SupGS-SLAM achieves excellent performance on both synthetic and real-world datasets. The project page is available at https://github.com/rucliushuai/SupGS-SLAM.

Abstract:
Safe navigation often relies on well-defined conditions based on the shape of robots and obstacles, and can be challenging when they have irregular geometries. While Control Barrier Functions (CBFs) offer an efficient mechanism to enforce safe set forward invariance, common shape surrogates (e.g., spheres or super-ellipsoids) either are overly conservative in unstructured scenes or require many local primitives, which inflates constraint counts and degrades real-time performance. In this paper, we introduce a novel geometry-aware CBF based on Bernstein-Polynomial Signed Distance Fields (BP-SDFs). It provides a unified way to represent the obstacles and robots, so as to represent the barrier function with a unified minimum distance. Benefiting from the differentiability of the Bernstein polynomials, one can easily enforce the control constraints in a closed loop. We validate the method's efficiency and performance to guarantee safety in single-robot navigation and heterogeneous multi-robot collision avoidance via simulations under different environments.
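
The differentiability that the abstract relies on is easy to see in one dimension: the derivative of a Bernstein polynomial is again a Bernstein polynomial whose coefficients are scaled differences of the original ones. The sketch below shows this for a univariate polynomial on [0, 1]. It is only an illustration; the multivariate form used for BP-SDFs is not specified in the abstract, and the function names are hypothetical.

```python
from math import comb


def bernstein_basis(i, n, t):
    """i-th Bernstein basis polynomial of degree n evaluated at t in [0, 1]."""
    return comb(n, i) * t**i * (1.0 - t) ** (n - i)


def bernstein_eval(coeffs, t):
    """Evaluate sum_i coeffs[i] * B_{i,n}(t) with n = len(coeffs) - 1."""
    n = len(coeffs) - 1
    return sum(c * bernstein_basis(i, n, t) for i, c in enumerate(coeffs))


def bernstein_derivative(coeffs, t):
    """Derivative at t: a degree n-1 Bernstein polynomial with
    coefficients n * (coeffs[i+1] - coeffs[i])."""
    n = len(coeffs) - 1
    diffs = [n * (coeffs[i + 1] - coeffs[i]) for i in range(n)]
    return bernstein_eval(diffs, t)
```

Because the gradient is available in closed form, a barrier constraint built on such a distance field can be differentiated analytically instead of numerically.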

Abstract:
Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation: structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies, which do not encode precise geometry but capture richer rough shape and semantic information, effectively combining the expressiveness of implicit representations with the structural priors of explicit ones. In addition, informed by shared design choices in prior work, we develop a streamlined, efficient 3DGS-based rendering pipeline that is deliberately kept lightweight, improving efficiency while leaving greater representational capacity to the front-end latent module. Extensive evaluations on RLBench, ManiSkill2, and a real-robot platform demonstrate consistent gains in task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines. Ablation studies further confirm that each component of our framework is critical to overall performance.

Abstract:
Tactile manipulation is a prominent and growing field where most research focuses on developing generalized manipulation policies. However, tactile execution monitoring - the ability to reliably evaluate manipulation at the skill level - is often overlooked, despite being critical for unsupervised deployment in both human-centered environments and industry, where strict safety and quality requirements apply. We propose the Tactile Predictive Encoding Model (TPEM), a time-series tactile perception framework inspired by human predictive encoding that enables real-time anomaly detection from skill-level sensory data. TPEM extends predictive coding concepts from global task modeling to precise monitoring of contact-rich manipulation beyond the capabilities of visual sensing. We evaluate TPEM on three representative tasks: key insertion and turning, peg-in-hole insertion, and screw insertion and tightening using an industrial assembly model. Experiments on a tactile-enabled Franka Emika robot under realistic noise conditions show robust anomaly detection with zero false positives. Comparison with baseline methods, including Support Vector Machines (SVM), Hidden Markov Models (HMM), and recurrent generative models such as LSTM-VAE, demonstrates that TPEM consistently outperforms state-of-the-art approaches in contact-rich skill-level execution monitoring.

Abstract:
Benefiting from mobility and dexterity, Mobile Manipulation (MM) systems are expected to assist humans with diverse tasks in everyday life. However, since MM tasks (e.g., tidying up a room) require learning multi-stage heterogeneous behaviors (e.g., picking, placing, and opening), existing Reinforcement Learning (RL) agents often face sample inefficiency and progress reversal issues. In addition, such MM agents are limited to learning customized tasks, thus not allowing for the extrapolation to new tasks and real-world scenes. In this work, we propose a Hierarchical Policy Distillation (HPD)-based RL framework to explicitly address these issues, which outperforms existing curriculum learning-based and hierarchical RL-based methods. Specifically, Sub-Skill Distillation (SSD) allows learning both the main MM task and easier sub-skills in a single training loop, facilitating exploration and mitigating progress reversal by distilling the relevant sub-skills' experience into the main task. Self-boosting Policy Distillation (SPD) is designed to enhance generalization and address the information asymmetry between MM tasks in a principled way, i.e., distilling the experience of a prior task to a new one. Comparative and ablation studies on different robotic platforms demonstrate that our method significantly outperforms existing methods. Finally, real-world experiments validate the practicality of our method.

Abstract:
Vehicle detection and localization in complex traffic scenarios pose significant challenges due to the interference of moving objects. Traditional methods often rely on outlier exclusions or semantic segmentations, which suffer from low computational efficiency and accuracy. The proposed SSF-PAN can achieve the functionalities of LiDAR point cloud based object detection/localization and SLAM (Simultaneous Localization and Mapping) with high computational efficiency and accuracy, enabling map-free navigation frameworks. The novelty of this work is threefold: 1) developing a neural network which can achieve segmentation among static and dynamic objects within the scene flows with different motion features, that is, semantic scene flow (SSF); 2) developing an iterative framework which can further optimize the quality of input scene flows and output segmentation results; 3) developing a scene flow-based navigation platform which can test the performance of the SSF perception system in the simulation environment. The proposed SSF-PAN method is validated using the SUScape-CARLA and the KITTI datasets, as well as on the CARLA simulator. Experimental results demonstrate that the proposed approach outperforms traditional methods in terms of scene flow computation accuracy, moving object detection accuracy, computational efficiency, and autonomous navigation effectiveness.

Abstract:
Stop-rotor aircraft are a class of vertical takeoff and landing (VTOL) vehicles that offer improved efficiency across flight modes through the use of a single central lifting surface. In VTOL, the central lifting surface rotates like a helicopter blade to achieve an upward force. In forward flight, the central lifting surface locks in place like a conventional fixed-wing aircraft and achieves lift from airflow over the surface. The improved efficiency across flight modes enables more complex mission profiles that balance flight time in VTOL and forward flight, such as package delivery and inspection over a large area. Despite the promise of stop-rotor aircraft, challenges in modeling and control, particularly due to the nonlinear rotor dynamics across flight modes, have limited practical implementation. To this end, this paper presents two types of models: (1) analytical models, derived from first-principles physics, which provide insight into the stability and control of the vehicle and demonstrate closed-loop stability of yaw and altitude using classical PID control; and (2) computational models, based on numerical integration of the system's ordinary differential equations, which provide full-state dynamics of the vehicle. Validation against bench-top constrained flight tests shows that the analytical models capture over 97% of the variance in the computational results, while the computational models account for up to 40% of the variance observed in experimental data.

Abstract:
Model Predictive Path Integral (MPPI) control is a powerful sampling-based approach suitable for complex robotic tasks due to its flexibility in handling nonlinear dynamics and non-convex costs. However, its applicability in real-time, high-frequency robotic control scenarios is limited by computational demands. This paper introduces Feedback-MPPI (F-MPPI), a novel framework that augments standard MPPI by computing local linear feedback gains derived from sensitivity analysis inspired by Riccati-based feedback used in gradient-based MPC. These gains allow for rapid closed-loop corrections around the current state without requiring full re-optimization at each timestep. We demonstrate the effectiveness of F-MPPI through simulations and real-world experiments on two robotic platforms: a quadrupedal robot performing dynamic locomotion on uneven terrain and a quadrotor executing aggressive maneuvers with onboard computation. Results illustrate that incorporating local feedback significantly improves control performance and stability, enabling robust, high-frequency operation suitable for complex robotic systems.
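
For readers unfamiliar with the baseline that F-MPPI augments, the following is a minimal sketch of one standard MPPI update, not of the feedback-gain computation described in the abstract. The toy 1D single-integrator dynamics, the quadratic cost, and every name and parameter (`rollout_cost`, `mppi_step`, `sigma`, `lam`) are assumptions chosen for illustration.

```python
import math
import random

def rollout_cost(x0, controls, goal, dt):
    """Accumulated squared distance to goal for x_{t+1} = x_t + u_t * dt."""
    x, cost = x0, 0.0
    for u in controls:
        x = x + u * dt
        cost += (x - goal) ** 2
    return cost

def mppi_step(x0, u_nom, goal, num_samples=256, sigma=1.0, lam=1.0,
              dt=0.1, rng=None):
    """One MPPI update: sample noise, weight by exp(-cost / lam), average."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    horizon = len(u_nom)
    noises = [[rng.gauss(0.0, sigma) for _ in range(horizon)]
              for _ in range(num_samples)]
    costs = [rollout_cost(x0, [u + e for u, e in zip(u_nom, eps)], goal, dt)
             for eps in noises]
    best = min(costs)  # subtracting the minimum avoids underflow in exp
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    # New nominal sequence: old sequence plus the cost-weighted mean noise.
    return [u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
            for t, u in enumerate(u_nom)]

u_init = [0.0] * 10
u_new = mppi_step(0.0, u_init, goal=1.0)
```

Every update requires `num_samples` full rollouts, which is the computational burden the abstract refers to when motivating corrections that avoid re-optimizing at each timestep.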

Abstract:
This paper presents a new visual localization framework for complex indoor environments under dynamic scene change conditions. Conventional visual localization methods often struggle to maintain accuracy and robustness in such environments, where frequent scene changes, occlusions, diverse object categories, and intricate scene structures significantly affect feature consistency and matching reliability. These challenges highlight the need for a more adaptive and semantically aware localization approach. The proposed algorithm integrates semantic information with a Gaussian map as input, which enhances its environmental awareness. This allows robust objects to be identified and extracted, thereby improving feature extraction performance and consequently enhancing pose estimation precision. Furthermore, a new coarse-to-fine matching strategy has been developed that takes an overview of the Gaussian map, from which suitable viewpoints are generated to produce the best matching images. Rendered images produced from the Gaussian map are employed in subsequent stages to improve comparison effectiveness, thereby enabling the determination of the most accurate camera pose. Finally, the capability of the proposed methodology is confirmed through experiments on different types of datasets.

Abstract:
Humans throw projectiles with high speed and accuracy, yet robots still lag behind despite precise control and low latency. A key obstacle is the lack of high-fidelity, tractable models for transient release dynamics, where momentum is exchanged via friction over ~50 ms. We first show that the conventional model combining rigid body dynamics and patch friction (LS model) suffers from pathological behaviors, resulting in poor prediction accuracy. While this can be mitigated using viscous smoothing and implicit integration (ILS model), it incurs a high computational cost. Motivated by the dominant effect of in-hand pivoting during release, we propose a Sliding Pivot (SP) model that simplifies the dynamics by capturing sticking-pivoting-sliding under vanishing normal force. The SP model offers comparable accuracy (within 10% of ILS) while running over 20× faster. Compared to the conventional LS model, SP reduces horizontal velocity error by 40% and angular velocity error by 63%, achieving 2.4 cm and 15.4 degrees mean absolute error for landing position and orientation. These results provide a robust, physically grounded foundation for scalable throwing robots.

Abstract:
Surgical automation holds immense potential to improve the outcome and accessibility of surgery. Recent studies use reinforcement learning to automate various surgical tasks. However, these policies are developed independently, and their reusability is limited when applied to other scenarios, making it more time-consuming for robots to incrementally solve tasks. Inspired by how human surgeons build their expertise, we propose Surgical Incremental Reinforcement Learning (SurgIRL). SurgIRL aims to (1) acquire new skills by referring to external policies (knowledge) and (2) build an expandable knowledge base and reuse it to solve multiple unseen tasks incrementally (incremental learning). Our SurgIRL framework includes three major components. We first define an expandable knowledge set containing heterogeneous policies that can be helpful for surgical tasks. Then, we propose Knowledge Inclusive Attention Network with mAximum Coverage Exploration (KIAN-ACE), which enhances learning performance through extensive navigation of the knowledge base. Finally, we develop incremental learning pipelines to expand and reuse a knowledge base and solve multiple surgical tasks sequentially. Our simulation experiments show that SurgIRL efficiently learns to automate ten surgical tasks separately or incrementally. We also demonstrate successful sim-to-real transfers of SurgIRL's policies on the da Vinci Research Kit (dVRK). The results represent an initial step towards lifelong robot learning for surgical automation.

Abstract:
Visuotactile sensors can provide rich contact information for robots. However, building a high-fidelity visuotactile simulator that supports multi-mode tactile imprints and various sensor configurations remains a challenging problem. In this paper, we present TacFlex, a flexible simulator for visuotactile sensors, which physically simulates the elastomer deformation using Finite Element Methods, and focuses on linking the deformed elastomer mesh to diverse tactile imprints, including tactile images with arbitrary coating patterns and tactile 3D point clouds. We further propose a ray tracing-based rectification method to deal with multi-medium refraction effects to make the simulated tactile images more realistic. Extensive experiments are conducted to show the effectiveness of TacFlex on various sensors. Furthermore, we explore the Sim2Real performance of different tactile imprints provided by TacFlex in tactile perception and manipulation tasks, such as cylindrical object pose estimation and peg-in-hole. The perception/policy models trained in simulation are successfully deployed in the real world. Finally, we discuss TacFlex's potential in robot learning.

Abstract:
This work presents a closed-loop experimental framework for connectivity-aware urban search and rescue (SAR) using heterogeneous unmanned ground vehicles (UGVs) and unmanned aerial vehicles (UAVs). The setup couples a physics-based urban digital twin in NVIDIA Isaac Sim with Robot Operating System 2 (ROS2) orchestration, a Proximal Policy Optimization (PPO) multi-agent reinforcement learning (MARL) controller, and a fifth-generation (5G) link evaluation pipeline based on ns-3/5G-LENA key performance indicators (KPIs). Two UGVs execute mission-directed navigation toward a hazard region, while two UAV relays and a gNB-like aerial anchor adapt their positions to sustain end-to-end service under line-of-sight and non-line-of-sight transitions induced by urban occlusions. Preliminary simulation results validate end-to-end operability and provide quantitative evidence of simultaneous mission progress and network continuity. Across a representative episode, the minimum distance to the hazard-region center decreases from 27.9 m to 1.55 m (final 1.80 m), while latency remains in a low regime (mean 4.88 ms, p95 8.17 ms). Packet loss is bounded (mean 3.5% and 2.2% for the two UGVs), and outages are sparse (101 steps over 9000), even during partial traversal of building-dense areas. The platform enables systematic diagnosis of mobility-connectivity coupling and supports transfer-oriented refinement of relay control and coordination policies.

Abstract:
In this paper, we present a hierarchical simultaneous localization and mapping (SLAM) system that leverages point-level features, mid-level geometric organized edge representations, and high-level object semantics within a unified framework. While object-level SLAM provides semantic information and improves long-term data association, it often suffers from coarse geometric constraints and unreliable detections. In contrast, organized edge representations capture rich structural and textural information, offering stable geometric cues in low-texture or challenging environments. By hierarchically integrating these complementary representations, the proposed system achieves robust camera tracking, reliable data association, and consistent mapping.

Abstract:
A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8 hours of trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum. All code, retargeted datasets, and result videos can be found at https://omniretarget.github.io.

Abstract:
Fish achieve efficient swimming across varied speeds through active modulation of their body flexibility. To explore the effects of tunable stiffness on swimming performance, we present a bio-inspired freely-swimming fish robot with a rapidly tunable particle jamming body. This design enables rapid stiffness adjustments with negligible changes in shape or volume, achieving a 54% variation in flexural rigidity across vacuum pressures of 0 to 40 kPa. We visualize the midline of the oscillating body under both low and high stiffness conditions, and the comparison confirms that the body curvature varies with stiffness. We further experimentally evaluate the tunable stiffness body's effects on swimming performance using velocity and cost of transport (CoT) measurements obtained via a motion tracking system. Results show that active stiffness tuning is essential for sustaining efficient and high-speed swimming across beating frequencies of 1-3 Hz. At low frequencies (1-1.5 Hz), a softer body (0 kPa) maximizes velocity and minimizes CoT, whereas at high frequencies (2.5-3 Hz), a stiffer body (40 kPa) delivers superior velocity and reduced transport cost. These findings highlight stiffness modulation as a key strategy for adaptive and efficient propulsion in bio-inspired robotic swimmers.

Abstract:
Despite rapid commercialization of surgical robots, their autonomy and real-time decision-making remain limited in practice. To address this gap, we propose ArthroCut, an autonomous policy learning framework that upgrades knee arthroplasty robots from assistive execution to context-aware action generation. ArthroCut fine-tunes a Qwen-VL backbone on a self-built, time-synchronized multimodal dataset from 21 complete cases (23,205 RGB-D pairs), integrating preoperative CT/MR, intraoperative NDI tracking of bones and end effector, RGB-D surgical video, robot state, and textual intent. The method operates on two complementary token families: Preoperative Imaging Tokens (PIT), which encode patient-specific anatomy and planned resection planes, and Time-Aligned Surgical Tokens (TAST), which fuse real-time visual, geometric, and kinematic evidence. It emits an interpretable action grammar under grammar/safety-constrained decoding. In bench-top experiments on a knee prosthesis across seven trials, ArthroCut achieves an average success rate of 86% over the six standard resections, significantly outperforming strong baselines trained under the same protocol. Ablations show that TAST is the principal driver of reliability while PIT provides essential anatomical grounding, and their combination yields the most stable multi-plane execution. These results indicate that aligning preoperative geometry with time-aligned intraoperative perception and translating that alignment into tokenized, constrained actions is an effective path toward robust, interpretable autonomy in orthopedic robotic surgery.

Abstract:
Cardiac ultrasound diagnosis is critical for cardiovascular disease assessment, but acquiring standard views remains highly operator-dependent. Existing medical segmentation models often yield anatomically inconsistent results in images with poor textural differentiation between distinct feature classes, while autonomous probe adjustment methods either rely on simplistic heuristic rules or black-box learning. To address these issues, we propose an anatomical prior (AP)-driven framework integrating cardiac structure segmentation and autonomous probe adjustment for standard view acquisition. A YOLO-based multi-class segmentation model augmented by a spatial-relation graph (SRG) module is designed to embed AP into the feature pyramid. Quantifiable anatomical features of standard views are extracted, and their priors are fitted to Gaussian distributions to construct probabilistic APs. The probe adjustment process of robotic ultrasound scanning is formalized as a reinforcement learning (RL) problem, with the RL state built from real-time anatomical features and the reward reflecting the AP matching. Experiments validate the efficacy of the framework. The SRG-YOLOv11s improves mAP50 by 11.3% and mIoU by 6.8% on the Special Case dataset, while the RL agent achieves a 92.5% success rate in simulation and 86.7% in phantom experiments.

Abstract:
This letter introduces a novel method of non-rigid motion compensation for in situ bioprinting. Most bioprinting platforms use open-loop systems, which raises concerns about patient safety and suboptimal wound coverage in case of patient motion. To handle these issues, our method integrates an RGB-D camera to manage orientation and to predict deformations, along with a laser telemeter to regulate deposited material thickness. The proposed approach has been evaluated on a moving silicone platform that deforms at 0.8 Hz with a 4 mm in-plane amplitude and a 20 mm elevation amplitude. Our method resulted in a wound coverage error of less than 1%. Comparative analysis demonstrates a 73.0% improvement in deforming path following compared to existing methods. Additionally, by predicting surface motion, the method enables more precise control of layer height, with an error below 0.1 mm.

Abstract:
Efficient target localization and autonomous navigation in complex environments are fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on ground-truth depth and pose information, which restricts applicability in real-world scenarios; and (2) lack of visual in-context learning (VICL) capability to extract geometric and semantic priors from environmental context, as in a short traversal video. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong VICL capability. By simply observing a short video of the target environment, the system can also significantly improve task efficiency without requiring architectural modifications or task-specific retraining. The framework integrates several key components: keyframe-based 3D reconstruction, semantic point cloud generation, vision-language model (VLM)-driven exploration value estimation, high-level adaptive waypoint selection, and low-level action execution. Experiments on the HM3D benchmark and real-world environments demonstrate that RANGER achieves competitive performance in terms of navigation success rate and exploration efficiency, while showing superior VICL adaptability, with no previous 3D mapping of the environment required.

Abstract:
In urban pedestrian zones where autonomous vehicles (AVs) increasingly operate alongside humans, clear communication between AVs and pedestrians is essential for safety and trust. This exploratory study examined pedestrian reactions to an autonomous shuttle bus (AutoBus) operating on a university campus. Using a real-world deployment, the effectiveness of visual and auditory communication cues was evaluated. The AutoBus continuously looped a 480-meter path on campus during lunchtime, and pedestrians who walked toward and crossed the bus were invited to complete an online survey following this interaction. Data was collected from 58 participants at a technical university through behavioral observations and post-interaction surveys. The results reveal that visual cues were more consistently recognized than auditory ones, influencing pedestrian awareness and response. Trust in AV's safety was shaped more by its perceived safety than by prior experiences with the AutoBus. Moreover, willingness to yield was positively associated with the perceived social status of the AV, but not with whether it was perceived as an autonomous robot or as representing its passengers. These findings offer practical insights for improving AV communication design to support safer, more intuitive interactions in shared spaces.

Abstract:
Can a robot navigate a cluttered environment without an explicit map? Reactive methods that use only the robot's current sensor data and local information are fast and flexible, but prone to getting stuck in local minima. Is there a middle ground between reactive methods and map-based path planners? In this paper, we investigate feedforward and recurrent networks to augment a purely reactive sensor-based navigation algorithm, which should give the robot "geometric intuition" about how to escape local minima. We train on a large number of extremely cluttered simulated worlds, auto-generated from primitive shapes, and show that our system zero-shot transfers to worlds based on real data, 3D man-made environments, and can handle up to 30% sensor noise without degradation of performance. We also offer a discussion of what role network memory plays in our final system, and what insights can be drawn about the nature of reactive vs. map-based navigation. The implementation of the planners and all experiments is made available open-source.

Abstract:
We report a novel stable Model-Based Adaptive Velocity Tracking Controller (AVTC) for ground vehicles capable of asymptotically exactly tracking longitudinal and yaw reference velocities and simultaneously adaptively identifying unknown plant parameters and actuator parameters. The reported AVTC is developed for velocity control of the commonly accepted three degree-of-freedom second-order dynamic bicycle model for ground vehicles. A Lyapunov analysis shows asymptotic stability of the velocity tracking error in longitudinal and yaw velocities, boundedness of all signals, and convergence of the adaptive parameter estimates. A performance evaluation of the proposed AVTC is reported including numerical simulation evaluation and experimental evaluation that corroborates the analytical predictions of stability and tracking, and compares its performance to its non-adaptive counterpart and two alternative controllers. AVTC only requires body-frame velocities and control input signals and robustly detects, quantitatively identifies, and compensates dynamically in real-time for faults arising from changes to plant, actuator, and environmental parameters during operations.

Abstract:
This paper proposes a consensus-based sliding mode controller (CSMC) for multi-robot formation control. The framework integrates Laplacian-based consensus with sliding mode robustness and adaptive formation scaling to simultaneously achieve accurate formation tracking and high formation consistency, while ensuring flexibility in constrained environments. The approach is validated in NVIDIA Isaac Sim and real-world experiments with Mecanum-wheeled robots. Compared with conventional sliding mode control (SMC), CSMC achieves consistent improvements in formation consistency, tracking accuracy, and overall performance in both simulation and real-world experiments. When compared with flocking-based approaches, CSMC provides substantially improved tracking performance and achieves better overall performance under consistency-prioritized evaluation metrics. These results demonstrate the effectiveness of CSMC in achieving reliable formation tracking, consistent coordination, and adaptive formation scaling for multi-robot navigation.
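
To make the Laplacian-based consensus ingredient concrete, here is a minimal sketch of a plain consensus formation update with a formation scale factor. It deliberately omits the sliding mode term and is not the CSMC controller itself; the velocity-controlled point robots, the path-graph adjacency, and all names (`consensus_step`, `formation_error`, `scale`, `gain`) are assumptions for illustration.

```python
import math

def _shifted(positions, offsets, scale):
    """Positions with each robot's scaled formation offset removed."""
    return [(p[0] - scale * d[0], p[1] - scale * d[1])
            for p, d in zip(positions, offsets)]

def consensus_step(positions, offsets, adjacency, scale=1.0, gain=0.5, dt=0.1):
    """One Euler step of Laplacian consensus on the offset-shifted positions."""
    z = _shifted(positions, offsets, scale)
    updated = []
    for i, p in enumerate(positions):
        vx = vy = 0.0
        for j, z_j in enumerate(z):
            w = adjacency[i][j]  # zero weight means robots i and j are not linked
            vx -= gain * w * (z[i][0] - z_j[0])
            vy -= gain * w * (z[i][1] - z_j[1])
        updated.append((p[0] + dt * vx, p[1] + dt * vy))
    return updated

def formation_error(positions, offsets, scale=1.0):
    """Largest disagreement between shifted positions (zero in formation)."""
    z = _shifted(positions, offsets, scale)
    return max(math.dist(z_i, z[0]) for z_i in z)

# Three robots on a path graph, asked to form a line stretched by scale=2.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
offsets = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
positions = [(0.0, 2.0), (3.0, -1.0), (-2.0, 0.5)]
for _ in range(300):
    positions = consensus_step(positions, offsets, adjacency, scale=2.0)
final_error = formation_error(positions, offsets, scale=2.0)
```

Changing `scale` shrinks or stretches the same formation shape without redefining the offsets, which is one simple way to picture the formation scaling mentioned in the abstract.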

Abstract:
Healthcare robotics requires robust multimodal perception and reasoning to ensure safety in dynamic clinical environments. Current Vision-Language Models (VLMs) demonstrate strong general-purpose capabilities but remain limited in temporal reasoning, uncertainty estimation, and structured outputs needed for robotic planning. We present a lightweight agentic multimodal framework for video-based scene understanding. Combining the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, it supports chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation. The framework generates structured scene graphs and leverages a hybrid retrieval module for interpretable and adaptive reasoning. Evaluations on the Video-MME benchmark and a custom clinical dataset show competitive accuracy and improved robustness compared to state-of-the-art VLMs, demonstrating its potential for applications in robot-assisted surgery, patient monitoring, and decision support.

Abstract:
Bioinspired event cameras, with their high temporal resolution, low power consumption, and inherent motion responsiveness, have been widely adopted for fundamental vision tasks in robotics, notably optical flow estimation. Recent studies have shown that incorporating complementary frame data can significantly enhance the performance of event-based optical flow estimation. However, two major challenges hinder the real-time deployment of such methods on robotic platforms: (1) the asynchronous nature of events and frames makes it difficult to generalize across varying input temporal offsets; and (2) reliance on computationally expensive correlation volume construction and iterative refinement results in high inference latency on embedded systems. To address these issues, we propose a novel method that takes asynchronous event and frame streams as input and predicts high-quality dense flow in a single forward pass. Our approach temporally encodes both intra- and inter-sensor features and efficiently integrates them into a lightweight correlation volume to enhance flow prediction. Experimental results on real-world scenes demonstrate that our method improves flow accuracy by up to 22% over state-of-the-art hybrid event-frame methods, while being 3x faster on embedded GPUs. Furthermore, our approach maintains strong performance and generalizes well across diverse frame-event temporal offsets, introducing a novel paradigm for fusing asynchronous frame and event streams for continuous-time optical flow estimation.

Abstract:
We present UniFuture, a unified 4D Driving World Model designed to simulate the dynamic evolution of the 3D physical world. Unlike existing driving world models that focus solely on 2D pixel-level video generation (lacking geometry) or static perception (lacking temporal dynamics), our approach bridges appearance and geometry to construct a holistic 4D representation. Specifically, we treat future RGB images and depth maps as coupled projections of the same 4D reality and model them jointly within a single framework. To achieve this, we introduce a Dual-Latent Sharing (DLS) scheme, which maps visual and geometric modalities into a shared spatio-temporal latent space, implicitly entangling texture with structure. Furthermore, we propose a Multi-scale Latent Interaction (MLI) mechanism, which enforces bidirectional consistency: geometry constrains visual synthesis to prevent structural hallucinations, while visual semantics refine geometric estimation. During inference, UniFuture can forecast high-fidelity, geometrically consistent 4D scene sequences (image-depth pairs) from a single current frame. Extensive experiments on the nuScenes and Waymo datasets demonstrate that our method outperforms specialized models in both future generation and geometry perception, highlighting the potential of unified 4D modeling for autonomous driving. The code is available at https://github.com/dk-liang/UniFuture.

Abstract:
Reliable localization is crucial for navigation in forests, where GPS is often degraded and LiDAR measurements are repetitive, occluded, and structurally complex. These conditions weaken the assumptions of traditional urban-centric localization methods, which assume that consistent features arise from unique structural patterns, necessitating forest-centric solutions to achieve robustness in these environments. To address these challenges, we propose TreeLoc, a LiDAR-based global localization framework for forests that handles place recognition and 6-DoF pose estimation. We represent scenes using tree stems and their diameter at breast height (DBH), which are aligned to a common reference frame via their axes and summarized using the tree distribution histogram (TDH) for coarse matching, followed by fine matching with a 2D triangle descriptor. Finally, pose estimation is achieved through a two-step geometric verification. On diverse forest benchmarks, TreeLoc outperforms baselines, achieving precise localization. Ablation studies validate the contribution of each component. We also propose applications for long-term forest management using descriptors from a compact global tree database. TreeLoc is open-sourced for the robotics community at https://github.com/minwoo0611/TreeLoc.

Abstract:
3DGS has shown outstanding performance in multi-view geometry, driving its adoption in visual SLAM. However, real-time semantic 3DGS mapping faces challenges. Current methods typically treat semantics as external priors, making it hard to integrate them into SLAM tracking or loop closure correction. Moreover, traditional semantic SLAM corrects accumulated drift by applying rigid adjustments to dense point clouds, which is costly for 3DGS maps and limits loop closure performance. We propose GauSem-SLAM, which uses a Gaussian semantic submap representation with a progressive allocation strategy, integrating semantics into tracking, mapping, loop detection, and submap management. We fully exploit semantic information by designing a robust loop detection module that combines DINOv2 semantic features with 3D semantic landmarks. Furthermore, we introduce Semantic-Guided Registration (SGR), a method for computing inter-submap loop constraints. Through intra-submap and inter-submap loop correction, followed by a two-stage global map refinement, our system achieves globally consistent pose estimation and mapping. Experiments on three public datasets demonstrate that our method outperforms prior methods in both tracking and mapping.

Abstract:
Large language models (LLMs) can translate natural language instructions into executable action plans for robotics, autonomous driving, and other domains. Yet, deploying LLM-driven planning in the physical world demands strict adherence to safety and regulatory constraints, which current models often violate due to hallucination or weak alignment. Traditional data-driven alignment methods, such as Direct Preference Optimization (DPO), require costly human labeling, while recent formal-feedback approaches still depend on resource-intensive fine-tuning. In this paper, we propose LAD-VF, a fine-tuning-free framework that leverages formal verification feedback for automated prompt engineering. By introducing a formal-verification-informed text loss integrated with LLM-AutoDiff, LAD-VF iteratively refines prompts rather than model parameters. This yields three key benefits: (i) scalable adaptation without fine-tuning; (ii) compatibility with modular LLM architectures; and (iii) interpretable refinement via auditable prompts. Experiments in robot navigation and manipulation tasks demonstrate that LAD-VF substantially enhances specification compliance, improving success rates from 60% to over 90%. Our method thus presents a scalable and interpretable pathway toward trustworthy, formally-verified LLM-driven control systems.

Abstract:
Wheeled-legged robots hold promise for traversing complex terrains and offer superior mobility compared to legged robots. However, wheeled-legged robots must effectively balance both wheeled driving and legged control. Furthermore, due to noisy proprioceptive sensing and real-world motor constraints, realizing robust and adaptive locomotion at peak performance of motors remains challenging. We propose the Multi-skill Unified Joint Integration of Control Architecture (MUJICA), a unified, fully proprioceptive control framework for wheeled-legged robots that integrates diverse low-level skills, including omnidirectional moving, high platform climbing, and fall recovery, within a single policy. All skills, distinguished by unique indicator variables, are trained jointly with accurate DC-motor constraint modeling. Additionally, a high-level skill selector is learned to dynamically choose the optimal skill based solely on proprioceptions, enabling adaptive responses to the surrounding environment. Therefore, MUJICA enhances sim-to-real robustness and enables seamless transitions across diverse locomotion modes, facilitating autonomous adjustment to the environment. We validate our framework in both simulation and real-world experiments on the Unitree Go2-W robot, demonstrating significant improvements in adaptability and task success in unstructured environments.

Abstract:
Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller employs linear projection, perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models in enhancing scene-level and place-aware synthesis, providing a valuable approach for improving place recognition. Code will be released.

Abstract:
Humans instinctively adjust their viewpoints to resolve occlusions and infer spatial relationships, enabling effective perception and navigation in cluttered environments. This capability, however, remains a significant challenge for robotic systems. To address this, we propose GPD-AP, a novel active perception framework that leverages grasp pose estimation and associated scoring to systematically tackle grasping tasks in occluded and cluttered settings. The core innovation lies in an end-to-end system where a computationally efficient grasp pose estimation module directly informs a Next-Best-View (NBV) planner. This integration shifts the focus from generic scene exploration to a grasp-oriented visual search, guiding the robot to viewpoints that minimize uncertainty about potential grasps. To train and validate GPD-AP, we introduce a simulation reset method capable of generating highly challenging scenes with partially or fully occluded target objects. Experimental results demonstrate that GPD-AP improves grasping success rates by 30% in dense obstacle environments, effectively enabling the transition of target objects from invisible to visible and graspable states. This work marks a significant step towards autonomous and intelligent robotic manipulation in unstructured real-world scenarios.

Abstract:
Relative pose estimation is crucial for coordinated multi-robot navigation. However, robots in close proximity often face intra-team occlusions, where teammates partially block each other's field of view, while dynamic environments further introduce environmental occlusions. Classical relative pose estimation methods degrade under occlusion and texture scarcity, whereas learning-based methods often lack explicit geometric consistency, which limits their accuracy during real deployments. To address multi-robot relative pose estimation in complex 3D environments, we introduce Geometric-Aware Diffusion Matching (GADM), which enables a team of robots to estimate relative 6-DoF poses using only RGB-D sensors, even under occlusions. GADM uses a diffusion model to progressively exploit global and higher-order structural constraints encoded by a graph network, guiding smoother optimization and faster convergence to robust correspondence distributions under noise and occlusions. By integrating geometric consistency, GADM explicitly addresses occlusions by producing geometrically consistent matches suitable for real-time deployment on physical robots. The resulting correspondences are then used with geometry-based solvers to estimate 6-DoF relative poses, providing robustness even under partial view overlap and limited keypoint visibility. We conducted experiments using both robotics simulations and physical robot teams, and our results show that GADM achieves robust 6-DoF pose estimation performance in occluded scenarios.

Abstract:
Multi-robot systems are increasingly deployed in high-risk missions such as reconnaissance, disaster response, and subterranean operations. Protecting a human operator while navigating unknown and adversarial environments remains a critical challenge, especially when communication between the operator and the robots is restricted. Unlike existing collaborative exploration methods that aim for complete coverage, this work focuses on task-oriented exploration to minimize the navigation time of the operator to reach its goal while ensuring safety under adversarial threats. A novel escorting framework, BodyGuards, is proposed to seamlessly integrate collaborative exploration, inter-robot-operator communication, and escorting. The framework consists of three core components: (I) a dynamic movement strategy for the operator that maintains a local map with risk zones for proactive path planning; (II) a dual-mode robotic strategy combining frontier-based exploration with optimized return events to balance exploration, threat detection, and intermittent communication; and (III) multi-robot coordination protocols that jointly plan exploration and information sharing for efficient escorting. Extensive human-in-the-loop simulations and hardware experiments demonstrate that the method significantly reduces operator risk and mission time, outperforming baselines in adversarial and constrained environments.

Abstract:
Freespace detection in unstructured off-road environments is critical for safe autonomous navigation but remains highly challenging due to ambiguous boundaries, diverse terrains, and long-tail safety-critical cases. Constructing large annotated datasets in such environments is prohibitively costly, which makes active learning essential to maximize model robustness under limited annotation budgets. However, conventional uncertainty or diversity-based strategies are unreliable in these complex settings, often failing to capture rare yet important scenarios. To address this, we propose FALCO, a foundation model guided active learning framework for cost-effective off-road freespace detection. FALCO integrates three complementary criteria: prediction deviation from a vision foundation model, model uncertainty, and semantic evaluation from a vision-language model to form a reliable sample criticality score. In addition, we introduce a semantic grid based sampling strategy that balances coverage across scene conditions while prioritizing challenging cases. Extensive experiments show that FALCO substantially improves robustness on rare and difficult scenarios, achieving significant gains in low-percentile IoU compared to state-of-the-art baselines, while maintaining competitive overall performance.

Abstract:
Assembly hinges on reliably forming connections between parts; yet most robotic approaches plan assembly sequences and part poses while treating connectors as an afterthought. Connections represent the foundational physical constraints of assembly execution; while task planning sequences operations, the precise establishment of these constraints ultimately determines assembly success. In this paper, we treat connections as explicit, primary entities in assembly representation, directly encoding connector types, specifications, and locations for every assembly step. Drawing inspiration from how humans learn assembly tasks through step-by-step instruction manuals, we present Manual2Skill++, a vision-language framework that automatically extracts structured connection information from assembly manuals. We encode assembly tasks as hierarchical graphs where nodes represent parts and sub-assemblies, and edges explicitly model connection relationships between components. A large-scale vision-language model parses symbolic diagrams and annotations in manuals to instantiate these graphs, leveraging the rich connection knowledge embedded in human-designed instructions. We curate a dataset containing over 20 assembly tasks with diverse connector types to validate our representation extraction approach, and evaluate the complete task understanding-to-execution pipeline across four complex assembly scenarios in simulation, spanning furniture, toys, and manufacturing components with real-world correspondence. More detailed information can be found at https://nus-lins-lab.github.io/Manual2SkillPP/

Abstract:
Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Reinforcement learning (RL) has emerged as a powerful method for controlling AVs, but scaling to fleets (essential for multi-target tracking or rapidly moving targets) is challenging. Multi-Agent RL (MARL) is notoriously sample-inefficient, and while high-fidelity simulators like Gazebo's LRAUV provide up to 100× faster-than-real-time single-robot simulations, they offer little speedup in multi-vehicle scenarios, making MARL training impractical. Yet, high-fidelity simulation is crucial to test complex policies and close the sim-to-real gap. To address these limitations, we develop a GPU-accelerated environment that achieves up to 30,000× speedup over Gazebo while preserving its dynamics. This enables fast, end-to-end GPU training and seamless transfer to Gazebo for evaluation. We also introduce a Transformer-based architecture (TransfMAPPO) that learns policies invariant to fleet size and number of targets, enabling curriculum learning to train larger fleets on increasingly complex scenarios. After large-scale GPU training, we perform extensive evaluations in Gazebo, showing our method maintains tracking errors below 5 m even with multiple fast-moving targets.

Abstract:
This paper proposes a novel nonlinear disturbance observer (NDO) based dual quaternion dynamics modeling and control framework for a drone with a cable-suspended load. Leveraging dual quaternions, a compact and singularity-free mathematical representation, we derive a unified dynamic model that captures the coupled translational and rotational dynamics of both the drone and the slung load. NDOs are designed to estimate and compensate for uncertainties and external disturbances affecting the drone and the load. Building on this framework, we develop a robust control strategy that ensures precise trajectory tracking of the slung load while maintaining stable drone attitude control. The effectiveness of the proposed approach is validated through comprehensive simulations and real-world experiments on a cargo drone platform. The results highlight the robustness and reliability of the system in practical scenarios, demonstrating its potential application in cargo transportation.

Abstract:
In human-robot collaboration, shared autonomy enhances human performance through precise, intuitive support. Effective robotic assistance requires accurately inferring human intentions and understanding task structures to determine optimal support timing and methods. In this paper, we present SUBTA, a supported teleoperation system for bimanual assembly that couples learned intention estimation, scene-graph task planning, and context-dependent motion assists. We validate our approach through a user study (N=12) comparing standard teleoperation, motion-support only, and SUBTA. Linear mixed-effects analysis revealed that SUBTA significantly outperformed standard teleoperation in position accuracy (p<0.001, d=1.18) and orientation accuracy (p<0.001, d=1.75), while reducing mental demand (p=0.002, d=1.34). Post-experiment ratings indicate clearer, more trustworthy visual feedback and predictable interventions in SUBTA. The results demonstrate that SUBTA greatly improves both effectiveness and user experience in teleoperation.

Abstract:
Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2× higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2×. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.

Abstract:
We present HAND, a simple and time-efficient method for teaching robots new manipulation tasks through human hand demonstrations. Instead of relying on task-specific robot demonstrations collected via teleoperation, HAND uses easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data. Using a visual tracking pipeline, HAND extracts the motion of the human hand from the hand demonstration and retrieves robot sub-trajectories in two stages: first filtering by visual similarity, then retrieving trajectories with similar behaviors to the hand. Fine-tuning a policy on the retrieved data enables real-time learning of tasks in under four minutes, without requiring calibrated cameras or detailed hand pose estimation. Experiments also show that HAND outperforms retrieval baselines by over 2x in average task success rates on real robots. Videos can be found at our project website: https://liralab.usc.edu/handretrieval.

Abstract:
Recent progress in dense SLAM has primarily targeted monocular setups, often at the expense of robustness and geometric coverage. We present MCGS-SLAM, the first purely RGB-based multi-camera SLAM system built on 3D Gaussian Splatting (3DGS). Unlike prior methods relying on sparse maps or inertial data, MCGS-SLAM fuses dense RGB inputs from multiple viewpoints into a unified, continuously optimized Gaussian map. A multi-camera bundle adjustment (MCBA) jointly refines poses and depths via dense photometric and geometric residuals, while a scale consistency module enforces metric alignment across views using low-rank priors. The system supports RGB input and maintains real-time performance at large scale. Experiments on synthetic and real-world datasets show that MCGS-SLAM consistently yields accurate trajectories and photorealistic reconstructions, usually outperforming monocular baselines. Notably, the wide field of view from multi-camera input enables reconstruction of side-view regions that monocular setups miss, critical for safe autonomous operation. These results highlight the promise of multi-camera Gaussian Splatting SLAM for high-fidelity mapping in robotics and autonomous driving.

Abstract:
Footstep planning involves a challenging combinatorial search. Traditional A* approaches require discretising reachability constraints, while Mixed-Integer Programming (MIP) supports continuous formulations but quickly becomes intractable, especially when rotations are included. We present CASSR, a novel framework that recursively propagates convex, continuous formulations of a robot's kinematic constraints within an A* search. Combined with a new cost-to-go heuristic based on the EPA algorithm, CASSR efficiently plans contact sequences of up to 30 footsteps in under 125 ms. Experiments on biped locomotion tasks demonstrate that CASSR outperforms traditional discretised A* by up to a factor of 100, while also surpassing a commercial MIP solver. These results show that CASSR enables fast, reliable, and real-time footstep planning for biped robots.

Abstract:
Robotic quality inspection is emerging as a key enabler in intelligent manufacturing, allowing robots to transcend human limitations in endurance, consistency, and access to complex structures. By detecting subtle defects with speed and precision, robotic inspection enhances efficiency while elevating production quality. While most existing approaches emphasize 2D image-based surface defect detection, they often overlook geometric defects, which are more prevalent and challenging in industrial inspection. To overcome this gap, we formulate geometric defect detection as anomaly detection in 3D point clouds and propose a novel framework that integrates contrastive learning with spatially aware comparisons of local geometries. Specifically, we partition point cloud surfaces into patches and employ contrastive learning to train a neural network-based feature extractor capable of capturing rich geometric representations. An anomaly detection algorithm is then introduced to identify defects by comparing patch-level features in a spatially consistent manner. Evaluated on the recent Real3D-AD benchmark, our method achieves a mean area under the ROC curve of 0.901, establishing a new state of the art and demonstrating the potential of robotic inspection systems to move beyond human limitations in detecting subtle geometric anomalies.

Abstract:
Sequences of interdependent geometric constraints are central to many multi-agent Task and Motion Planning (TAMP) problems. However, existing methods for handling such constraint sequences struggle with partially ordered tasks and dynamic agent assignments. They typically assume static assignments and cannot adapt when disturbances alter task allocations. To overcome these limitations, we introduce Graph-of-Constraints Model Predictive Control (GoC-MPC), a generalized sequence-of-constraints framework integrated with MPC. GoC-MPC naturally supports partially ordered tasks, dynamic agent coordination, and disturbance recovery. By defining constraints over tracked 3D keypoints, our method robustly solves diverse multi-agent manipulation tasks, coordinating agents and adapting online from visual observations alone, without relying on training data or environment models. Experiments demonstrate that GoC-MPC achieves higher success rates, significantly faster TAMP computation, and shorter overall paths compared to recent baselines, establishing it as an efficient and robust solution for multi-agent manipulation under real-world disturbances. Our supplementary video and code can be found at https://sites.google.com/view/goc-mpc/home.

Abstract:
Trajectory sampling is a key component of sampling-based control mechanisms. Trajectory samplers rely on control input samplers, which generate control inputs u from a distribution p(u | x) where x is the current state. We introduce the notion of Free Configuration Space Uniformity (C-Free-Uniform for short) which has two key features: (i) the generated control input can be used to uniformly sample the free configuration space, and (ii) in contrast to previously introduced trajectory sampling mechanisms where the distribution p(u | x) is independent of the environment, C-Free-Uniform is explicitly conditioned on the current local map. Next, we integrate this sampler into a new Model Predictive Path Integral (MPPI) Controller, CFU-MPPI. Experiments show that CFU-MPPI outperforms existing methods in terms of success rate in challenging navigation tasks in cluttered polygonal environments while requiring a much smaller sampling budget. Code: https://github.com/ogpoyrazoglu/cuniform_sampling.
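For context, the importance-weighted update that a generic MPPI controller applies to its sampled control sequences can be sketched as below. This is a minimal illustration of standard MPPI weighting, not the CFU-MPPI implementation: the function name `mppi_update`, the `temperature` parameter, and the toy inputs are assumptions, and the C-Free-Uniform sampler that would supply `controls` is not shown.

```python
import math


def mppi_update(controls, costs, temperature):
    """Return the cost-weighted average of sampled control sequences.

    controls: list of sampled control sequences (each a list of floats).
    costs: one rollout cost per sampled sequence.
    temperature: positive scalar; smaller values favour low-cost samples.
    """
    best = min(costs)
    # Subtracting the minimum cost keeps the exponentials numerically stable.
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    horizon = len(controls[0])
    return [
        sum(w * u[t] for w, u in zip(weights, controls)) / total
        for t in range(horizon)
    ]
```

With equal costs the update is the plain mean of the samples; as one cost grows, the corresponding sample's influence decays exponentially. This is why the quality of the sampling distribution matters when the sampling budget is small.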

Abstract:
Model-based reinforcement learning (MBRL) is a promising approach to enabling robots to learn directly from a limited number of real-world interactions. However, MBRL is notoriously difficult in settings without full state observability because algorithms must simultaneously infer state from incomplete observations and use these inferences to learn environment dynamics. Toward the use of MBRL for autonomous robots, we introduce EMBRL, an expectation-maximization framework that combines classical Bayesian state estimation with deep MBRL to jointly infer states and learn neural network state transition models. This framework takes advantage of the rich theory and practice of state estimation from the field of robotics, while enabling behavior learning without a priori known robot dynamics. Though conceptually straightforward, our instantiation of this framework for deep MBRL reveals several key challenges when using a learned transition model both for state inference and policy learning. We introduce a practical implementation of EMBRL using both particle and extended Kalman filters and smoothers and discuss key design choices necessary for effective implementation. Finally, we evaluate different instantiations of the EMBRL framework on both simulated and real-robot tasks and show that our methods learn higher performing policies compared to strong MBRL baselines using recurrent neural networks.

Abstract:
How to teach sensorimotor skills in haptic virtual environments is a classic research question and has been investigated with different target skills and strategies. In this study, we examined how to assist users by modulating haptic sensations in the learning environment, presented via a force-feedback haptic device. We developed a haptic amplification method and evaluated its effectiveness on skill training with the target skill of needle felting. To this end, we first collected force profile data captured from an expert's work and amplified the magnitude of the force so that it could be felt clearly. Then, the augmented haptic sensations were rendered in the virtual learning environment. We assessed the usefulness of our method by conducting a user study with 24 participants performing virtual needle felting tasks involving many micro-movements. As a result, amplified force profile feedback significantly improved the novice participants' learning performance. Based on the results, we then discussed how to provide an adequate haptic feedback method for learning tasks, especially in fields requiring precise dexterous or tool movements.

Abstract:
Accurate visual localization is crucial for autonomous driving, yet existing methods face a fundamental dilemma: While high-definition (HD) maps provide high-precision localization references, their costly construction and maintenance hinder scalability, which drives research toward standard-definition (SD) maps like OpenStreetMap. Current SD-map-based approaches primarily focus on Bird's-Eye View (BEV) matching between images and maps, overlooking a ubiquitous signal: noisy GPS. Although GPS is readily available, it suffers from multipath errors in urban environments. We propose DiffVL, the first framework to reformulate visual localization as a GPS denoising task using diffusion models. Our key insight is that a noisy GPS trajectory, when conditioned on visual BEV features and SD maps, implicitly encodes the true pose distribution, which can be recovered through iterative diffusion refinement. DiffVL, unlike prior BEV-matching methods (e.g., OrienterNet) or transformer-based registration approaches, learns to reverse GPS noise perturbations by jointly modeling GPS, SD map, and visual signals, achieving sub-meter accuracy without relying on HD maps. Experiments on multiple datasets demonstrate that our method achieves state-of-the-art accuracy compared to BEV-matching baselines. Crucially, our work proves that diffusion models can enable scalable localization by treating noisy GPS as a generative prior, marking a paradigm shift from traditional matching-based methods.

Abstract:
Planetary exploration using aerial assets has the potential for unprecedented scientific discoveries on Mars. While NASA's Mars helicopter Ingenuity proved flight in Martian atmosphere is possible, future Mars rotorcraft will require advanced navigation capabilities for long-range flights. One such critical capability is Map-based Localization (MbL) which registers an onboard image to a reference map during flight to mitigate cumulative drift from visual odometry. However, significant illumination differences between rotorcraft observations and a reference map prove challenging for traditional MbL systems, restricting the operational window of the vehicle. In this work, we investigate a new MbL system and propose Geo-LoFTR, a geometry-aided deep learning model for image registration that is more robust under large illumination differences than prior models. The system is supported by a custom simulation framework that uses real orbital maps to produce large amounts of realistic images of the Martian terrain. Comprehensive evaluations show that our proposed system outperforms prior MbL efforts in terms of localization accuracy under significant lighting and scale variations. Furthermore, we demonstrate the validity of our approach across a simulated Martian day, and on real Mars imagery. Code and datasets are available at: https://dpisanti.github.io/geo-loftr/.

Abstract:
Autonomous systems often must predict the motions of nearby agents from partial and noisy data. This paper asks and answers the question: "Can we learn, in real-time, a nonlinear predictive model of another agent's motions?" Our online framework denoises and forecasts such dynamics using a modified sliding-window Hankel Dynamic Mode Decomposition (Hankel-DMD). Partial noisy measurements are embedded into a Hankel matrix, while an associated Page matrix enables singular-value hard thresholding (SVHT) to denoise the Hankel matrix and estimate its rank. A Cadzow projection enforces structured low-rank consistency, yielding a denoised trajectory and local noise variance estimates. From this representation, a time-varying Hankel-DMD lifted linear predictor is constructed for multi-step forecasts. The residual analysis provides variance-tracking signals that can support downstream estimators and risk-aware planning. We validate the approach in simulation under Gaussian and heavy-tailed noise, and experimentally on a dynamic crane testbed. Results show that the method achieves stable variance-aware denoising and short-horizon prediction suitable for integration into real-time control frameworks.
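The Hankel embedding mentioned above, and the anti-diagonal averaging that maps a matrix back to a trajectory (the structure-restoring half of a Cadzow projection), can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, the signal is scalar for simplicity, and the SVHT rank truncation and the Hankel-DMD predictor are omitted.

```python
def hankel_embed(signal, rows):
    """Stack delayed copies of `signal` into a Hankel matrix with `rows` rows."""
    cols = len(signal) - rows + 1
    if rows < 1 or cols < 1:
        raise ValueError("rows must be between 1 and len(signal)")
    # Entry (i, j) holds signal[i + j], so every anti-diagonal is constant.
    return [[signal[i + j] for j in range(cols)] for i in range(rows)]


def hankel_average(matrix):
    """Average each anti-diagonal to recover a signal of length rows + cols - 1."""
    rows, cols = len(matrix), len(matrix[0])
    sums = [0.0] * (rows + cols - 1)
    counts = [0] * (rows + cols - 1)
    for i in range(rows):
        for j in range(cols):
            sums[i + j] += matrix[i][j]
            counts[i + j] += 1
    return [s / c for s, c in zip(sums, counts)]
```

In a full Cadzow iteration, a low-rank approximation of the Hankel matrix would be computed between these two steps; the averaging then restores the Hankel structure that the rank truncation breaks.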

Abstract:
Recent advances in Vision Transformers (ViTs) have significantly improved the performance of Visual Place Recognition (VPR), but their high computational cost, due to the quadratic complexity of self-attention, limits their practical deployment in real-world scenarios. To address this challenge, we propose PAGTM (Positional- and Attention-Guided Token Merging), a training-free token reduction framework designed specifically for ViT-based VPR models. In VPR, preserving the spatial layout of a scene (e.g., road alignment, building structures) and focusing on semantically meaningful regions are both critical for robust matching under viewpoint and appearance variations. However, existing token reduction methods often overlook these aspects, leading to degraded recognition performance. To address this, PAGTM incorporates two key cues. The first is positional proximity, which merges spatially adjacent tokens to maintain the scene's structural layout. The second is attention-based token protection, which retains tokens that receive high attention because they represent regions important for distinguishing places, such as signs or distinctive structures. Without requiring any fine-tuning, PAGTM can be directly applied at inference time and consistently outperforms existing token reduction methods such as ToMe and ToFu across multiple ViT-based VPR models and datasets, achieving a better trade-off between computational efficiency and retrieval accuracy.
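The two cues can be illustrated with a toy sketch on a 1D token sequence. This is not the PAGTM algorithm itself, whose details the abstract does not give. The function `merge_tokens`, the pairwise averaging rule, and the `keep_top` parameter are assumptions chosen only to show adjacent-token merging alongside attention-based protection.

```python
def merge_tokens(tokens, attention, keep_top):
    """Merge adjacent unprotected tokens pairwise; keep high-attention tokens.

    tokens: list of feature vectors (lists of floats) in spatial order.
    attention: one attention score per token.
    keep_top: number of highest-attention tokens that are never merged.
    """
    # Indices of the `keep_top` tokens with the highest attention are protected.
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    protected = set(ranked[:keep_top])
    merged = []
    i = 0
    while i < len(tokens):
        nxt = i + 1
        if i not in protected and nxt < len(tokens) and nxt not in protected:
            # Average two spatially adjacent tokens element-wise.
            merged.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[nxt])])
            i += 2
        else:
            # Protected tokens, or tokens without a mergeable neighbour, pass through.
            merged.append(list(tokens[i]))
            i += 1
    return merged
```

Because only neighbouring tokens are combined, the output keeps the spatial order of the input, while the protected tokens reach later layers unchanged.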

Abstract:
Vision-language-action (VLA) models have shown strong generalization in robotic manipulation through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities to enable beyond-RGB robotic perception and manipulation. The core of our approach is the sensor-masked image, a unified representation that overlays physically meaningful, spatially grounded masks onto the RGB images. These masks are derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Building on this, we design a multimodal vision-language-action model architecture and train OmniVLA by extending an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks that require sensor-modality perception to guide the manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming the RGB-only and raw-sensor-input baseline models by 59% and 28%, respectively, while showing higher learning efficiency and stronger generalization capability.

Abstract:
With the increasing deployment of robots in dynamic and unpredictable scenarios, it becomes necessary for robots to acquire not only contact-based but also non-contact tactile signals to enhance environmental understanding. However, current non-contact tactile sensors are largely limited to detecting or coarsely recognizing external stimuli, while achieving high spatial resolution typically entails increased sensor density and complex fabrication. This work presents a flexible sparse 2D sensor array, in conjunction with a tailored deep learning model called adaptive spatial-temporal graph convolutional network (ASTGCN), facilitating 3D spatial super-resolution (SR) perception. Built on single-electrode triboelectric nanogenerators with an optimized layout, the sensor array achieves spatial perception while providing a large perception space at low sensor density. Enhanced by the ASTGCN model, this system achieves an average spatial positioning error of 3.11 mm with a physical resolution of only 23 sensors. This research provides novel insights into non-contact haptic perception systems, enabling spatial super-resolution tasks, including spatial trajectory tracking and non-contact gesture classification with 99.33% accuracy, where the gesture classification is used to control a dexterous hand for human-robot interaction.

Abstract:
Realistic animation of real trees is challenging due to the difficulty in accurately capturing and simulating their movements under varying environmental conditions. Most real-tree reconstruction methods focus on the static modeling of trees from RGB images or LiDAR point clouds. Compared with RGB cameras, RGB-D (RGB+Depth) sensors provide a low-cost solution for faithful reconstruction of dynamic tree models in 3D. However, it is difficult to capture and reconstruct a complete dynamic tree with complex branch structures using a single RGB-D sensor. In this paper, we propose Simulation-Ready Tree, a dynamic tree reconstruction framework that synthesizes simulation-ready trees by reconstructing 3D tree models and extracting material properties of tree branches from only a single RGB-D sensor. It starts by pre-scanning multi-view RGB-D images around an outdoor tree. To create a complete static tree point cloud, we present a coarse-to-fine registration method that considers the skeleton features of the main branches in the multi-view tree points. Then, a static tree model is reconstructed from the registered point cloud using an improved space colonization algorithm. Subsequently, a DeT (deep RGB-D tracking) model is employed to track the movements of tree branches during pull-testing, and the material properties of the tree are approximated by Fourier transform and half-power bandwidth methods. Next, a simulation-ready tree is created by constructing its hierarchical structures with corresponding material properties. Finally, the modal analysis method of curved cantilever beams is applied to the simulation-ready tree for animating trees under static load. We demonstrate realistic animation results of our framework by comparing them with the ground-truth RGB-D data sequences for various tree species.

Abstract:
For robots to operate autonomously in densely cluttered environments, they must reason about and potentially physically interact with obstacles to clear a path. Safely clearing a path on challenging terrain, such as a cluttered staircase, requires controlled interaction. For example, a quadrupedal robot can push objects out of the way with one leg while maintaining a stable stance with its three other legs. However, tightly coupled physical actions, such as one-legged pushing, create new constraints on the system that can be difficult to predict at design time. In this work, we present a new method that addresses one such constraint, wherein the object being pushed by a quadrupedal robot with one of its legs becomes occluded from the robot's sensors during manipulation. To address this challenge, we present a tightly coupled perception-action framework that enables the robot to perceive clutter, reason about feasible push paths, and execute the clearing maneuver. Our core contribution is an interaction-aware state estimation loop that uses proprioceptive feedback regarding foot contact and leg position to predict an object's displacement during the occlusion. This prediction guides the perception system to robustly re-detect the object after the interaction, closing the loop between action and sensing to enable accurate tracking even after partial pushes. Using this feedback allows the robot to learn from physical outcomes, reclassifying an object as immovable if a push fails because the object is too heavy. We present results of implementing our approach on a Boston Dynamics Spot robot, which show that our interaction-aware approach achieves higher task success rates and tracking accuracy in pushing objects on stairs compared to open-loop baselines.

Abstract:
Long-horizon dynamical prediction is fundamental in robotics and control, underpinning canonical methods like model predictive control. Yet, many systems and disturbance phenomena are difficult to model due to effects like nonlinearity, chaos, and high-dimensionality. Koopman theory addresses this by modeling the linear evolution of embeddings of the state under an infinite-dimensional linear operator that can be approximated with a suitable finite basis of embedding functions, effectively trading model nonlinearity for representational complexity. However, explicitly computing a good choice of basis is nontrivial, and poor choices may cause inaccurate forecasts or overfitting. To address this, we present Kalman-Implicit Koopman Operator (KALIKO) Learning, a method that leverages the Kalman filter to implicitly learn embeddings corresponding to latent dynamics without requiring an explicit encoder. KALIKO produces interpretable representations consistent with both theory and prior works, yielding high-quality reconstructions and inducing a globally linear latent dynamics. Evaluated on wave data generated by a high-dimensional PDE, KALIKO surpasses several baselines in open-loop prediction and in a demanding closed-loop simulated control task: stabilizing an underactuated manipulator's payload by predicting and compensating for strong wave disturbances.

Abstract:
Novel view synthesis is a key task for dynamic scene reconstruction, where high rendering speed is essential for applications such as virtual reality. Existing deformable Gaussian Splatting methods achieve high-fidelity dynamic scene modeling, but still face limitations in memory usage and rendering efficiency due to large numbers of redundant Gaussians. To address these challenges, we propose Geometry-Aware Redundancy Optimization (GARO), a unified redundancy measurement framework in the adaptive density control stage of the traditional dynamic scene reconstruction pipeline. This framework first selects low-gradient candidates using an optimization activity assessment strategy, and then evaluates geometric complexity through low curvature analysis to further filter and prune redundant points, resulting in a compact and expressive Gaussian representation. Extensive experiments on synthetic and real-world datasets demonstrate that GARO achieves robust trade-offs between quality and speed, with PSNR remaining stable and rendering speed improved by 2×, validating the efficiency and effectiveness of GARO.

Abstract:
Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in spatial locality loss. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, leading to long-term decay in attention allocation and causing the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C^2RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C^2RoPE introduces a spatio-temporal continuous positional embedding mechanism for visual tokens. It first integrates 1D temporal positions with Cartesian-based spatial coordinates to construct a triplet hybrid positional index, and then employs a frequency allocation strategy to encode spatio-temporal positional information across the three index components. Additionally, we introduce Chebyshev Causal Masking, which determines causal dependencies by computing the Chebyshev distance of image tokens in 2D space. Evaluation results across various benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate C^2RoPE's effectiveness. The code is available at https://github.com/ErikZ719/C2RoPE.
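
As a small illustration of the distance that Chebyshev Causal Masking relies on, the sketch below computes pairwise Chebyshev distances between tokens laid out row-major on a 2D grid. The function name and grid size are assumptions; how these distances are turned into an attention mask is not specified in the abstract and is not modeled here.

```python
import numpy as np

def chebyshev_distance_grid(height, width):
    """Pairwise Chebyshev distance between row-major tokens of an image grid."""
    rows, cols = np.divmod(np.arange(height * width), width)
    d_row = np.abs(rows[:, None] - rows[None, :])
    d_col = np.abs(cols[:, None] - cols[None, :])
    return np.maximum(d_row, d_col)  # max(|dy|, |dx|) for every token pair

dist = chebyshev_distance_grid(3, 4)
# Tokens 3 and 4 are neighbors in the 1D sequence but sit at opposite
# ends of consecutive rows, which the 2D distance makes visible.
print(dist[3, 4])
```

The final line shows the spatial locality issue mentioned above: two tokens with adjacent 1D indices can be far apart in the image plane.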

Abstract:
Reinforcement Learning (RL)-based methods have significantly improved the locomotion performance of legged robots. However, these motion policies face significant challenges when deployed in the real world. Robots operating in uncertain environments struggle to adapt to payload variations and external disturbances, resulting in severe degradation of motion performance. In this work, we propose a novel Hybrid Force-Position Locomotion Policy (HFPLP) learning framework, where the action space of the policy is defined as a combination of target joint positions and feedforward torques, enabling the robot to rapidly respond to payload variations and external disturbances. In addition, the proposed Disturbance-Aware Adaptive Compensation (DAAC) provides compensation actions in the torque space based on external disturbance estimation, enhancing the robot's adaptability to dynamic environmental changes. We validate our approach in both simulation and real-world deployment, demonstrating that it outperforms existing methods in carrying payloads and resisting disturbances.
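
One common way to realize an action made of target joint positions plus feedforward torques is a joint-level PD law with an additive torque term. The sketch below assumes that form; the gains, names, and the law itself are illustrative assumptions, since the abstract does not state how HFPLP converts actions into motor commands.

```python
import numpy as np

def joint_torque(q_target, tau_ff, q, dq, kp=20.0, kd=0.5):
    """PD tracking of the target position plus a feedforward torque.

    q_target and tau_ff play the role of the two halves of the policy action;
    kp and kd are hypothetical gains.
    """
    return kp * (q_target - q) - kd * dq + tau_ff

# A payload can be countered through tau_ff without changing the position target.
tau = joint_torque(
    q_target=np.array([0.1]), tau_ff=np.array([2.0]),
    q=np.array([0.0]), dq=np.array([0.0]),
)
print(tau)
```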

Abstract:
Autonomous aerial robots are increasingly being deployed in real-world scenarios, where transparent obstacles present significant challenges to reliable navigation and mapping. These materials pose a unique problem for traditional perception systems because they lack discernible features and can cause conventional depth sensors to fail, leading to inaccurate maps and potential collisions. To ensure safe navigation, robots must be able to accurately detect and map these transparent obstacles. Existing methods often rely on large, expensive sensors or algorithms that impose high computational burdens, making them unsuitable for low Size, Weight, and Power (SWaP) robots. We present a resource-constrained sensing pipeline for detecting and mapping transparent planar obstacles onboard a sub-300g quadrotor. By exploiting Time-of-Flight (ToF) speckle morphology and sonar-gated fusion, our system identifies specular reflections and reprojects their depth into empty space regions in real-time, with safety margins analytically validated for indoor flight speeds. The entire pipeline operates onboard an embedded processor using approximately 20% of a single CPU core at 2 Hz. We validate our system through experiments in controlled and real-world environments, confirming its ability to accurately render transparent obstacles visible. To our knowledge, this is the first CPU-only, real-time demonstration of transparent plane reprojection on a sub-300g quadrotor.

Abstract:
In this work, we propose HE-VPR, a visual place recognition (VPR) framework that incorporates height estimation. Our system decouples height inference from place recognition, allowing both modules to share a frozen DINOv2 backbone. Two lightweight bypass adapter branches are integrated into our system. The first estimates the height partition of the query image via retrieval from a compact height database, and the second performs VPR within the corresponding height-specific sub-database. The adaptation design reduces training cost and significantly decreases the search space of the database. We also adopt a center-weighted masking strategy to further enhance the robustness against scale differences. Experiments on two self-collected challenging multi-altitude datasets demonstrate that HE-VPR achieves up to 6.1% Recall@1 improvement over state-of-the-art ViT-based baselines and reduces memory usage by up to 90%. These results indicate that HE-VPR offers a scalable and efficient solution for height-aware aerial VPR, enabling practical deployment in GNSS-denied environments. All the code and datasets for this work have been released at https://github.com/hmf21/HE-VPR.

Abstract:
Reinforcement learning (RL) has enabled robots to develop complex skills, but its success in image-based tasks often depends on effective representation learning. Prior works have primarily focused on 2D representations, often overlooking the inherent 3D geometric structure of the world, or have attempted to learn 3D representations that require extensive resources such as synchronized multi-view images even during deployment. To address these issues, we propose a novel RL framework that extracts 3D-aware representations from single-view RGB input, without requiring camera pose or synchronized multi-view images during the downstream RL. Our method employs an autoencoder architecture, using a masked Vision Transformer (ViT) as the encoder and a latent-conditioned Neural Radiance Field (NeRF) as the decoder, trained with cross-view completion to implicitly capture fine-grained, 3D geometry-aware representations. Additionally, we utilize a time contrastive loss that further regularizes the learned representation for consistency across different viewpoints, which enables viewpoint-robust robot manipulation. Our method significantly enhances the RL agent's performance in both simulation and real-world experiments, demonstrating superior effectiveness compared to prior 3D-aware representation-based methods, even when using only single-view RGB images during deployment.

Abstract:
Unthinking execution of human instructions in robotic manipulation can lead to severe safety risks, such as poisonings, fires, and even explosions. In this paper, we present responsible robotic manipulation, which requires robots to consider potential hazards in the real-world environment while completing instructions and performing complex operations safely and efficiently. However, such scenarios in the real world are variable and risky for training. To address this challenge, we propose Safety-as-policy, which includes (i) a world model to automatically generate scenarios containing safety risks and conduct virtual interactions, and (ii) a mental model to infer consequences with reflections and gradually develop the cognition of safety, allowing robots to accomplish tasks while avoiding dangers. Additionally, we create the SafeBox synthetic dataset, which includes one hundred responsible robotic manipulation tasks with different safety risk scenarios and instructions, effectively reducing the risks associated with real-world experiments. Experiments demonstrate that Safety-as-policy can avoid risks and efficiently complete tasks in both the synthetic dataset and real-world experiments, significantly outperforming baseline methods. Our SafeBox dataset shows consistent evaluation results with real-world scenarios, serving as a safe and effective benchmark for future research. Our code, data, and supplementary materials are available at: https://sites.google.com/view/safety-as-policy.

Abstract:
Efficient transmission of 3D point cloud data is critical for advanced perception in centralized and decentralized multi-agent robotic systems, especially nowadays with the growing reliance on edge and cloud-based processing. However, the large and complex nature of point clouds creates challenges under bandwidth constraints and intermittent connectivity, often degrading system performance. We propose a deep compression framework based on semantic scene graphs. The method decomposes point clouds into semantically coherent patches and encodes them into compact latent representations with semantic-aware encoders conditioned by Feature-wise Linear Modulation (FiLM). A folding-based decoder, guided by latent features and graph node attributes, enables structurally accurate reconstruction. Experiments on the SemanticKITTI and nuScenes datasets show that the framework achieves state-of-the-art compression rates, reducing data size by up to 98% while preserving both structural and semantic fidelity. In addition, it supports downstream applications such as multi-robot pose graph optimization and map merging, achieving trajectory accuracy and map alignment comparable to those obtained with raw LiDAR scans.
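
Feature-wise Linear Modulation scales and shifts each feature channel with parameters predicted from a conditioning vector. The sketch below shows that operation in NumPy; the linear layers, shapes, and the idea of using a semantic label embedding as the condition are assumptions for illustration, not the paper's encoder.

```python
import numpy as np

def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: gamma(condition) * features + beta(condition).

    features:  (num_points, channels) per-point features of one patch
    condition: (cond_dim,) conditioning vector, e.g. a semantic embedding
    """
    gamma = condition @ w_gamma + b_gamma  # per-channel scale
    beta = condition @ w_beta + b_beta     # per-channel shift
    return gamma * features + beta

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))
cond = rng.standard_normal(4)
out = film(
    feats, cond,
    w_gamma=rng.standard_normal((4, 8)), b_gamma=np.ones(8),
    w_beta=rng.standard_normal((4, 8)), b_beta=np.zeros(8),
)
print(out.shape)
```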

Abstract:
Open access to publications, software, and hardware is central to robotics: it lowers barriers to entry, supports reproducible science, and accelerates reliable system development. However, openness also exacerbates the inherent dual-use risks associated with research and innovation in robotics. It lowers barriers for states and non-state actors to develop and deploy robotics systems for military use and harmful purposes. Compared to other fields of engineering where dual-use risks are present, e.g., those that underlie the development of weapons of mass destruction (chemical, biological, radiological, and nuclear weapons) and even the field of AI, robotics offers no specific regulation and little guidance as to how research and innovation may be conducted and disseminated responsibly. While other fields can be used for guidance, robotics has its own needs and specificities, which have to be taken into account. The robotics community should therefore work toward its own set of sector-specific guidance and possibly regulation. To that end, we propose a roadmap focusing on four practices: a) education in responsible robotics; b) incentivizing risk assessment; c) moderating the diffusion of high-risk material; and d) developing red lines.

Abstract:
LiDAR depth cameras are widely used for accurate depth measurement in various applications. However, when multiple cameras operate simultaneously, mutual interference causes artifacts in the captured depth data, which existing image restoration methods struggle to handle. In this paper, we propose DRIM, a novel approach for real-time depth restoration under multi-device interference. Our method begins by distinguishing interference-induced artifacts, then predicts and leverages these artifacts to guide the restoration process. Since there is no existing dataset for learning interference in multiple LiDAR depth cameras, we create and provide the first depth interference dataset. Our experiments demonstrate superior depth restoration performance compared to other image restoration methods, achieving real-time processing speeds (�?3 FPS) that are significantly faster than existing approaches while showing the capability to restore depth in challenging scenarios. These results demonstrate that our proposed method effectively restores interfered depth in multiple LiDAR depth cameras with practical real-time performance. The dataset and code are available on the DRIM project page.

Abstract:
This letter presents a hierarchical planning approach to the vehicle routing and scheduling problem (VRSP) for marsupial robotic systems, a specialized class of heterogeneous robotic systems in which one type of mobile robot is capable of carrying another. While traditional VRSPs have been widely studied, the marsupial variant (MVRSP) has received relatively little attention. To address the NP-hard nature of MVRSP, this work introduces a hierarchical planning structure that decomposes the problem into two subproblems with reduced complexity: a high-level routing problem, formulated as a mixed-integer linear program (MILP), and a low-level scheduling problem, modeled in the Planning Domain Definition Language (PDDL). These subproblem solutions are integrated to generate complete mission plans. The proposed approach is validated through qualitative plan visualizations and quantitative Monte Carlo simulations in an autonomous subsea mapping scenario, where an unmanned surface vehicle carries multiple underwater vehicles. Results show that the hierarchical planner significantly improves both planning efficiency and solution quality compared to baseline methods.

Abstract:
This study focuses on modeling the transmission torque of two coaxial electrorheological (ER) fluid clutches through a data-driven approach. Instead of simplifying the viscosity term in the Bingham model to a constant, as conventional methods do, we introduce electric field-dependent nonlinearity into the viscosity term to better capture the complex rheological behavior of ER fluids. Based on this framework, we developed a heuristic explicit model (HEM) and a radial basis function model (RBFM) that incorporate the mechanical characteristics of the coaxial clutch structure. Furthermore, we explored direct estimation methods using a radial basis function network (RBFN) and a feedforward neural network (FNN) without relying on the Bingham model. Comparative evaluations with traditional ER models validated the effectiveness of our nonlinear formulations. Notably, the FNN approach demonstrated superior accuracy even with a single hidden layer containing only a few neurons, making it well-suited for real-time implementation with minimal computational overhead. Real-time validation across diverse operating conditions further confirmed the feasibility and robustness of the FNN-based method. These findings contribute new insights into ER fluid applications.
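
For reference, the contrast between the conventional and the field-dependent viscosity term can be written as follows. The symbols are assumptions chosen here (shear stress \tau, field-dependent yield stress \tau_y(E), viscosity \eta, shear rate \dot{\gamma}), and the specific functional form of \eta(E) used in the paper is not given in the abstract.

```latex
% Conventional Bingham model: constant plastic viscosity
\tau = \tau_y(E) + \eta \, \dot{\gamma}

% Variant described above: viscosity depends nonlinearly on the electric field E
\tau = \tau_y(E) + \eta(E) \, \dot{\gamma}
```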

Abstract:
Control Barrier Function (CBF) based quadratic programs (QPs) have become an effective method for enforcing safety in safety-critical systems and robotics. However, these methods often suffer from infeasibility or overly conservative relaxations when handling multiple constraints, potentially compromising safety. In this paper, we propose a hierarchical framework called "Safety-first" for control design, which simultaneously incorporates performance objectives formulated using Control Lyapunov Functions (CLFs), and safety guarantees via CBFs with input constraints. Unlike existing approaches, the proposed method guarantees solution feasibility while achieving improved performance, and it is scalable to an arbitrary number of CBF constraints. This scalability enables more precise and flexible representation of complex safety requirements using multiple simple CBFs. For application to mobile robot navigation, we employ Constrained Delaunay Triangulation (CDT) to construct multiple CBFs that approximate irregularly-shaped obstacles. Real-world experiments in cluttered and dynamic environments demonstrate that the Safety-first algorithm achieves safe navigation, validating both the theoretical guarantee and practical advantages over existing methods.
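
For context, the basic CBF-QP that this line of work builds on has a closed-form solution when there is a single affine constraint and no input bounds. The sketch below shows that baseline case only; it is not the proposed Safety-first hierarchy, and the single-integrator dynamics, obstacle, and gain are assumptions for illustration.

```python
import numpy as np

def cbf_filter(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a @ u + b >= 0 in closed form.

    For dynamics x' = f(x) + g(x) u and barrier h, a = L_g h and
    b = L_f h + alpha * h. With one constraint the QP is a projection
    onto a half-space.
    """
    slack = a @ u_nom + b
    if slack >= 0.0 or not np.any(a):
        # Already safe, or u cannot influence the constraint at this state.
        return u_nom
    return u_nom - (slack / (a @ a)) * a

# Assumed example: single integrator x' = u avoiding a disc obstacle.
center, radius, alpha = np.array([1.0, 0.0]), 0.5, 1.0
x = np.array([0.0, 0.0])
h = np.sum((x - center) ** 2) - radius ** 2  # h >= 0 outside the obstacle
a = 2.0 * (x - center)                       # L_g h for x' = u
b = alpha * h                                # L_f h = 0 for x' = u
u_nom = np.array([1.0, 0.0])                 # nominal input drives at the obstacle
u_safe = cbf_filter(u_nom, a, b)
print(u_safe)
```

With several such constraints the half-spaces can have an empty intersection once input limits are added, which is the infeasibility issue the abstract refers to.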

Abstract:
Grasp pose detection (GPD) is a fundamental capability for robotic autonomy, but its reliance on large, diverse datasets creates significant data privacy and centralization challenges. Federated Learning (FL) offers a privacy-preserving solution, but its application to GPD is hindered by the substantial communication overhead of large models, a key issue for resource-constrained robots. To address this, we propose a novel module-wise FL framework that begins by analyzing the learning dynamics of the GPD model's functional components. This analysis identifies slower-converging modules, to which our framework then allocates additional communication effort. This is realized through a two-phase process: a standard full-model training phase is followed by a communication-efficient phase where only an adaptively identified subset of slower-converging modules is trained and their partial updates are aggregated. Extensive experiments on the GraspNet-1B dataset demonstrate that our method outperforms standard FedAvg and other baselines, achieving higher accuracy for a given communication budget. Furthermore, real-world experiments on a physical robot validate our approach, showing a superior grasp success rate compared to baseline methods in cluttered scenes. Our work presents a communication-efficient framework for training robust, generalized GPD models in a decentralized manner, effectively improving the trade-off between communication cost and model performance.
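
The communication-efficient phase can be pictured as a server step that averages only the selected modules and leaves the rest of the global model untouched. The sketch below assumes equal client weighting and a model stored as a dictionary of arrays; the module names and the selection itself are hypothetical, since the abstract does not describe how slower-converging modules are identified.

```python
import numpy as np

def aggregate_modules(global_model, client_updates, selected):
    """Average the selected modules across clients; keep all others as they are.

    global_model:   dict mapping module name -> parameter array
    client_updates: list of dicts that contain at least the selected modules
    selected:       names of the modules communicated in this round
    """
    new_model = dict(global_model)
    for name in selected:
        new_model[name] = np.mean([c[name] for c in client_updates], axis=0)
    return new_model

global_model = {"backbone": np.zeros(3), "grasp_head": np.zeros(2)}
updates = [{"grasp_head": np.array([1.0, 1.0])}, {"grasp_head": np.array([3.0, 5.0])}]
new_model = aggregate_modules(global_model, updates, selected=["grasp_head"])
print(new_model["grasp_head"])
```

Only the arrays listed in `selected` need to be transmitted in such a round, which is where the communication saving would come from.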

Abstract:
The allure of lunar surface exploration and development has recently captured widespread global attention. Robots have proved to be indispensable for exploring uncharted terrains, uncovering and leveraging local resources, and facilitating the construction of future human habitats. In this article, we introduce the modular and on-demand reconfigurable robot (MoonBot), a modular and reconfigurable robotic system engineered to maximize functionality while operating within the stringent mass constraints of lunar payloads and adapting to varying environmental conditions and task requirements. This article details the design and development of MoonBot and presents a preliminary field demonstration that validates the proof of concept through the execution of milestone tasks simulating the establishment of lunar infrastructure. These tasks include essential civil engineering operations, infrastructural component transportation and deployment, and assistive operations with inflatable modules. Furthermore, we systematically summarize the lessons learned during testing, focusing on the connector design and providing valuable insights for the advancement of modular robotic systems in future lunar missions.

Abstract:
Humanoid dexterous hands have significant potential in prosthetics, service robotics, and high-performance manipulation. However, existing designs often struggle to balance the challenging requirements of lightweight design, high biomimicry, personalized customization, and low cost. To address these challenges, we present the CYJ Hand (Customize-Your-Joy Hand), an innovative 22-DOF humanoid dexterous hand. Featuring a highly biomimetic structure, the CYJ system weighs only 750 grams (forearm included). Its modular design supports user-oriented customization while simplifying assembly, maintenance, and functional expansion. Inspired by Da Vinci's mechanics, the CYJ Hand integrates a novel, controllable tendon mechanism that allows for a reconfigurable tendon routing and actuation system to meet diverse needs. Constructed with 3D printing and affordable commercial materials, the hardware cost for the CYJ Hand structure (excluding actuators) is under 60. Experimental results demonstrate that the CYJ Hand achieves a 100% success rate in both the Kapandji Test and GRASP taxonomy, and further exhibits dynamic grasping, in-hand manipulation, sub-millimeter motion repeatability (~0.7 mm), and reliable load-bearing performance, validating its exceptional dexterity and biomimetic design. With its comprehensive advantages and innovations, the CYJ Hand provides a versatile platform for future applications and research in personalized prosthetics and dexterous robotic manipulation, bridging the gap between high dexterity and accessibility in humanoid robotics. Related files and methods are open-sourced in a GitHub repository.

Abstract:
Coordinate Measuring Machines (CMMs) are widely used for high-precision inspection of industrial parts, particularly in scenarios where visual systems are ineffective or cost-prohibitive. However, conventional CMMs rely on CAD model priors and user-defined probing paths, which limit their applicability and efficiency in measuring freeform parts. To overcome these limitations, we present a fully autonomous, CAD model-free, tactile-based framework that enables dense 3D shape reconstruction to facilitate subsequent measurements. Our approach leverages a dual Gaussian Process Implicit Surface architecture, termed Exploration-Reconstruction GPIS (ER-GPIS), which enables both high-fidelity shape reconstruction and uncertainty estimation on the object's surface. A hybrid exploration motion planner is then employed to adaptively sample surface geometries by integrating local surface exploration, global exploration, and contact recovery policies for robust shape estimation. Extensive real-world experiments demonstrate that the proposed method effectively reconstructs object geometries across diverse shapes, highlighting its ability to autonomously reconstruct and measure both surfaces and internal features without relying on CAD model priors.

Abstract:
The Mobile Object Manipulation Operator (MOMO) is an innovative and reconfigurable robotic system that transforms traditional serving robots into mobile manipulators. Leveraging the form factor and mobility of serving robots, MOMO integrates up to three pluggable devices, including 6-DoF manipulators of varying sizes and a 3-DoF sensor head. Its design incorporates two independent shoulder lifts to enhance vertical reach. The adaptability of the system tailors its capabilities to tasks beyond simple object transportation. Unlike current food delivery robots, MOMO can remove obstructions from the floor and deliver items to recipients without human intervention. This paper provides a comprehensive analysis of MOMO's hardware and software components, emphasizing its modular design and adaptability for complex applications.

Abstract:
Existing end-to-end autonomous driving methods typically rely on imitation learning (IL) but face a key challenge: the misalignment between open-loop training and closed-loop deployment. This misalignment often triggers driver-initiated takeovers and system disengagements during closed-loop execution. How to leverage the expert takeover data from these disengagement scenarios and effectively expand the IL policy's capability is a valuable yet unexplored challenge. In this paper, we propose TakeAD, a novel preference-based post-optimization framework that fine-tunes the pre-trained IL policy with this disengagement data to enhance the closed-loop driving performance. First, we design an efficient expert takeover data collection pipeline inspired by human takeover mechanisms in real-world autonomous driving systems. Then, this post-optimization framework integrates iterative Dataset Aggregation (DAgger) for imitation learning with Direct Preference Optimization (DPO) for preference alignment. The DAgger stage equips the policy with fundamental capabilities to handle disengagement states through direct imitation of expert interventions. Subsequently, the DPO stage refines the policy's behavior to better align with expert preferences in disengagement scenarios. Through multiple iterations, the policy progressively learns recovery strategies for disengagement states, thereby mitigating the open-loop gap. Experiments on the closed-loop Bench2Drive benchmark demonstrate our method's effectiveness compared with pure IL methods, with comprehensive ablations confirming the contribution of each component.
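
The DPO stage optimizes a standard pairwise objective, which can be sketched for a single preference pair as below. The log-probabilities would come from the fine-tuned policy and a frozen reference policy; treating the expert takeover behavior as the preferred sample and the value of beta are assumptions here, as the abstract does not say how pairs are formed.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Computes -log(sigmoid(beta * margin)), where the margin compares how much
    the policy has raised the preferred sample relative to the reference
    against how much it has raised the rejected one.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# A policy identical to the reference has zero margin, giving a loss of log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Raising the preferred sample and lowering the rejected one reduces the loss.
print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```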

Abstract:
This paper introduces a control framework that leverages Lagrangian neural networks (LNNs) for computed torque control (CTC) of robotic systems with unknown dynamics. Unlike prior LNN-based controllers that are placed outside the feedback-linearization framework (e.g., feedforward), we embed an LNN inverse-dynamics model within a CTC loop, thereby shaping the closed-loop error dynamics. This strategy, referred to as LNN-CTC, ensures a physically consistent model and improves extrapolation, requiring neither prior model knowledge nor extensive training data. The approach is experimentally validated on a robotic arm with four degrees of freedom and compared with conventional model-based CTC, physics-informed neural network (PINN)-CTC, deep neural network (DNN)-CTC, an LNN-based feedforward controller, and a PID controller. Results demonstrate that LNN-CTC significantly outperforms model-based baselines by up to 30% in tracking accuracy, achieving high performance with minimal training data. In addition, LNN-CTC outperforms all other evaluated baselines in both tracking accuracy and data efficiency, attaining lower joint-space RMSE for the same training data. The findings highlight the potential of physics-informed neural architectures to generalize robustly across various operating conditions and contribute to narrowing the performance gap between learned and classical control strategies.
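
For orientation, the textbook computed torque control law that such a loop builds on can be written as below. This is a sketch in assumed notation, not the paper's own formulation: hats mark terms that would come from the learned inverse-dynamics model, q_d is the desired joint trajectory, and K_p, K_d are positive-definite gain matrices.

```latex
\tau = \hat{M}(q)\,\bigl(\ddot{q}_d + K_d\,\dot{e} + K_p\,e\bigr)
       + \hat{C}(q,\dot{q})\,\dot{q} + \hat{g}(q),
\qquad e = q_d - q

% If the learned terms match the true dynamics M\ddot{q} + C\dot{q} + g = \tau,
% substituting the control law gives linear closed-loop error dynamics:
\ddot{e} + K_d\,\dot{e} + K_p\,e = 0
```

Under this reading, model error in the hatted terms appears as a disturbance on the otherwise linear error dynamics, which is why the accuracy and extrapolation of the learned model matter for tracking.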

Abstract:
The development of bionic underwater robots has brought new vitality to ocean exploration. Motion control is crucial for the stability of underwater robots due to significant differences in flow field characteristics at various swimming speeds. This study focuses on vertical-plane motion and proposes a model predictive control method to achieve integrated control of depth position and pitch attitude for bionic robotic fish. First, based on a robotic tuna system, high-maneuverability vertical-plane motion configuration elements are analyzed and summarized, laying the foundation for motion stability and controllability. Second, through hydrodynamic sampling in aquatic environments, a system model covering the range of swimming speeds is established. Regarding the control method, the proposed motion planning approach converts the desired motion sequence into an equivalent pitch-depth trajectory curve. A nonlinear model predictive controller (NMPC) is then designed to track the trajectory curve, ultimately achieving the desired vertical-plane motion. Experimental results validate that the proposed method not only ensures control accuracy under both low and high-speed conditions, but also enables the execution of complex motion sequence control. This study provides a fresh perspective on the motion instability analysis of robotic fish at high swimming speed and a novel control framework for regulating continuous posture sequences in the vertical plane.

Abstract:
We propose M3Bench, a new benchmark for whole-body motion generation in mobile manipulation tasks. Given a 3D scene context, M3Bench requires an embodied agent to reason about its configuration, environmental constraints, and task objectives to generate coordinated whole-body motion trajectories for object rearrangement. M3Bench features 30,000 object rearrangement tasks across 119 diverse scenes, providing expert demonstrations generated by our newly developed M3BenchMaker, an automatic data generation tool that produces whole-body motion trajectories from high-level task instructions using only basic scene and robot information. Our benchmark includes various task splits to evaluate generalization across different dimensions and leverages realistic physics simulation for trajectory assessment. Extensive evaluation analysis reveals that state-of-the-art models struggle with coordinating base-arm motion while adhering to environmental and task-specific constraints, underscoring the need for new models to bridge this gap. By releasing M3Bench and M3BenchMaker at https://zeyuzhang.com/papers/m3bench, we aim to advance robotics research toward more adaptive and capable mobile manipulation in diverse, real-world environments.

Abstract:
Omnidirectional aerial manipulators (OAMs) must coordinate a floating base and onboard arm to track end-effector trajectories under coupled geometric and dynamic constraints. In cluttered long-horizon tasks, collision-free motions may still be dynamically inadmissible, and dense time-layered planning graphs can become disconnected. We present a GPU-accelerated whole-body planning framework that combines reverse-chain node sampling, collision- and wrench-aware feasibility filtering, guide-based connectivity refinement, and dynamic programming on a time-layered directed acyclic graph. At each timestep, the target end-effector pose is converted to a wrist-anchored representation, from which large batches of whole-body candidates are generated in parallel on the GPU. Sampled nodes are pruned using self/environment collision checks and rotor-allocation-based wrench-feasibility tests. When local disconnections remain, sparse guides trigger targeted dense resampling to recover connectivity. On drawing and peg-in-hole tasks, the proposed method achieves 91.7% and 100.0% dynamic-safe runs, versus 0.0% and 8.3% for a wrench-unconstrained baseline. Guide refinement reduces mean broken layers by 200.2 and 49.6, yielding 100 percentage-point success gains over no refinement. GPU batching keeps planning practical for long-horizon cluttered OAM problems.

Abstract:
This paper presents a novel ankle rehabilitation platform based on an inclined dual-cylinder mechanism that provides 2-DoF motion through geometric coupling, without complex multi-link structures. Two cylinders sharing a 9° inclined contact surface are driven by two stepper motors, enabling simultaneous dorsiflexion/plantarflexion and inversion/eversion of up to 18° in each axis. The platform provides both a passive mode, which follows predefined trajectories, and an active mode, which captures user intent through center-of-pressure estimation using a force-sensing resistor-based insole. A Particle Swarm Optimization-tuned PD controller is used in both modes, achieving an RMS tracking error below 0.35° in experimental validation. An IMU-integrated gamification environment further demonstrates the feasibility of the platform as an interactive active training system.

Abstract:
A multi-modal dataset was constructed in a real orchard environment under leaf-off conditions using an RGB-D camera and LiDAR, enabling clear observation of branch and trunk structures. The complementary geometric information from both sensors allows for more precise 3D structural reconstruction. Dense point clouds obtained from the RGB-D camera are fused with LiDAR point clouds via ICP registration, followed by ground removal and DBSCAN clustering to segment individual trees. AdTree is then applied to each segmented tree to extract the 3D skeletal structure and generate ground truth (GT). The constructed GT explicitly represents the hierarchical branch structure of each tree, and additional data collection under leaf-on conditions is planned to enable quantitative evaluation of skeleton extraction performance across varying foliage conditions. Furthermore, the constructed dataset will be utilized for training and evaluation of a Flow Matching-based generative model for tree skeletonization. Flow Matching enables stable skeleton reconstruction even from noisy and heavily occluded point clouds in real orchard environments, and the dataset is expected to facilitate quantitative analysis of performance differences between leaf-off and leaf-on conditions.

Abstract:
This paper proposes a Planar-Sector Line-of-Sight (PS-LOS) guidance law and an accompanying control stack for lifting-wing quadcopters, enabling robust image-based interception of agile targets. The PS-LOS relaxes conventional conical constraints, preserving maneuverability while reducing aerodynamic penalties. For perception, we employ a delay-compensated Extended Kalman Filter (EKF) to provide low-latency, continuous target estimates. The controller is tailored to lifting-wing quadcopter dynamics and includes coordinated-turn compensation. Outdoor flight experiments demonstrate interceptions against unpredictable agile targets up to 138 m, validating the method's range and robustness.

Abstract:
Humanoid robots have recently achieved impressive progress in locomotion and whole-body control, yet they remain constrained in tasks that demand rapid interaction with dynamic environments through manipulation. Table tennis exemplifies such a challenge: with ball speeds exceeding 5 m/s, players must perceive, predict, and act within sub-second reaction times, requiring both agility and precision. To address this, we present a hierarchical framework for humanoid table tennis that integrates a model-based planner for ball trajectory prediction and racket target planning with a reinforcement learning-based whole-body controller. The planner determines striking position, velocity and timing, while the controller generates coordinated arm and leg motions that mimic human strikes and maintain stability and agility across consecutive rallies. Moreover, to encourage natural movements, human motion references are incorporated during training. We validate our system on a general-purpose humanoid robot, achieving up to 106 consecutive shots with a human opponent and sustained exchanges against another humanoid. These results demonstrate real-world humanoid table tennis with sub-second reactive control, marking a step toward agile and interactive humanoid behaviors.

Abstract:
Multi-step manipulation in dynamic environments remains challenging. Imitation learning (IL) is reactive but lacks compositional generalization, since monolithic policies do not decide which skill to reuse when scenes change. Classical task-and-motion planning (TAMP) offers compositionality, but its high planning latency prevents real-time failure recovery. We introduce SymSkill, a unified framework that jointly learns predicates, operators, and skills from unlabeled, unsegmented demonstrations, combining compositional generalization with real-time recovery. Offline, SymSkill learns symbolic abstractions and goal-oriented skills directly from demonstrations. Online, given a conjunction of learned predicates, it uses a symbolic planner to compose and reorder skills to achieve symbolic goals while recovering from failures at both the motion and symbolic levels in real time. Coupled with a compliant controller, SymSkill supports safe execution under human and environmental disturbances. In RoboCasa simulation, SymSkill executes 12 single-step tasks with 85% success and composes them into multi-step plans without additional data. On a real Franka robot, it learns from 5 minutes of play data and performs 12-step tasks from goal specifications. Code and additional analysis are available at https://sites.google.com/view/symskill.

Abstract:
Inverse Reinforcement Learning (IRL) typically involves inferring a reward function from expert demonstrations to enable agents to imitate the demonstrated behavior. However, real-world settings often provide suboptimal and heterogeneous demonstrations, where human demonstrators use diverse strategies and imperfect actions. To our knowledge, no prior work addresses IRL from demonstrations that are simultaneously heterogeneous and suboptimal. In this work, we propose a novel approach, REPRESENT (Reward dEcomPosition fRom hEterogeneous Suboptimal dEmoNstraTion), that disentangles the latent intrinsic task reward and the strategy-specific reward from suboptimal and diverse strategies. Our method learns to identify a shared task reward component that generalizes across varying demonstrator preferences while also modeling distinct strategy-specific rewards. By decomposing the common task reward across varied demonstrations, REPRESENT extracts the core objectives shared by all strategies, enabling the agent to perform better than the demonstrators while preserving individual strategy preferences. We validate our approach on three robotic domains, showing a higher correlation with the true task reward and improved policy performance compared to baselines. These results suggest that REPRESENT can effectively handle suboptimality and heterogeneity, providing a solution for real-world learning-from-demonstration (LfD) applications to better learn from demonstrations that vary in quality and strategy.

Abstract:
We aim to enable humanoid robots to efficiently solve new manipulation tasks from a few video examples. In-context learning (ICL) is a promising framework for achieving this goal due to its test-time data efficiency and rapid adaptability. However, current ICL methods rely on labor-intensive teleoperated data for training, which restricts scalability. We propose using human play videos (continuous, unlabeled videos of people interacting freely with their environment) as a scalable and diverse training data source. We introduce MimicDroid, which enables humanoids to perform ICL using human play videos as the only training data. MimicDroid extracts trajectory pairs with similar manipulation behaviors and trains the policy to predict the actions of one trajectory conditioned on the other. Through this process, the model acquires ICL capabilities for adapting to novel objects and environments at test time. To bridge the embodiment gap, MimicDroid first retargets human wrist poses estimated from RGB videos to the humanoid, leveraging kinematic similarity. It also applies random patch masking during training to reduce overfitting to human-specific cues and improve robustness to visual differences. To evaluate few-shot learning for humanoids, we introduce an open-source simulation benchmark with increasing levels of generalization difficulty. MimicDroid outperforms state-of-the-art methods and achieves nearly a twofold higher success rate in the real world. Additional materials can be found at ut-austin-rpl.github.io/MimicDroid

Abstract:
Humans naturally swing their arms during locomotion to regulate whole-body dynamics, reduce angular momentum, and help maintain balance. Inspired by this principle, we present a limb-level multi-agent reinforcement learning (RL) framework that enables coordinated whole-body control of humanoid robots through emergent arm motion. Our approach employs separate actor-critic structures for the arms and legs, trained with centralized critics but decentralized actors that share only base states and centroidal angular momentum (CAM) observations, allowing each agent to specialize in task-relevant behaviors through modular reward design. The arm agent, guided by CAM tracking and damping rewards, promotes arm motions that reduce overall angular momentum and vertical ground reaction moments, contributing to improved balance during locomotion or under external perturbations. Comparative studies with single-agent and alternative multi-agent baselines further validate the effectiveness of our approach. Finally, we deploy the learned policy on the MIT Humanoid, achieving robust performance across diverse locomotion tasks, including flat-ground walking, rough terrain traversal, and stair climbing.

Abstract:
Visual-Inertial Odometry (VIO) provides smooth, high-rate state estimates and has been widely used for robotic navigation in both terrestrial and planetary applications. However, its performance is typically dependent on the frequency of visual updates, which is a challenge for planetary rovers operating under extreme resource constraints and low frame rates. This work investigates enabling reliable VIO with very sparse visual updates for lunar rover applications, addressing both day and night-time operations where feature associations become especially difficult under self-illumination conditions. We propose a Bird's-Eye-View (BEV)-based image matching scheme that remains robust to larger inter-frame motions and yields more reliable feature matching despite significant visual appearance changes. We extensively evaluate our proposed approach, BEVIO, through high-fidelity photorealistic lunar simulations and real-time robotic experiments conducted using a half-scale lunar rover in a long-term day-night deployment at Plaster City, CA, USA. The results demonstrate that our method enables reliable day and nighttime self-illuminated traverses at visual update rates as low as 0.25 Hz, underscoring its suitability for navigation on power- and compute-limited lunar rovers.

Abstract:
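Enabling robots to quickly master manipulation skills is a major challenge, constrained by the lifespan and safety requirements of physical equipment. Currently, reinforcement learning techniques are used to address dynamic, unstructured problem scenarios involving rich contact. However, these algorithms usually converge slowly because of the high dimensionality of the robot state-action mapping space and the broad initial policy search space. Meanwhile, advances in large language models (LLMs) have given these models a degree of logical reasoning ability, enabling them to take on proactive, goal-directed action tasks in the robot's early stages. These models can implicitly generate state features that reveal latent patterns in trajectory generation. However, for manipulation tasks involving complex contact-rich scenarios, LLMs still fall short. Therefore, integrating the powerful interaction capabilities of …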

Abstract:
We propose an online iterative algorithm to optimize a convex cover to under-approximate the free space for autonomous navigation to delineate Safe Flight Corridors (SFC). The convex cover consists of a set of polytopes such that the union of the polytopes represents obstacle-free space, allowing us to find trajectories for robots that lie within the convex cover. In order to find the SFC that facilitates trajectory optimization, we iteratively find overlapping polytopes of maximum volumes that include specified waypoints initialized by a geometric or kinematic planner. Constraints at waypoints appear in two alternating stages of a joint optimization problem, which is solved by a method inspired by the Alternating Direction Method of Multipliers (ADMM) with partially distributed variables. We validate the effectiveness of our proposed algorithm using a range of parameterized environments and show its applications for two-stage motion planning.

Abstract:
Despite a large body of research on robot learning, it has not yet been thoroughly studied how collaborating humans and robots learn reciprocally. In such situations, both humans and robots continuously learn about each other and the task through interaction. This paper addresses the research question: How can human-robot co-learning be facilitated in physically embodied collaborative tasks? First, we derived five requirements for successful human-robot co-learning from the literature: shared goal, synchrony, interdependence, adaptability, and transparency. Based on these requirements, we designed a collaborative human-robot handover task and a robot Q-learning method. In an evaluation with six human participants, co-learning was indeed found to emerge in the handover task. In particular, for three of the human-robot dyads, our designed setup proved to facilitate co-learning in a way that met all five requirements. The task and robot learning method presented in this paper demonstrate how human-robot co-learning can be enabled in physically embodied tasks.
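
As background for the robot learning method mentioned above, here is a minimal sketch of the standard tabular Q-learning update. The state and action names, learning rate, and discount factor are hypothetical choices for illustration; the abstract does not describe the state space, reward design, or hyperparameters actually used in the handover task.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical handover example: the robot releases the object while the human is reaching.
q = defaultdict(float)            # unseen state-action pairs start at 0.0
actions = ["hold", "release"]     # assumed action set
new_value = q_update(q, "human_reaching", "release", 1.0, "handed_over", actions)
```

Because each interaction with the human yields a new reward signal, the table keeps adapting as the human partner changes behavior, which is one simple way a robot can take part in reciprocal learning.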

Abstract:
Semantic segmentation is crucial for autonomous navigation in off-road environments, enabling precise classification of surroundings to identify traversable regions. However, distinctive factors inherent to off-road conditions, such as source-target domain discrepancies and sensor corruption from rough terrain, can result in distribution shifts that cause the data to deviate from the training conditions. This often leads to inaccurate semantic label predictions and subsequent failures in navigation tasks. To address this, we propose ST-Seg, a novel framework that expands the source distribution through style expansion (SE) and texture regularization (TR). Unlike prior methods that implicitly apply generalization within a fixed source distribution, ST-Seg offers an intuitive approach to handling distribution shift. Specifically, SE broadens domain coverage by generating diverse realistic styles, augmenting the limited style information of the source domain. TR stabilizes local texture representation affected by style-augmented learning through a deep texture manifold. Experiments across various distribution-shifted target domains demonstrate the effectiveness of ST-Seg, with substantial improvements over existing methods. These results highlight the robustness of ST-Seg, enhancing the real-world applicability of semantic segmentation for off-road navigation.

Abstract:
An active approach to fault tolerance is essential for robot swarms to achieve long-term autonomy. Previous efforts have focused on responding to spontaneous electro-mechanical faults and failures. However, many faults occur gradually over time. This work argues that the principles of predictive maintenance, in which potential faults are resolved before they hinder the operation of the swarm, offer a promising means of achieving long-term fault tolerance. This is a novel approach to swarm fault tolerance, which is shown to give a comparable or improved performance when tested against a reactive approach in almost all cases tested.

Abstract:
Real-time motion generation -- which is essential for achieving reactive and adaptive behavior -- under kinodynamic constraints for high-dimensional systems is a crucial yet challenging problem. We address this with a two-step approach: offline learning of a lower-dimensional trajectory manifold of task‑relevant, constraint‑satisfying trajectories, followed by rapid online search within this manifold. Extending the discrete‑time Motion Manifold Primitives (MMP) framework, we propose Differentiable Motion Manifold Primitives (DMMP), a novel neural network architecture that encodes and generates continuous‑time, differentiable trajectories, trained using data collected offline through trajectory optimizations, with a strategy that ensures constraint satisfaction -- absent in existing methods. Experiments on dynamic throwing with a 7‑DoF robot arm demonstrate that DMMP outperforms prior methods in planning speed, task success, and constraint satisfaction.

Abstract:
In recent decades, mobile robot motion planning has seen significant advancements. Both search-based and sampling-based methods have demonstrated capabilities to find feasible solutions in complex scenarios. Mainstream path planning algorithms divide the map into occupied and free spaces, considering only planar movement and ignoring the ability of mobile robots to traverse obstacles in the z-direction. Additionally, the generated paths often have numerous bends, requiring additional smoothing post-processing. In this work, we propose a fast and direct motion planning method based on continuous curvature integration that takes into account the robot's obstacle-crossing ability under different parameter settings. This method generates smooth paths directly with pseudo-constant velocity and limited curvature, and performs curvature-based speed planning in complex 2.5-D terrain-based environments (accounting for the ups and downs of the terrain), eliminating the subsequent path smoothing process and enabling the robot to track the generated path directly. The proposed method is also compared with some existing approaches in terms of solution time, path length, memory usage and smoothness under multiple scenarios. The proposed method is vastly superior to the average performance of state-of-the-art (SOTA) methods, especially in terms of the self-defined S_2 smoothness (mean angle of steering). Furthermore, simulations and experiments are conducted on our self-designed wheel-legged robot with 2.5-D traversability. These results demonstrate the effectiveness and superiority of the proposed approach in several representative environments. The implementation of this work is available at https://github.com/SkelonChan/GPCC_curvature_planning.

Abstract:
Supernumerary robotic limbs (SRL) offer substantial potential in both the rehabilitation of hemiplegic patients and the enhancement of functional capabilities for healthy individuals. Designing a general-purpose SRL device is inherently challenging, particularly when developing a unified theoretical framework that meets the diverse functional requirements of both upper and lower limbs. In this paper, we propose a multi-objective optimization (MOO) design theory that integrates grasping workspace similarity, walking workspace similarity, bracing force for sit-to-stand (STS) movements, and overall mass and inertia. To facilitate rapid and stable convergence of the model to high-dimensional irregular Pareto fronts, we introduce a multi-subpopulation correction firefly algorithm. The optimized solution is used to redesign the prototype for experimentation to meet specified requirements. Six healthy participants and two hemiplegic patients participated in real experiments. Compared to the pre-optimization results, the average grasp success rate improved by 7.2%, while muscle activity during walking and STS tasks decreased by an average of 12.7% and 25.1%, respectively, following the optimization.

Abstract:
Visual Language Navigation (VLN) powered robots have the potential to guide blind people by understanding route instructions provided by sighted passersby. This capability allows robots to operate in environments that are often unknown a priori. Existing VLN models are insufficient for the scenario of navigation guidance for blind people, as they need to understand routes described from human memory, which frequently contain stutters, errors, and omissions of details, as opposed to those obtained by thinking out loud, such as in the R2R dataset. However, existing benchmarks do not contain instructions obtained from human memory in natural environments. To this end, we present our benchmark, Memory-Maze, which simulates the scenario of seeking route instructions for guiding blind people. Our benchmark contains a maze-like structured virtual environment and novel route instruction data from human memory. Our analysis demonstrates that instruction data collected from memory was longer and contained more varied wording. We further demonstrate that addressing errors and ambiguities from memory-based instructions is challenging, by evaluating state-of-the-art models alongside our baseline model with modularized perception and controls.

Abstract:
Effective robotic manipulation relies on a precise understanding of 3D scene geometry, and one of the most straightforward ways to acquire such geometry is through multi-view observations. Motivated by this, we present GP3, a 3D geometry-aware robotic manipulation policy that leverages multi-view input. GP3 employs a spatial encoder to infer dense spatial features from RGB observations, which enable the estimation of depth and camera parameters, leading to a compact yet expressive 3D scene representation tailored for manipulation. This representation is fused with language instructions and translated into continuous actions via a lightweight policy head. We further introduce G-FiLM, which applies language-conditioned FiLM only to cross-view global attention. Comprehensive experiments demonstrate that GP3 consistently outperforms state-of-the-art methods on simulated benchmarks. Furthermore, GP3 transfers effectively to real-world robots in depth-challenging scenes with only minimal fine-tuning. These results highlight GP3 as a practical, sensor-agnostic solution for geometry-aware robotic manipulation.

Abstract:
Object Navigation (ObjectNav) has made great progress with large language models (LLMs), but still faces challenges in memory management, especially in long-horizon tasks and dynamic scenes. To address this, we propose TopoNav, a new framework that leverages topological structures as spatial memory. By building and updating a topological graph that captures scene connections, adjacency, and semantic meaning, TopoNav helps agents accumulate spatial knowledge over time, retrieve key information, and reason effectively toward distant goals. Our experiments show that TopoNav achieves state-of-the-art performance on benchmark ObjectNav datasets, with higher success rates and more efficient paths. It particularly excels in diverse and complex environments, as it connects temporary visual inputs with lasting spatial understanding.

Abstract:
Robotic hands, prosthetics, and hand exoskeletons struggle to replicate the natural dexterity of human hands: the mechanical intelligence of our muscles can hardly be replicated with rigid actuators, while soft mechanisms compromise precision. Underactuated hand mechanisms represent a trade-off between these extremes. However, single-motor solutions, while robust and compact, generally actuate a maximum of four fingers or present critical differences in force transmission between the fingers. Here, we propose a design for a single-input, five-output differential gearbox that delivers balanced transmission thanks to a unique asymmetrical layout. This feature enables adaptive grasp control through mechanical intelligence only, providing the user with a reliable, safe, and lightweight solution for tendon-driven hand mechanisms. A preliminary 3D-printed prototype is presented to demonstrate the concept.

Abstract:
Diffusion models offer powerful generative capabilities for robot trajectory planning, yet their practical deployment on robots is hindered by a critical bottleneck: reliance on imitation learning from expert demonstrations. This paradigm is often impractical for specialized robots where data is scarce and creates an inefficient, theoretically suboptimal training pipeline. To overcome this, we introduce PegasusFlow, a hierarchical rolling-denoising framework that enables direct and parallel sampling of trajectory score gradients from environmental interaction, completely bypassing the need for expert data. Our core innovation is a novel sampling algorithm, Weighted Basis Function Optimization (WBFO), which leverages spline basis representations to achieve superior sample efficiency and faster convergence compared to traditional methods like MPPI. The framework is embedded within a scalable, asynchronous parallel simulation architecture that supports massively parallel rollouts for efficient data collection. Extensive experiments on trajectory optimization and robotic navigation tasks demonstrate that our approach, particularly Action-Value WBFO (AVWBFO) combined with a reinforcement learning warm-start, significantly outperforms baselines. In a challenging barrier-crossing task, our method achieved a 100% success rate and was 18% faster than the next-best method, validating its effectiveness for complex terrain locomotion planning. https://masteryip.github.io/pegasusflow.github.io/

Abstract:
This paper introduces the CareBot-H Robot, a humanoid nursing robot designed to perform patient transfer tasks in confined environments. The robot is equipped with biomimetic arms that replicate human arm size and function, and distributed tactile sensors that enhance operational safety during physical contact. To achieve stable and anthropomorphic motion, a trajectory deformation algorithm is proposed. The method comprises an offline phase, where expert demonstrations are encoded into prior trajectories using a Variational Autoencoder (VAE), and an online phase, where a tactile-informed Zero-Moment Point (ZMP) model enables real-time trajectory adjustment. Experimental validation with human participants demonstrates that the proposed approach outperforms manual teleoperation, producing smoother and more efficient transfer trajectories while significantly reducing deviations between actual and ideal ZMP. These results indicate that the CareBot-H achieves reliable and safe patient transfer performance, offering practical potential for deployment in real-world nursing care scenarios.

Abstract:
Detecting and estimating distances to power lines is a challenge for both human UAV pilots and autonomous systems, which increases the risk of unintended collisions. We present a mmWave radar-based perception system that provides spherical sensing coverage around a small UAV for robust power line detection and avoidance. The system integrates multiple compact solid-state mmWave radar modules to synthesize an omnidirectional field of view while remaining lightweight. We characterize the sensing behavior of this omnidirectional radar arrangement in power line environments and develop a robust detection-and-avoidance algorithm tailored to that behavior. Field experiments on real power lines demonstrate reliable detection at ranges up to 10 m, successful avoidance maneuvers at flight speeds upwards of 10 m/s, and detection of wires as thin as 1.2 mm in diameter. These results indicate the approach's suitability as an additional safety layer for both autonomous and manual UAV flight.

Abstract:
A cable-driven continuum robot with high redundancy is capable of performing a tip trajectory tracking task while simultaneously satisfying additional safety constraints, such as joint limits or external obstacles in the environment. To address these challenges, efficient motion planning methods are required. This paper proposes a quadratic-programming-based method in conjunction with distance computation based on convex polytopes. Our methodology integrates safety constraints based on the robot's posture states, thus enabling obstacle avoidance in dynamic situations. Simulation outcomes demonstrate effective trajectory tracking in the presence of various objects and provide a comprehensive performance evaluation based on the generated robot state. Finally, a real-world experiment was conducted on a prototype of a three-segment cable-driven continuum manipulator, which confirmed the efficacy of the proposed obstacle avoidance approach. The approach is versatile and can be adapted to similar multi-segment cable-driven continuum robotic systems by adjusting the robot parameters, enabling successful tip trajectory tracking under complex obstacle conditions.

Abstract:
This paper introduces a novel design for a robotic hand based on parallel mechanisms. The proposed hand uses a triple-symmetric Bricard linkage as its reconfigurable palm, enhancing adaptability to objects of varying shapes and sizes. Through topological and dimensional synthesis, the mechanism achieves a well-balanced degree of freedom and link configuration suitable for reconfigurable palm motion, balancing dexterity, stability, and load capacity. Furthermore, kinematic analysis is performed using screw theory and closed-loop constraints, and performance is evaluated based on workspace, stiffness, and motion/force transmission efficiency. Finally, a prototype is developed and tested through a series of grasping experiments, demonstrating the ability to perform stable and efficient manipulation across a wide range of objects. The results validate the effectiveness of the design in improving grasping versatility and operational precision, offering a promising solution for advanced robotic manipulation tasks.

Abstract:
During surgery, scrub nurses are required to frequently deliver surgical instruments to surgeons, which can lead to physical fatigue and decreased focus. Robotic scrub nurses provide a promising solution that can replace repetitive tasks and enhance efficiency. Existing research on robotic scrub nurses relies on predefined paths for instrument delivery, which limits their generalizability and poses safety risks in dynamic environments. To address these challenges, we present a collision-free dual-arm surgical assistive robot capable of performing instrument delivery. A vision-language model is utilized to automatically generate the robot's grasping and delivery trajectories in a zero-shot manner based on surgeons' instructions. A real-time obstacle minimum distance perception method is proposed and integrated into a unified quadratic programming framework. This framework ensures reactive obstacle avoidance and self-collision prevention during the dual-arm robot's autonomous movement in dynamic environments. Extensive experimental validations demonstrate that the proposed robotic system achieves an 83.33% success rate in surgical instrument delivery while maintaining smooth, collision-free movement throughout all trials. The project page and source code are available at https://give-me-scissors.github.io/.

Abstract:
Despite its resilience in adverse weather, millimeter-wave (mmWave) radar yields sparse and noisy point clouds that limit its perception and localization performance. Diffusion models have recently gained attention for enhancing mmWave radar in perception tasks due to their strong denoising and generative capabilities. Yet, the enhanced radar point cloud still falls short of expectations due to a lack of texture information and errors caused by the inherent sensor-model mismatch between LiDAR and radar. In this paper, we propose an adaptive vision-aided radar data enhancement method based on a conditional diffusion model for denoising and densifying radar point clouds. The pipeline decomposes mmWave radar into depth and BEV views, fuses the depth view with synchronized images, and uses the fused features together with BEV tokens to condition the diffusion model. LiDAR is used only for training supervision, not for inference. Extensive experiments demonstrate that our proposed method produces dense and geometrically consistent radar point clouds, validating the effectiveness of the introduced vision aid for radar enhancement. Notably, our method works well even in scenarios with visual occlusions. The accurate odometry and high-fidelity map reconstruction obtained with the enhanced radar point clouds highlight the great potential of our method for other downstream tasks in robotics and autonomous driving.

Abstract:
Recently, deep learning-based methods for road crack segmentation have achieved promising performance, particularly in robotic vision applications such as automated inspection and maintenance. However, most frequency-domain methods employ a decoupled processing strategy, overlooking the dynamic modulation mechanism between high- and low-frequency components, which constrains the model's effectiveness in detecting cracks within complex environments. Moreover, existing methods suffer from low information fidelity during feature transmission, where critical encoder details are progressively lost in the decoder, making it difficult to reconstruct complete crack structures. To address these issues, we propose a Bidirectional Frequency-domain Modulation Progressive Fusion Network (BFMPF-Net). Specifically, we propose a Bidirectional Frequency-domain Modulation Enhancement (BFME) module that effectively exploits bidirectional modulation between high- and low-frequency components and learns the spatial weights of high-frequency features to attenuate noise and preserve crack edge details, thereby improving the performance of crack segmentation. Furthermore, the Progressive Guidance Fusion module serves as another core component of our framework. It leverages the spatial prior provided by the original low-resolution image to guide feature refinement via stepwise optimization from coarse contours to fine edges, thereby ensuring the integrity of crack segmentation. Evaluation on three publicly available datasets (CrackTree260, CrackLS315, and Crack760) affirms the superior segmentation accuracy of the proposed BFMPF-Net compared to current mainstream methods.

Abstract:
Recent advances in collaborative perception systems have led to significant improvements in 3D object detection performance. While widely deployed LiDAR and camera systems often experience performance degradation under adverse weather conditions, weather-robust 4D radar offers a promising alternative to address this challenge. However, effectively fusing 4D radar measurements with degraded LiDAR data remains a critical challenge. In this work, we decompose the weather-induced degradation in LiDAR perception into feature attenuation requiring enhancement and feature contamination requiring suppression, based on the underlying physical interactions. Building upon this decomposition, we propose a dual-branch network to handle each degradation pattern in a specialized manner. One branch focuses on enhancement based on spatial and channel attention, guided by 4D radar cues. The other branch focuses on suppression based on intra-modal structural consistency and cross-modal consistency. To achieve adaptive branch integration, we propose a dynamic decision network to generate a decision weight map for each branch and capture the complex interaction between branches. To validate the effectiveness of our method, we conduct extensive experiments on V2X-R, the only publicly available collaborative LiDAR-4D radar dataset. Extensive experimental results demonstrate that our method achieves improvements of 3.65% and 10.80% in mAP@0.7 under fog and snow conditions, respectively, outperforming previous state-of-the-art approaches.

Abstract:
Trajectory prediction for traffic agents is critical for safe autonomous driving. However, achieving effective zero-shot generalization in previously unseen domains remains a significant challenge. Motivated by the consistent nature of kinematics across diverse domains, we aim to incorporate domain-invariant knowledge to enhance zero-shot trajectory prediction capabilities. The key challenges include: 1) effectively extracting domain-invariant scene representations, and 2) integrating invariant features with kinematic models to enable generalized predictions. To address these challenges, we propose a novel generalizable Physics-guided Causal Model (PCM), which comprises two core components: a Disentangled Scene Encoder, which adopts intervention-based disentanglement to extract domain-invariant features from scenes, and a CausalODE Decoder, which employs a causal attention mechanism to effectively integrate kinematic models with meaningful contextual information. Extensive experiments on real-world autonomous driving datasets demonstrate our method's superior zero-shot generalization performance in unseen cities, significantly outperforming competitive baselines. The source code is released at https://github.com/ZY-Zong/Physics-guided-Causal-Model.

Abstract:
Self-supervised pre-training with masked autoencoders has shown promise for 3D perception, yet most approaches treat LiDAR point clouds in a geometry-agnostic manner. In this paper, we introduce Re-MAE, a geometry-aware self-supervised learning framework for LiDAR-based 3D object detection that explicitly encodes core properties of LiDAR point clouds: occlusion, distance-driven sparsity, and occupied-empty voxel structure. Re-MAE rethinks the geometric characteristics of LiDAR point clouds from the perspectives of "what to learn" and "how to learn", and introduces three components: (i) Geometry-Aware Masking, which realistically simulates occlusions in LiDAR scans and enables learning complete object representations from partial observations; (ii) Reconstruction-Contextual BCE loss, which effectively guides a multi-scale occupancy prediction task to mitigate distance-dependent point sparsity and the strong occupied-empty voxel imbalance, improving detection of both large vehicles and small, distant pedestrians; and (iii) Realistic Object Augmentation, a label-free foreground augmentation strategy that promotes object-centric representation learning and yields consistent gains across categories. Experiments on ONCE and the Waymo Open Dataset validate the effectiveness of Re-MAE, delivering gains of 2.83 mAP and 1.53 L2 mAP, respectively, over baselines. These results demonstrate that explicitly incorporating the geometric characteristics of LiDAR point clouds enhances the effectiveness of self-supervised learning. The code will be released.

Abstract:
Robotic sole deburring is a key, yet underexplored, challenge in footwear automation, where the deformable nature of rubber, variability of burrs, and diversity of sole geometries make automation difficult. Existing deburring approaches typically rely on CAD models or large training datasets, and often lack the ability to adapt online during execution. This paper presents a CAD-free, vision-guided framework for robotic deburring of shoe soles that integrates: (i) defect detection using the Segment Anything Model 2 without sole-specific training; (ii) motion planning for burr removal; and (iii) motion execution combining Forward Dynamics Compliance Control with online vision-based path tracking. The framework was validated on a UR5e robot equipped with a custom vacuum gripper. Results demonstrate a 95% success rate across soles of varying sizes, colors, and shapes. By eliminating CAD dependence, ensuring robust online correction, and maintaining compatibility with existing industrial deburring machines, this work provides a scalable step toward robotic finishing solutions in footwear manufacturing.

Abstract:
Autonomous exploration in structured and complex indoor environments remains a challenging task, as existing methods often struggle to appropriately model unobserved space and plan globally efficient paths. To address these limitations, we propose GUIDE, a novel exploration framework that synergistically combines global graph inference with diffusion-based decision-making. We introduce a region-evaluation global graph representation that integrates both observed environmental data and predictions of unexplored areas, enhanced by a region-level evaluation mechanism to prioritize reliable structural inferences while discounting uncertain predictions. Building upon this enriched representation, a diffusion policy network generates stable, foresighted action sequences with significantly reduced denoising steps. Extensive simulations and real-world deployments demonstrate that GUIDE consistently outperforms state-of-the-art methods, achieving up to 18.3% faster coverage completion and a 34.9% reduction in redundant movements.

Abstract:
Robust loop closure detection is a critical component of Simultaneous Localization and Mapping (SLAM) algorithms in GNSS-denied environments, such as in the context of planetary exploration. In these settings, visual place recognition often fails due to aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity. This paper presents MPRF, a multimodal pipeline that leverages transformer-based foundation models for both vision and LiDAR modalities to achieve robust loop closure in severely unstructured environments. Unlike prior work limited to retrieval, MPRF integrates a two-stage visual retrieval strategy with explicit 6-DoF pose estimation, combining DINOv2 features with SALAD aggregation for efficient candidate screening and SONATA-based LiDAR descriptors for geometric verification. Experiments on the S3LI dataset and S3LI Vulcano dataset show that MPRF outperforms state-of-the-art retrieval methods in precision while enhancing pose estimation robustness in low-texture regions. By providing interpretable correspondences compatible with SLAM back-ends, MPRF achieves a favorable trade-off between accuracy, efficiency, and reliability, demonstrating the potential of foundation models to unify place recognition and pose estimation. Code and models will be released at github.com/DLR-RM/MPRF

Abstract:
This paper introduces Mapping-based Tasks for Inspection: Discovery and Allocation (Map-TIDAL), a method for generating environmentally informed tasks and distributing them in a heterogeneous multi-robot system for visual inspection of underwater structures. Map-TIDAL leverages the individual robot maps generated during SLAM (without prior knowledge of the environment) and tasks from all the robots through a communication-aware auction process to determine additional inspection locations as the structures are further explored by the robots. This allows the method to adaptively focus on geometrically interesting areas that need detailed inspection while still maintaining good overall coverage with a reasonably small number of inspection tasks. Experiments on both saline and fresh water tanks show that Map-TIDAL yields better coverage while inspecting areas with interesting geometric features more thoroughly, using equal or fewer inspection locations compared to prevalent coverage methods using Voronoi distributions and boustrophedon patterns.

Abstract:
In this paper, we introduce Context-Aware Priority Sampling (CAPS), a novel method designed to enhance data efficiency in learning-based autonomous driving systems. CAPS addresses the challenge of imbalanced datasets in imitation learning by leveraging Vector Quantized Variational Autoencoders (VQ-VAEs). In this way, we can get structured and interpretable data representations, which help to reveal meaningful patterns in the data. These patterns are used to group the data into clusters, with each sample being assigned a cluster ID. The cluster IDs are then used to re-balance the dataset, ensuring that rare yet valuable samples receive higher priority during training. We evaluate our method through closed-loop experiments in the CARLA simulator. The results on Bench2Drive scenarios demonstrate the effectiveness of CAPS in enhancing model generalization, with substantial improvements in both driving score and success rate.
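
The re-balancing step described above can be illustrated with a minimal sketch. It assumes inverse-frequency weighting over cluster IDs, which is one common choice and not necessarily the scheme used by CAPS; the function name `cluster_balanced_weights` and the toy cluster IDs are hypothetical.

```python
from collections import Counter


def cluster_balanced_weights(cluster_ids):
    """Return one sampling probability per sample.

    Assumed scheme (not taken from the paper): each sample is weighted by the
    inverse of its cluster's size, so every cluster receives the same total
    probability mass and rare clusters are sampled more often per sample.
    """
    counts = Counter(cluster_ids)
    raw = [1.0 / counts[c] for c in cluster_ids]
    total = sum(raw)
    return [w / total for w in raw]


# Toy example: cluster 0 is common (three samples), cluster 1 is rare (one sample).
ids = [0, 0, 0, 1]
weights = cluster_balanced_weights(ids)
```

In this toy case the single rare sample is drawn as often as the three common samples combined. Such weights could be passed to any weighted sampler during training.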

Abstract:
Our work investigates how social robots can act in a user-aware manner by adapting their behaviour to users' personal characteristics and preferences without unnecessarily exposing them to frustration through the robot's actions. In particular, we investigate how implicit social signals inadvertently exhibited by users (e.g. facial expressions) during interactions can be incorporated into user-aware decision-making models while accounting for the systematic limitations of implicit feedback signals (e.g. inconsistency, noise, and dependence on culture and the individual). To this end, we develop a user-aware adaptive decision-making and learning framework for human-robot interactions, building on implicit signal processing, cue-based intent inference, and multi-armed bandit learning techniques. To evaluate our approach, we conduct a user study where participants interact with a Pepper robot in a cafeteria-style interaction scenario, with the robot providing recommendations and taking orders while adapting its behaviour to individual users. The experimental results demonstrate our proposed model's success in adapting its behaviour (i.e. conversational style) to users with different personal characteristics, while receiving 80% positive user feedback, with user questionnaire responses reporting higher perceived usefulness than baseline approaches. Questionnaire responses also illustrate positive user impressions of implicit-signal-based approaches while highlighting the importance of accounting for their limitations in learning models. In addition, we provide a dataset of over 5 hours of human and robot behaviour data extracted from multimodal recordings captured as part of our user study.

Abstract:
Multimodal interaction plays a vital role in human-AI interaction, enabling robots or AI agents to interpret human input from multiple sensory channels and respond through diverse communication modalities. This paper introduces SHAF, an LLM-based multimodal model capable of handling text, image, and human motion as both input and output modalities across different multi-turn conversational settings. In SHAF, vector quantization is employed to convert images and human motion into an aligned set of tokens, followed by pre-training and instruction fine-tuning of a small Large Language Model (LLM) on our newly created SHAF dataset. Experimental results demonstrate that SHAF achieves competitive performance in text-to-motion and motion-to-text tasks in comparison to relevant works, while handling an additional modality and supporting a broader range of tasks. This research contributes an LLM-based multimodal approach, with the aim of fostering deeper exploration of human motion modality in LLMs within the context of HRI and related domains.

Abstract:
Model predictive control (MPC) has demonstrated effectiveness for humanoid bipedal locomotion; however, its applicability in challenging environments, such as rough and slippery terrain, is limited by the difficulty of modeling terrain interactions. In contrast, reinforcement learning (RL) has achieved notable success in training robust locomotion policies over diverse terrain, yet it lacks guarantees of constraint satisfaction and often requires substantial reward shaping. Recent efforts in combining MPC and RL have shown promise of taking the best of both worlds, but they are primarily restricted to flat terrain or quadrupedal robots. In this work, we propose an RL-augmented MPC framework tailored for bipedal locomotion over rough and slippery terrain. Our method parametrizes three key components of single-rigid-body-dynamics-based MPC: system dynamics, swing leg controller, and gait frequency. We validate our approach through bipedal robot simulations in NVIDIA IsaacLab across various terrains, including stairs, stepping stones, and low-friction surfaces. Experimental results demonstrate that our RL-augmented MPC framework produces significantly more adaptive and robust behaviors compared to baseline MPC and RL. Project page: https://rl-augmented-mpc.github.io/rlaugmentedmpc/

Abstract:
Automation in construction is essential for reducing costs and human errors in large-scale projects. We approach construction progress monitoring from the perspective of detecting changes in construction sites. As buildings under construction continue to evolve in geometry and appearance over time, change detection needs to be performed from arbitrary camera viewpoints. This necessitates developing 2D Change Detection (2DCD) algorithms that operate robustly across diverse camera perspectives at construction sites. While developing and evaluating such systems is data-intensive, no open-source benchmark dataset exists at the intersection of 2D change detection and construction automation research. Data collection using Unmanned Aerial Vehicles (UAVs) is gaining popularity in large-scale outdoor surveying. However, conducting drone missions with high-end sensors in active construction sites raises safety concerns, and flight trajectories and the collected camera viewpoints can be significantly limited. To address this critical gap, we introduce iVISION-2DCD, a large-scale dataset synthetically generated from dense LiDAR point clouds, with photorealistic input images and accurate ground truth annotations. Our dataset formally defines the problem of viewpoint-robust 2DCD at construction sites and captures the inherent complexities of real-world deployment. In this paper, we present our systematic methodology for synthetic data generation, developing novel view synthesis techniques to overcome bi-temporal alignment and viewpoint diversity challenges, and implementing semi-automated semantic segmentation with change label generation while preserving challenging real-world cases. Benchmark evaluations using state-of-the-art 2DCD algorithms demonstrate that iVISION-2DCD poses novel research challenges for the computer vision and robotics communities.

Abstract:
Typical robotic workflows involve deploying applications from servers to robots for testing or distributing validated applications across a fleet to unify capabilities. Because these processes are often slowed by tedious environment configurations, we need a way to improve deployment and development efficiency. Existing solutions typically simplify deployment through containerization; however, they often lack integrated development environments, necessitating repetitive packaging for code modifications and hindering iterative efficiency. This paper proposes FINECYCLE, a management paradigm that encapsulates the entire software stack into a robotic image. By facilitating a complete "deploy-develop-store" cycle, FINECYCLE streamlines the transition between cross-host deployment and iterative refinement. Additionally, we open-source image templates compatible with this paradigm to reduce time costs for researchers and foster collaborative progress in the robotics community.

Abstract:
Imitation learning (IL) has proven effective across a wide range of manipulation tasks. However, IL policies often struggle when faced with out-of-distribution observations; for instance, when the target object is in a previously unseen position or occluded by other objects. In these cases, extensive demonstrations are needed for current IL methods to reach robust and generalizable behaviors. But when humans are faced with these sorts of atypical initial states, we often rearrange the environment for more favorable task execution. For example, a person might rotate a coffee cup so that it is easier to grasp the handle, or push a box out of the way so they can directly grasp their target object. In this work we seek to equip robot learners with the same capability: enabling robots to prepare the environment before executing their given policy. We propose ReSET, an algorithm that takes initial states that are outside the policy's distribution and autonomously modifies object poses so that the restructured scene is similar to training data. Theoretically, we show that this two-step process (rearranging the environment before rolling out the given policy) reduces the generalization gap. Practically, our ReSET algorithm combines action-agnostic human videos with task-agnostic teleoperation data to i) decide when to modify the scene, ii) predict what simplifying actions a human would take, and iii) map those predictions into robot action primitives. Comparisons with diffusion policies, VLAs, and other baselines show that using ReSET to prepare the environment enables more robust task execution with equal amounts of total training data.

Abstract:
Autonomous navigation of ground robots in unstructured 3D environments remains a fundamental challenge, as it requires accommodating dynamic obstacles, non-planar ground, and multi-story structures within a unified framework. In this paper, we propose a versatile navigation framework named FineNav. It features a novel hierarchical mapping system that couples a high-rate local voxel grid for real-time perception with a scalable global octree for persistent storage. This design balances low-latency performance with large-scale mapping capabilities, enabling reliable navigation in unstructured environments. Moreover, the entire navigation pipeline is refactored into modular and reusable components, while maintaining compatibility with existing 2D navigation ecosystems. We validate FineNav on a wheeled robot, demonstrating its versatility across diverse scenarios. FineNav is released as open-source software for the community.

Abstract:
This study proposes a novel concept of an electric tractor that can flexibly respond to tasks with different traction force requirements. The key idea is to attach motor-integrated additive driving wheel units (AddTraX) to the rear wheel of the tractor according to the required traction force. The driving wheel units must allow manual attachment by the operator and must control their height while running so that all wheels maintain contact with the ground. Driving experiments were conducted using a single-side 1/4-scale model of the proposed driving wheel units and simplified models of several road conditions. On a paved road, attaching the additional driving wheel units enhances the traction force by 1.9 times, and wheel height control is unnecessary. On a soft unpaved road, traction force is increased by controlling the height of the driving wheel units when the vehicle weight is low. Furthermore, the experiments also confirm that the additional driving wheel units can help the vehicle overcome steps on uneven roads.

Abstract:
End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common inefficiency in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird's-eye-view (BEV) representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle's actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.

Abstract:
Large-area tactile sensing remains a key challenge for wearable and robotic applications, where solutions must balance resolution and complexity, manufacturability, and conformability to various geometries. While acoustic waveguides have been used for contact localization and force estimation at the centimeter scale, scaling this technology to limb-scale wearable devices is unexplored. In this work, we introduce a soft, wearable tactile sleeve based on wrapped and meter-length acoustic waveguides. By patterning waveguides on a sleeve, one-dimensional time-of-flight measurements are mapped to two-dimensional contact locations. This enables conformable coverage with sparse transducers, while preserving mechanical robustness by placing rigid electronics away from the contact surface. We contribute the design and fabrication of the waveguide-based tactile sensor, provide an in-depth characterization of sensor response and evaluate frameworks for contact localization and force estimation, and demonstrate system performance on a human arm. Results show that the time-of-flight-based localization approach generalizes across contact sizes and curved geometries. However, more work is required to achieve sensitive and reliable force estimates. This work establishes acoustic waveguides as a manufacturable and reconfigurable modality for wearable tactile skins.

Abstract:
Maintaining connectivity in multi-agent systems often compromises task performance. Current strategies are frequently hampered by heavy communication loads and overly restrictive motion constraints. Furthermore, their local decision-making relies on static geometric information, neglecting agent dynamics. To address these shortcomings, this paper proposes a scalable, distributed framework centered on a novel dynamics-aware connection cost metric. This metric enables agents to prospectively select dynamically stable, task-compatible links, which are then enforced using control barrier functions (CBFs) within a cooperative optimization scheme. In multi-agent target-reaching tasks, simulations show our dynamics-aware metric reduces the final average goal distance by up to 26.1% compared to a static distance-based selection heuristic. Furthermore, our framework maintains persistent connectivity in highly dynamic scenarios, whereas a state-of-the-art algebraic connectivity-based method fails under limited communication bandwidth.
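
To make the CBF-based link enforcement concrete, the following is a minimal sketch of the standard control barrier function condition for keeping two agents within communication range. It assumes single-integrator agents (velocity is the control input) and the textbook barrier h = R^2 - ||p_i - p_j||^2; it does not reproduce the paper's dynamics-aware connection cost or its cooperative optimization scheme, and the function name and parameters are hypothetical.

```python
def connectivity_cbf_ok(p_i, p_j, v_i, v_j, comm_range, alpha=1.0):
    """Check the CBF condition h_dot + alpha * h >= 0 for one link.

    h = comm_range^2 - ||p_i - p_j||^2 is non-negative while the two agents
    are within range. For single-integrator agents (an assumption here),
    h_dot = -2 * (p_i - p_j) . (v_i - v_j).
    """
    rel_p = [a - b for a, b in zip(p_i, p_j)]
    rel_v = [a - b for a, b in zip(v_i, v_j)]
    h = comm_range ** 2 - sum(c * c for c in rel_p)
    h_dot = -2.0 * sum(p * v for p, v in zip(rel_p, rel_v))
    return h_dot + alpha * h >= 0.0


# Two agents 1 m apart with a 2 m communication range.
stationary = connectivity_cbf_ok((0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0), 2.0)
separating = connectivity_cbf_ok((0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (5.0, 0.0), 2.0)
```

In an optimization-based controller this inequality would typically appear as a linear constraint on the velocities rather than as an after-the-fact check.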

Abstract:
To navigate the environment with dynamic obstacles, a robot must continuously scan for them and find collision-free paths to reach a goal position. This process starts with receiving obstacle information in the form of a point cloud, followed by a pre-planning stage that involves preprocessing to remove unnecessary points and constructing an environment data structure. However, the pre-planning stage can consume more than 16× the runtime of the planning stage, slowing the robot's reaction speed. Thus, in this work, we propose VCC, an efficient collision checking framework that primarily targets the pre-planning bottleneck. VCC first cleans the point cloud using Center-selective Voxel Filtering. It then divides the environment into voxels using Adaptive Workspace Voxelization and organizes them in a Multilevel Voxel Table (MVT). In addition, VCC manages the MVT in two memory pools to ensure high data locality and SIMD-aligned data layout. During motion planning, the planner can perform low-latency SIMD-accelerated collision checking using the MVT. Compared to the state-of-the-art method, the experimental results show a 3.63× speedup in filtering. In terms of environment data structure, MVT achieves a 220.48× speedup during construction and reduces memory usage by 97.73%. Additionally, VCC accelerates sampling-based planning by 1.94×. Altogether, VCC achieves an end-to-end speedup of 7.71× on the desktop CPU platform and 4.23× on the embedded computer platform, making real-time motion planning practical for resource-constrained edge devices.
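
As a rough illustration of the point-cloud cleaning step, the sketch below implements a generic voxel filter that keeps, for each occupied voxel, the point closest to the voxel center. This is only one plausible reading of "center-selective" filtering and is an assumption; the abstract does not specify the actual VCC filter, and the function name and sample cloud are hypothetical.

```python
import math


def voxel_filter_nearest_center(points, voxel_size):
    """Keep one point per occupied voxel: the one closest to the voxel center."""
    best = {}  # voxel index -> (squared distance to voxel center, point)
    for p in points:
        # Integer voxel index along each axis.
        idx = tuple(math.floor(c / voxel_size) for c in p)
        center = tuple((i + 0.5) * voxel_size for i in idx)
        d2 = sum((c - m) ** 2 for c, m in zip(p, center))
        if idx not in best or d2 < best[idx][0]:
            best[idx] = (d2, p)
    return [p for _, p in best.values()]


# The first two points share a voxel; only the one nearer its center survives.
cloud = [(0.1, 0.1, 0.1), (0.45, 0.5, 0.55), (1.2, 0.3, 0.3)]
filtered = voxel_filter_nearest_center(cloud, voxel_size=1.0)
```

A production version would use contiguous arrays rather than a dictionary of tuples so that the layout suits SIMD processing.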

Abstract:
Access to real-world data in robotics domains is often challenging due to restrictions on data sharing and limited availability. Although privacy and intellectual property concerns are the main barriers, ensuring data access is crucial for advancing data-driven models. Specifically, machine-learning-based inverse dynamic models show promising results for nonrigid robot identification, but the data used to train them are often kept private due to intellectual property protections. Federated learning proposes a methodology to access such data without centralizing them in a single repository, thus avoiding intellectual property limitations. We propose a solution that uses federated learning to train a model from distributed data to develop a robust robotic arm inverse dynamic model. Our approach demonstrates the feasibility of using a machine learning method in which local robots train on their own data while collaborating without sharing raw information. Furthermore, we propose a novel custom aggregation method that integrates locally learned solutions from different workspaces into a single global model without requiring raw data sharing. This method improves accuracy in our federated solution by approximately 20% for the learned inverse dynamic model.
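
For readers unfamiliar with federated aggregation, the sketch below shows the standard sample-weighted parameter average (often called FedAvg) that a server could use to merge locally trained models without seeing raw data. It is a common baseline, not the custom aggregation method the abstract proposes; the function name and the toy parameter vectors are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Sample-weighted average of per-client parameter vectors.

    Standard FedAvg-style baseline, shown for illustration only.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical robots share only their learned parameters and sample counts.
local_models = [[1.0, 2.0], [3.0, 6.0]]
sample_counts = [100, 300]
global_model = federated_average(local_models, sample_counts)
```

The robot with more samples pulls the global parameters toward its local solution, which is the behavior a workspace-aware aggregation scheme would presumably refine.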

Abstract:
The hierarchical nature of 3D scene graphs aligns well with the structure of man-made environments, making them highly suitable for representation purposes. Beyond this, however, their embedded semantics and geometry could also be leveraged to improve the efficiency of map and pose optimization, an opportunity that has been largely overlooked by existing methods. We introduce Situational Graphs 2.0 (S-Graphs 2.0), which effectively uses the hierarchical structure of indoor scenes for efficient data management and optimization. Our approach builds a four-layer situational graph comprising Keyframes, Walls, Rooms, and Floors. Our first contribution lies in the front-end, which includes a floor detection module capable of identifying stairways and assigning floor-level semantic relations to the underlying layers (Keyframes, Walls, and Rooms). Floor-level semantics allow us to propose a floor-based loop closure strategy that effectively rejects the false positive closures that typically appear due to aliasing between different floors of a building. Our second contribution lies in leveraging our representation hierarchy in the optimization. Our proposal consists of: (1) local optimization over a window of recent keyframes and their connected components across the four representation layers, (2) floor-level global optimization, which focuses only on keyframes and their connections within the current floor during loop closures, and (3) room-level local optimization, marginalizing redundant keyframes that share observations within the room, which reduces the computational footprint. We validate our algorithm extensively in different real multi-floor environments. Our approach shows state-of-the-art accuracy metrics in large-scale multi-floor environments, estimating hierarchical representations up to 10x faster, on average, than competing baselines.

Abstract:
Robots operating in unpredictable environments require versatile, hardware-agnostic frameworks capable of adapting to various tasks. While a recent screw-based affordance approach shows promise, it faces challenges in avoiding undesirable configurations, singularity navigation, and task success prediction. To address these limitations, we propose a novel framework that incorporates gripper orientation control and generates complete joint trajectories in real time for screw-based task-affordance execution. Our method models the affordance and manipulator as a closed-chain mechanism, introducing an innovative approach to solving closed-chain inverse kinematics. It encapsulates task constraints and simplifies task definitions, while remaining hardware and robot agnostic, robust to errors, and invariant to the initial grasp. We validate our framework with simulations on a UR5 robot and real-world implementation on a Boston Dynamics Spot robot. Our experiments demonstrate rapid joint trajectory generation (0.0077 to 0.098 s) for various tasks, including a 420-degree valve turn with consideration of the gripper orientation. Comparison with the state-of-the-art methods shows a 4x improvement in planning time, reduced joint movement, and achievement of greater task goals. Video demonstrations and the open-source code for this project are available online.

Abstract:
Multi-view stereo (MVS) implicitly encodes photometric and geometric cues into the cost volume for multi-view correspondence matching, transferring insufficient geometric cues essential to depth estimation and reconstruction. This paper proposes GE-MVS, a novel multi-view stereo network with geometric encoding for more accurate and complete depth estimation and point cloud reconstruction. First, the cross-view adaptive cost volume aggregation module is proposed to strengthen the encoding of multi-view geometric cues during cost volume construction. Then, the depth consistency optimization is performed in 3D point space during learning by invoking ground-truth depth cues from adjacent views. Finally, the surface normal geometries are explicitly encoded to refine the sampled depth hypotheses to be consistent in the local neighbor regions. Extensive experiments on the standard MVS benchmarks including DTU, Tanks and Temples, and BlendedMVS demonstrate the state-of-the-art depth estimation and point cloud reconstruction performance of GE-MVS. The GE-MVS is further deployed in real-world experiments for UAV-based large-scale reconstruction, where our method outperforms the prevalent industrial reconstruction solutions in terms of reconstruction efficiency and effectiveness.

Abstract:
Humanoid locomotion is a challenging task due to its inherent complexity and high-dimensional dynamics, as well as the need to adapt to diverse and unpredictable environments. In this work, we introduce a novel learning framework for effectively training a humanoid locomotion policy that imitates the behavior of a model-based controller while extending its capabilities to handle more complex locomotion tasks, such as more challenging terrain and higher velocity commands. Our framework consists of three key components: pre-training through imitation of the model-based controller, fine-tuning via reinforcement learning, and model-assumption-based regularization (MAR) during fine-tuning. In particular, MAR aligns the policy with actions from the model-based controller only in states where the model assumption holds to prevent catastrophic forgetting. We evaluate the proposed framework through comprehensive simulation tests and hardware experiments on a full-size humanoid robot, Digit, demonstrating a forward speed of 1.5 m/s and robust locomotion across diverse terrains, including slippery, sloped, uneven, and sandy terrains.

Abstract:
This letter presents an analytical linear parameter-varying (LPV) representation of quadrotor dynamics utilizing Koopman theory, facilitating computationally efficient linear model predictive control (LMPC) for real-time trajectory tracking. By leveraging carefully designed Koopman observables, the proposed approach enables a compact lifted-space evolution that mitigates the curse of dimensionality while preserving the nonlinear characteristics of the system. Although model predictive control (MPC) is a powerful strategy for quadrotor control, it faces a trade-off between the high computational cost of nonlinear MPC (NMPC) and the reduced accuracy of LMPC. To address this gap, we introduce KQ-LMPC (Koopman Quasilinear LPV MPC), which leverages the Koopman-lifted LPV formulation to enforce constraints, ensure lower computational burden and real-time feasibility, and deliver tracking performance comparable to NMPC. Experimental validation confirms the effectiveness of the framework in reasonably agile flight. To the best of our knowledge, this is the first experimentally validated LMPC for quadrotors that employs analytically derived Koopman observables without requiring training data.

Abstract:
Impedance control is a widely adopted approach that ensures the compliant behavior of robot manipulators as they interact with their environment according to specifically designed dynamics. For tasks involving six degrees of freedom (DoF), it is crucial to appropriately manage the position and orientation of the end effector by controlling dynamic behavior. However, describing orientational displacement and designing the corresponding rotational impedance can be challenging, especially when we use a minimal representation. The well-known minimal representation for orientation, the Euler angle, suffers from representation singularity. As a remedy, the quaternion or dual quaternion can be an alternative but with non-minimal representations. This lack of minimal representation, which does not suffer from the representation singularity, often leads to handling the impedance design by directly defining the potential energy function in the matrix Lie group. This paper proposes a framework for the six-DoF impedance control design that takes advantage of Lie group theory with minimal representation, known as exponential coordinate. Since the exponential coordinate can be treated as the Eu

Abstract:
Quadrupedal robots deployed for load-carrying applications must maintain stable locomotion across diverse terrains and varying payloads. Traditional approaches like Model Predictive Control (MPC) can handle such variations but often rely on predefined gait schedules and manually tuned trajectory planners, limiting adaptability in unstructured environments. To address this, we propose an adaptive reinforcement learning (RL) framework that enables quadrupedal robots to respond dynamically to terrain and payload changes without relying on contact force measurements or gait designs. The controller consists of a nominal policy that learns general locomotion across terrains and an adaptive policy that outputs corrective actions for handling dynamic variations due to payloads. We validate our approach through extensive simulations in Isaac Gym across payloads (210 kg) and terrains including flat ground, slopes, and stairs. Our method achieves higher success rates and lower height-tracking errors while maintaining the Cost of Transport (CoT) comparable to the best-performing baselines and to no-load (NL) operation. Real-world deployment on a Unitree Go1 confirms the approach's effectiveness under both static and dynamic payload changes, including freely moving masses. The policy also performs well on outdoor terrains such as grass, soil, and staircases. The adaptive policy modulates corrections based on payload changes, improving body stability and tracking without post-deployment fine-tuning.

Abstract:
This paper presents a formation control approach for contactless gesture-based Human-Swarm Interaction (HSI) between a team of multi-rotor Unmanned Aerial Vehicles (UAVs) and a human worker. The approach is designed to monitor the safety of human workers, particularly those operating at heights. In the proposed dynamic formation scheme, one UAV acts as the formation leader, equipped with sensors for detecting human workers and recognizing gestures. The follower UAVs maintain a predetermined formation relative to the worker's position, providing additional perspectives of the monitored scene. Hand gestures enable the human worker to specify movement and action commands for the UAV team and to initiate other mission-related tasks without requiring additional communication channels or specific markers. Combined with a novel unified human detection and tracking algorithm, a human position estimation method, and a gesture detection pipeline, the proposed approach represents the first instance of an HSI system incorporating all these modules onboard real-world UAVs. Simulations and field experiments involving three UAVs and a human worker in a mock-up scenario demonstrate the effectiveness and responsiveness of the proposed approach.

Abstract:
Software frameworks like the Stack of Tasks (SoT), the Stanford Whole-Body Control (WBC) library, or the instantaneous Task Specification using Constraints (iTaSC) have enabled robots to perform advanced, contact-oriented manipulation tasks. jgeom constr and eTaSL are among the few formal, computer-interpretable languages that allow users to specify such tasks independent of these frameworks. We analyse these languages for their limitations with respect to composability, the design for extensibility without having to change existing models, and compositionality, meaning that the semantics of compositions unambiguously follows from the semantics of the components and of the composition relations. To overcome these limitations we design a graph-structured and well-defined interchange format for such tasks. The associated tooling enables us to generate correct-by-construction code that adheres to predefined rules and constraints. We showcase our models and toolchain by incrementally constructing a workspace-alignment application for a highly-redundant mobile platform that is equipped with two 7-DoF, torque-controlled manipulators.

Abstract:
Multi-robot simultaneous localization and mapping (SLAM) is a fundamental task in multi-robot operations. Robots must have a common understanding of their location and that of their team members to complete coordinated actions. However, multi-robot SLAM between Uncrewed Surface Vessels (USVs) and Autonomous Underwater Vehicles (AUVs) has primarily been achieved through acoustic pinging between robots to retrieve range measurements, a measurement technique that requires robots to be in similar locations simultaneously and to have an uninterrupted path for signal propagation, and that may necessitate synchronized clocks. This is especially challenging in complex, cluttered maritime environments, where structures may impede signals. However, these same structures may be observable above and below the water's surface, presenting an opportunity for inter-robot SLAM loop closure between USV and AUV data streams. This work builds upon recent research on inter-robot SLAM loop closure between USV and AUV data, extending it to propose a centralized multi-robot SLAM system. Each robot performs its own state estimation, and we detect loop closures between each AUV and the USV data. These inter-robot loop closures are used to merge each robot's state estimate into a centralized graph, yielding estimates for the whole time history of the USV and all AUVs in the system. Validation is performed using real-world perceptual data in three different environments. Results show reduced errors for AUVs in the multi-robot SLAM system compared to single-robot SLAM over the same trajectories. To our knowledge, this is the first instance of a multi-robot SLAM system with AUVs and USVs built on loop closures rather than acoustic distance measurements.

Abstract:
Trajectory estimation involves determining the trajectory of a mobile robot by combining prior knowledge about its dynamic model with noisy observations of its state obtained using sensors. The accuracy of such a procedure is dictated by the system model fidelity and the sensor parameters, such as the accuracy of the sensor (as represented by its noise covariance) and the rate at which it can generate observations, referred to as the sensor query schedule. Intuitively, high-rate measurements from accurate sensors lead to accurate trajectory estimation. However, cost and resource constraints limit the sensor accuracy and its measurement rate. Our work's novel contribution is the estimation of sensor schedules and sensor covariances necessary to achieve a specific estimation accuracy. Concretely, we focus on estimating: (i) the rate or schedule with which a sensor of known covariance must generate measurements to achieve specific estimation accuracy, and alternatively, (ii) the sensor covariance necessary to achieve specific estimation accuracy for a given sensor update rate. We formulate the problem of estimating these sensor parameters as semidefinite programs, which can be solved by off-the-shelf solvers. We validate our approach in simulation and real experiments by showing that the sensor schedules and the sensor covariances calculated using our proposed method achieve the desired trajectory estimation accuracy. Our method also identifies scenarios where certain estimation accuracy is unachievable with the given system and sensor characteristics.
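
To make the link between sensor rate, sensor covariance, and achievable accuracy concrete, the sketch below works through a scalar toy case: a random-walk state with process variance `q` per step, observed every `n` steps by a sensor with noise variance `r`. It iterates the scalar Kalman covariance recursion and searches by brute force for the sparsest schedule that meets a target variance. This is an assumed illustrative model and a brute-force search, not the semidefinite-programming formulation described in the abstract; all names and numbers are hypothetical.

```python
def peak_variance(q, r, n, iters=200):
    """Steady-state variance just before each measurement.

    Scalar random walk (variance q added per step), measured every n steps
    with noise variance r. Assumed toy model for illustration.
    """
    p = 1.0  # arbitrary initial posterior variance
    peak = p
    for _ in range(iters):
        peak = p + n * q               # prediction over n steps
        p = peak * r / (peak + r)      # scalar Kalman measurement update
    return peak


def longest_interval(q, r, target, n_max=100):
    """Largest measurement interval whose peak variance stays within target.

    Returns None when even measuring at every step cannot meet the target.
    """
    best = None
    for n in range(1, n_max + 1):
        if peak_variance(q, r, n) <= target:
            best = n
        else:
            break  # peak variance only grows with the interval
    return best


schedule = longest_interval(q=0.01, r=0.04, target=0.05)
infeasible = longest_interval(q=0.01, r=0.04, target=0.02)
```

In this toy setting the steady-state peak variance has the closed form (nq + sqrt(n^2 q^2 + 4nqr)) / 2, so with q = 0.01 and r = 0.04 a measurement every 2 steps gives 0.04, while every 3 steps already exceeds 0.05. The second call returns `None`, mirroring the abstract's point that some accuracy targets are unachievable for a given system and sensor.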

Abstract:
Conventional motor-driven wearable robots often suffer from increased weight and limited torque output. To address this issue, this study proposes a motor-SMA hybrid actuation approach that combines the advantages of electric motors and shape memory alloy (SMA) actuators. A dedicated testbed was developed to evaluate the proposed method under varying load conditions. Experimental results show that the SMA actuator provides additional assistive torque of approximately 3 N·m compared to motor-only operation, without a significant increase in system weight. These results demonstrate the feasibility of hybrid actuation for achieving lightweight and high-performance wearable robotic systems.

Abstract:
Electroadhesion suction cups (EASCs) are fully electrical grippers (no air flow needed) with very low power consumption that can grasp flat to curved objects from the top. They conform to the shape of the object by zipping from the central contact point to their edges, driven by electroadhesion forces. Zipping requires elastically deforming the EASC membrane. The object's surface curvature at the contact point strongly affects zipping ability, and therefore grasp feasibility. We developed a model for grasping point selection that predicts the voltage required for full zipping at a point of given local curvature. Feasible points are those where the estimated zipping voltage is lower than the breakdown voltage of the EASC. The model is based on an energy balance between electrostatic work and elastic deformation, explicitly including in-plane stretching on doubly curved surfaces. Experiments on cylinders, spheres, and ellipsoids validate the predicted thresholds and curvature-dependent trends.

Abstract:
Multi-robot task planning requires decomposing natural-language instructions into executable actions for heterogeneous robot teams. Conventional Planning Domain Definition Language (PDDL) planners provide rigorous guarantees but struggle to handle ambiguous or long-horizon missions, while large language models (LLMs) can interpret instructions and propose plans but may hallucinate or produce infeasible actions. We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner. When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy. In addition, meta-prompts are learned and shared across agents within the same layer, enabling efficient prompt optimization in multi-agent settings. On the MAT-THOR benchmark, our planner achieves success rates of 0.95 on compound tasks, 0.84 on complex tasks, and 0.60 on vague tasks, improving over the previous state-of-the-art LaMMA-P by 2, 7, and 15 percentage points respectively. An ablation study shows that the hierarchical structure, prompt optimization, and meta-prompt sharing contribute roughly +59, +37, and +4 percentage points to the overall success rate.

Abstract:
In high-density environments where numerous autonomous agents move simultaneously in a distributed manner, streamlining global flows to mitigate local congestion is crucial to maintain overall navigation efficiency. This paper introduces a novel path-planning problem, congestion mitigation path planning (CMPP), which embeds congestion directly into the cost function, defined by the usage of incoming edges along agents' paths. CMPP assigns a flow-based multiplicative penalty to each vertex of a sparse graph, which grows steeply where frequently-traversed paths intersect, capturing the intuition that congestion intensifies where many agents enter the same area from different directions. Minimizing the total cost yields a set of coarse-level, time-independent routes that autonomous agents can follow while applying their own local collision avoidance. We formulate the problem and develop two solvers: (i) an exact mixed-integer nonlinear programming solver for small instances, and (ii) a scalable two-layer search algorithm, A-CMTS, which quickly finds suboptimal solutions for large-scale instances and iteratively refines them toward the optimum. Empirical studies show that augmenting state-of-the-art collision-avoidance planners with CMPP significantly reduces local congestion and enhances system throughput in both discrete- and continuous-space scenarios. These results indicate that CMPP improves the performance of multi-agent systems in real-world applications such as logistics and autonomous-vehicle operations.

Abstract:
Right heart catheterization (RHC) is a critical procedure for diagnosing and managing cardiovascular diseases such as heart failure, congenital heart disease, pulmonary edema, and pulmonary hypertension. However, currently prevalent manual RHC procedures require continuous communication between clinicians in the main control room and the operating room, leading to navigation inaccuracies and increased physical workload for clinicians during prolonged procedures. To overcome these challenges, this paper introduces a robotic system that enables autonomous RHC (Auto-RHC) by transferring a catheter decision-making model from patient-specific digital twins to real-world robotic intervention using deep learning algorithms. By creating a high-fidelity digital twin using the Simulation Open Framework Architecture and conducting virtual RHC interventions, images capturing the catheter balloon's position and aligned behavioral datasets were collected and utilized as input for a convolutional neural network architecture. The trained catheter decision-making model derived from the digital twin was then transferred to real-world implementations of robot-assisted Auto-RHC.

Abstract:
Accurately determining the shape of objects and the location of their internal structures within deformable objects is crucial for medical tasks that require precise targeting, such as robotic biopsies. We introduce LUDO, a method for accurate low-latency understanding of deformable objects. LUDO reconstructs objects in their deformed state, including their internal structures, from a single-view point cloud observation in under 30 ms using occupancy networks. LUDO provides uncertainty estimates for its predictions. Additionally, it provides explainability by highlighting key features in its input observations. Both uncertainty and explainability are important for safety-critical applications such as surgical interventions. We demonstrate LUDO's abilities for autonomous targeting of internal regions of interest (ROIs) in deformable objects. We evaluate LUDO in real-world robotic experiments, achieving a success rate of 98.9% for puncturing various ROIs inside deformable objects. LUDO demonstrates the potential to interact with deformable objects without the need for deformable registration methods.

Abstract:
We study noncooperative games in which each player's objective is composed of a sequence of ordered, and potentially conflicting, preferences. Problems of this type naturally model a wide variety of scenarios: for example, drivers at a busy intersection must balance the desire to make forward progress with the risk of collision. Mathematically, these problems possess a nested structure, and to behave properly players must prioritize their most important preference, and only consider less important preferences to the extent that they do not compromise performance on more important ones. We consider multi-agent, noncooperative variants of these problems, and seek generalized Nash equilibria in which each player's decision reflects both its hierarchy of preferences and other players' actions. We make two key contributions. First, we develop a recursive approach for deriving the first-order optimality conditions of each player's nested problem. Second, we propose a sequence of increasingly tight relaxations, each of which can be transcribed as a mixed complementarity problem and solved via existing methods. Experimental results demonstrate that our approach reliably converges to equilibrium solutions that strictly reflect players' individual ordered preferences.

Abstract:
We present a Bayesian Neural Radiance Field (NeRF), which explicitly quantifies uncertainty in the volume density by modeling uncertainty in the occupancy, without the need for additional networks, making it particularly suited for challenging observations and uncontrolled image environments. NeRF diverges from traditional geometric methods by providing an enriched scene representation, rendering color and density in 3D space from various viewpoints. However, NeRF encounters limitations in addressing uncertainties solely through geometric structure information, leading to inaccuracies when interpreting scenes with insufficient real-world observations. While previous efforts have relied on auxiliary networks, we propose a series of formulation extensions to NeRF that manage uncertainties in density, both color and density, and occupancy, all without the need for additional networks. In experiments, we show that our method significantly enhances performance on RGB and depth images in the comprehensive dataset. Given that uncertainty modeling aligns well with the inherently uncertain environments of Simultaneous Localization and Mapping (SLAM), we applied our approach to SLAM systems and observed notable improvements in mapping and tracking performance. These results confirm the effectiveness of our Bayesian NeRF approach in quantifying uncertainty based on geometric structure, making it a robust solution for challenging real-world scenarios.

Abstract:
Hand-drawn maps can be used to convey navigation instructions between humans and robots in a natural and efficient manner. However, these maps can often contain inaccuracies such as scale distortions and missing landmarks which present challenges for mobile robot navigation. This paper introduces a novel Hand-drawn Map Navigation (HAM-Nav) architecture that leverages pre-trained vision language models (VLMs) for robot navigation across diverse environments, hand-drawing styles, and robot embodiments, even in the presence of map inaccuracies. HAM-Nav integrates a unique Selective Visual Association Prompting approach for topological map-based position estimation and navigation planning as well as a Predictive Navigation Plan Parser to infer missing landmarks. Extensive experiments were conducted in photorealistic simulated environments, using both wheeled and legged robots, demonstrating the effectiveness of HAM-Nav in terms of navigation success rates and Success weighted by Path Length. Furthermore, a user study in real-world environments highlighted the practical utility of hand-drawn maps for robot navigation as well as successful navigation outcomes compared against a non-hand-drawn map approach.

Abstract:
Harmonious human-robot collaboration requires the robot to behave like a human partner, which raises the critical question of what factors make the robot do so. This paper proposes a series of policies based on empathetic and non-empathetic intent inference, proactive and reactive action planning, and ego and non-ego action styles to examine which modules enable robots to exhibit human-like behaviors. Two series of experiments are conducted with human subjects to test the performance of the proposed controllers. In Experiment 1, the participant must identify whether the collaborating partner is a human, similar to a Turing test. The classification results empirically verify that the designed empathetic proactive policies enable the robot to exhibit human-like behaviors. Experiment 2 indicates that the proposed policy can be applied to complex collaborative tasks, and this result is consistent with the findings of Experiment 1. From empirical evidence from the experiments, we believe that empathy and proactive policies are essential elements to enable robots to perform human-like actions.

Abstract:
Robust motion planning under uncertainty is a critical challenge for applications involving real-world robotic deployments. This paper introduces SupeR-MPC, a computationally-efficient, sensitivity-aware, chance-constrained optimization framework that systematically accounts for multiple sources of uncertainty, including state estimation error, model parameter uncertainty, obstacle localization error, and process noise. This approach advances sensitivity-aware robust control by integrating chance-constrained optimization to handle the uncertainty models of Kalman-filtering methods. To demonstrate robustness against multiple uncertainty sources, SupeR-MPC was validated on a range of systems and environments, from a simple 2D example to a multi-agent dynamic obstacle avoidance scenario. Comparisons against existing MPC methods show that SupeR-MPC significantly improves constraint satisfaction and robustness while maintaining real-time computational efficiency. These results highlight the effectiveness of sensitivity-aware chance constraints in enhancing real-world robotic decision-making under uncertainty.

Abstract:
Open-vocabulary indoor three-dimensional object detection (OVI3DOD) is used to detect any class of objects in indoor scenes with prompts. Owing to the relatively limited three-dimensional (3D) data, most of the OVI3DOD algorithms perform training with pseudo labels transformed from the open-vocabulary 2D detection results. For indoor scenes, point clouds are sparse and incomplete. Moreover, there is a gap between different modalities, especially for distant objects. However, existing OVI3DOD algorithms ignore this problem, which weakens the detection performance. Therefore, we propose the Compensation towards Modal Gap for open-vocabulary indoor 3D object detection (CMG3D) approach. CMG3D consists of three modules: multimodal compensation (MC), object proposal filtering (OPF) and pseudo label refinement and generation (PLRG). For the MC, features from images are converted into pseudo voxel space and then summed with the voxel space of the point cloud, which is used to compensate for the modality gap. For the OPF, we filter the object proposals to avoid confusion between the foreground and background. For the PLRG, the predictions from the two-dimensional (2D) detector are refined by the multimodal large language model (LLM) SIGLIP and then transformed to 3D pseudo labels for the training process. Finally, we evaluate CMG3D on two indoor datasets, SUN RGB-D and ScanNet, and achieve state-of-the-art results.

Abstract:
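Multi-robot motion planning in the environment poses challenges for safe and efficient coordination, yet a fair, unified testbed for evaluating diverse algorithms is lacking. We propose DMRP-Bench, a comprehensive framework designed to fill this gap. It adopts a hierarchical architecture that integrates global and local planners, enabling a comprehensive analysis of their combinations in terms of both macroscopic system-level outcomes and fine-grained inter-robot interactions. High-fidelity indoor scenes (e.g., library, mall, office) that simulate diverse spatial layouts and pedestrian dynamics are all built in NVIDIA Isaac Sim. Extensive experiments on sixteen planner combinations not only reveal key trade-offs between trajectory efficiency and safety but also facilitate a deeper evaluation of inter-robot coordination. By correlating path-execution fidelity with interaction outcomes, these experiments enable quantitative diagnosis of local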

Abstract:
Modern autonomous driving systems increasingly rely on mixed camera configurations with pinhole and fisheye cameras for full view perception. However, Bird's-Eye View (BEV) 3D object detection models are predominantly designed for pinhole cameras, leading to performance degradation under fisheye distortion. To bridge this gap, we introduce a multi-view BEV detection benchmark with mixed cameras by converting KITTI-360 into nuScenes format. Our study encompasses three adaptations: rectification for zero-shot evaluation and fine-tuning of nuScenes-trained models, distortion-aware view transformation modules (VTMs) via the MEI camera model, and polar coordinate representations to better align with radial distortion. We systematically evaluate three representative BEV architectures, BEVFormer, BEVDet and PETR, across these strategies. We demonstrate that projection-free architectures are inherently more robust and effective against fisheye distortion than other VTMs. This work establishes the first real-data 3D detection benchmark with fisheye and pinhole images and provides systematic adaptation and practical guidelines for designing robust and cost-effective 3D perception systems. The code is available at https://github.com/CesarLiu/FishBEVOD.git.

Abstract:
Cross-modal transfer learning is used to improve multi-modal classification models (e.g., for human activity recognition in human-robot collaboration). However, existing methods require paired sensor data at both training and inference, limiting deployment in resource-constrained environments where full sensor suites are not economically or technically feasible. To address this, we propose Decoupled Cross-Attention Knowledge Transfer (D-CAT), a framework that aligns modality-specific representations without requiring joint sensor modalities during inference. Our approach combines a self-attention module for feature extraction with a novel cross-attention alignment loss, which enforces the alignment of sensors' feature spaces without requiring the coupling of the classification pipelines of both modalities. We evaluate D-CAT on three multi-modal human activity datasets (IMU, video, and audio) under both in-distribution and out-of-distribution scenarios, comparing against uni-modal models. Results show that in in-distribution scenarios, transferring from high-performing modalities (e.g., video to IMU) yields up to +10% F1-score gains over uni-modal training. In out-of-distribution scenarios, even weaker source modalities (e.g., IMU to video) improve target performance, as long as the target model is not overfitted to the training data. By enabling single-sensor inference with cross-modal knowledge, D-CAT reduces hardware redundancy for perception systems while maintaining accuracy, which is critical for cost-sensitive or adaptive deployments (e.g., assistive robots in homes with variable sensor availability). Code is available at [url will be added upon acceptance].

Abstract:
In robotics, diffusion models can capture multi-modal trajectories from demonstrations, making them a transformative approach in imitation learning. However, achieving optimal performance following this regimen requires a large-scale dataset, which is costly to obtain, especially for challenging tasks, such as collision avoidance. In such tasks, generalization at test time demands coverage of many obstacle types and their spatial configurations, which are impractical to acquire purely via data. Recent works ease this burden with training-free guidance by injecting environmental context at inference; however, this only works when paired with a sufficiently diverse training dataset that yields a conditional trajectory distribution with rich multimodal coverage. To remedy this problem, we propose Context-Aware diffusion policy via Proximal mode Expansion (CAPE), a framework that expands trajectory distribution modes with context-aware prior and guidance at inference via a novel prior-seeded iterative guided refinement procedure for motion replanning. The framework generates an initial trajectory plan and executes a short prefix trajectory, and then the remaining trajectory segment is perturbed to an intermediate noise level, forming a context-aware trajectory prior that preserves goal consistency and previously expanded modes. Repeating the process with context-aware guided denoising iteratively expands mode support to allow finding smoother, less collision-prone trajectories. We evaluate CAPE on reaching and pick-and-place tasks in cluttered unseen simulated and real-world settings and show that our proposed approach achieves up to 80% higher success rate and 4x improvement in replanning frequency compared to state-of-the-art, demonstrating better generalization to unseen environments.

Abstract:
Service robots designed to assist elderly people are receiving significant attention since they can improve their quality of life, promote their independence, and provide daily support. These mobile platforms can observe people moving around their homes, recognise dangerous events, and detect them promptly. This paper introduces a novel framework to perform fall detection and people following on board an autonomous legged robotic platform. The system operates on the Unitree Go2 robot and comprises two main building blocks. The first component consists of a Body Landmarks extractor and a Transformer-based network that performs binary classification, distinguishing between Fall behaviours and Activities of Daily Living (ADL). The second component is a target-driven path planner that enables the robot to follow and maintain a full-body view of the target in complex environments. Experiments on public datasets and comparison with state-of-the-art works have been conducted to demonstrate the reliability of both blocks. Real experiments in a cluttered environment have been performed to illustrate how the mobile platform is able to follow people moving around obstacles, detect falls in occluded areas, and predict people's trajectories to maintain a full-body view.

Abstract:
We present a low-cost, autonomous, rotating-camera system that increases the usable data yield for 3D markerless motion capture of animals in uncontrolled outdoor settings. A lightweight detector (YOLOv4-Tiny) locates the subject at 10 Hz; an Extended Kalman Filter bridges sparse detections to a 50 Hz full-state feedback (FSF) controller, keeping the subject centered without a human operator. The 3D reconstruction backend uses existing markerless 2D keypoints and Full Trajectory Estimation (FTE) with a simple rotation compensation for moving cameras. On field videos of a running human and free-running cheetahs, the rotating cameras captured substantially more usable frames than fixed cameras: +52% for the human sequence (6593 vs. 4333 frames) and +135% across cheetah sequences (2419 vs. 1031 frames). Centering also shifts subject pixel distribution toward the image center, which theoretically lowers 2D keypoint error and thus 3D reprojection error for any pose-estimation backend. We detail the EKF design for sparse/noisy detections, the FSF controller with an integral state, and practical deployment considerations. Results show autonomous centering is a simple, deployable lever to scale outdoor animal mocap without changing downstream reconstruction methods.
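
The rate bridging described above (10 Hz detections feeding a 50 Hz controller) can be illustrated with a minimal sketch. It is not the authors' filter: it swaps in a linear constant-velocity Kalman filter on a single image coordinate instead of the Extended Kalman Filter named in the abstract, and the noise values and the function names `predict`, `update`, and `track` are assumptions made for illustration.

```python
import numpy as np

DT = 1.0 / 50.0          # controller period (50 Hz, from the abstract)
DETECT_EVERY = 5         # a 10 Hz detector yields one measurement per 5 control ticks

F = np.array([[1.0, DT], [0.0, 1.0]])   # constant-velocity state transition
H = np.array([[1.0, 0.0]])              # the detector observes position only
Q = np.diag([1e-4, 1e-2])               # process noise (assumed values)
R = np.array([[4.0]])                   # measurement noise variance (assumed value)


def predict(x, P):
    """Propagate the state estimate and covariance by one 50 Hz tick."""
    return F @ x, F @ P @ F.T + Q


def update(x, P, z):
    """Correct the estimate with one detector measurement z (shape (1,))."""
    y = z - H @ x                        # innovation
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    return x + K @ y, (np.eye(2) - K @ H) @ P


def track(measurements, n_ticks):
    """Predict at every tick; update only on ticks where a detection arrives."""
    x = np.array([measurements[0], 0.0])  # state: [position, velocity]
    P = np.eye(2) * 10.0
    estimates = []
    for k in range(n_ticks):
        x, P = predict(x, P)
        idx = k // DETECT_EVERY
        if k % DETECT_EVERY == 0 and idx < len(measurements):
            x, P = update(x, P, np.array([measurements[idx]]))
        estimates.append(x[0])
    return np.array(estimates)
```

Between detections the controller would consume the predicted position, which is why the filter can run at the control rate even though measurements arrive five times less often.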

Abstract:
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in semantic understanding and common sense reasoning, making them promising candidates for solving planning problems in autonomous driving. However, the next-token text prediction objectives traditionally used in pre-training and supervised fine-tuning (SFT) of MLLMs may fall short of fulfilling the planning objectives for autonomous vehicles. The next-token prediction objective merely encourages per-token imitation in text, often irrespective of multi-step consequences and the alignment with crucial planning considerations such as giving space to other road actors. To overcome these limitations, we propose a reinforcement learning fine-tuning (RLFT) approach, MAGNIFIED, that aligns the MLLM-based driving agent with planning objectives by learning from token-level rewards. By mapping a sequence of predicted tokens to corresponding vehicle trajectories and learning from planning rewards, MAGNIFIED optimizes for the true planning objectives rather than focusing solely on token prediction accuracy, enabling the model to refine its understanding of the planning task beyond simple imitation. We validate our approach on the Waymo Open Motion Dataset with a novel setup incorporating rasterized bird's-eye views and tokenized trajectories as inputs and planning-oriented outputs. An initial SFT phase establishes a strong baseline in outputting plan trajectories as sequences of X-Y coordinates in text, while subsequent RL fine-tuning substantially enhances planning performance relative to the SFT baseline (demonstrating over a 10.5% reduction in overlap rate and a 38.9% reduction in off-road rate), underscoring the potential of RLFT on MLLMs to achieve vehicle planning that is better aligned with compliant, comfortable, and efficient driving.

Abstract:
Learning the dynamics of deformable objects, such as dough or a sponge, from RGB-D videos is challenging due to insufficient visual cues and complex deformations. We introduce PLOP (Particle Filtering for Learning Object Physics), a novel framework to learn the dynamics model of deformable objects using a particle filter over 3D Gaussians. Our method learns (1) a dynamics function to predict the object state at the next time step and (2) a resampling function to split and merge Gaussians to handle complex deformations such as cutting. Within PLOP, we propose I2N (Implicit Particle Interaction Network), a dynamics model that leverages a mixed particle-grid representation inspired by the Material Point Method (MPM). By transferring particle features to grid nodes, solving for grid dynamics, and then projecting solutions back to particles, our approach avoids explicit pairwise interaction reasoning between particles and significantly reduces computational cost when the number of particles is large. While PLOP is applicable to general robot-object interactions, we evaluate it on cutting sequences in both simulation and the real world, which induce challenging topological changes and expose previously occluded surfaces. On these benchmarks, PLOP achieves a 53.15% improvement in 3D reconstruction accuracy and a 6.84% improvement in 2D reconstruction accuracy on the simulation benchmark, as well as 28.41% and 24.45% improvements in 3D and 2D reconstruction metrics, respectively, on the real-world dataset.

Abstract:
Preference-based reinforcement learning (RL) offers a promising approach for aligning policies with human intent but is often constrained by the high cost of human feedback. In this work, we introduce ROVED, a framework that integrates Vision-Language Models (VLMs) with selective human feedback to significantly reduce annotation requirements while maintaining performance. Our method leverages VLMs to generate initial preference labels, which are then filtered to identify uncertain cases for targeted human annotation. Additionally, we adapt VLMs using a self-supervised inverse dynamics loss to improve alignment with evolving policies. Experiments on Meta-World manipulation tasks demonstrate that ROVED achieves comparable or superior success rates to state-of-the-art methods while using up to 3x fewer human annotations. Furthermore, we show that adapted VLMs enable efficient knowledge transfer across tasks, further minimizing feedback needs. Our results highlight the potential of combining VLMs with selective human supervision to make preference-based RL more scalable and practical.

Abstract:
Unmanned aerial vehicles (UAVs) have emerged as a promising auxiliary platform for smart agriculture, capable of simultaneously performing weed detection, recognition, and data collection from wireless sensors. However, trajectory planning for UAV-based smart agriculture is challenging due to the high uncertainty of the environment, partial observations, and limited battery capacity of UAVs. To address these issues, we formulate the trajectory planning problem as a Markov decision process (MDP) and leverage multi-agent reinforcement learning (MARL) to solve it. Furthermore, we propose a novel imitation-based triple deep Q-network (ITDQN) algorithm, which employs an elite imitation mechanism to reduce exploration costs and utilizes a mediator Q-network over a double deep Q-network (DDQN) to accelerate and stabilize training and improve performance. Experimental results in both simulated and real-world environments demonstrate the effectiveness of our solution. Moreover, our proposed ITDQN outperforms DDQN by 4.43% in weed recognition rate and 6.94% in data collection rate.
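
For context on the DDQN baseline mentioned above, the following is a minimal sketch of the standard double deep Q-network target computation. It does not attempt to reproduce ITDQN's mediator Q-network or elite imitation mechanism, which the abstract does not specify; the function name `ddqn_targets` and the batch layout are assumptions.

```python
import numpy as np


def ddqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Standard double-DQN bootstrap targets for a batch of transitions.

    rewards, dones: arrays of shape (batch,)
    q_online_next, q_target_next: arrays of shape (batch, n_actions) holding
    Q-values of the next state from the online and target networks.
    """
    # The online network selects the action ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... and the target network evaluates it, which reduces overestimation.
    next_values = q_target_next[np.arange(len(best_actions)), best_actions]
    # Terminal transitions (dones == 1) do not bootstrap.
    return rewards + gamma * (1.0 - dones) * next_values
```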

Abstract:
One of the central challenges preventing robots from acquiring complex manipulation skills is the prohibitive cost of collecting large-scale robot demonstrations. In contrast, humans are able to learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation capturing the essential spatio-temporal manipulator-object interactions, invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation self-supervised from unlabeled large-scale video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction video data, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline. We demonstrate through extensive experiments on the CALVIN benchmark and real-world tasks that ViSA-Flow achieves state-of-the-art performance, particularly in low-data regimes, outperforming prior methods by effectively transferring knowledge from human video observation to robotic execution. Videos are available at https://visaflow-web.github.io/ViSAFLOW.

Abstract:
Underwater Remotely Operated Vehicles (ROVs) exchange data with a control station via a communication cable. One or more intermediate robots can be placed along this tether to manage its shape and minimize the mechanical effects on the ROV. This work deals with the localization of a pair of underwater robots connected by a tether, in a previously unknown environment. While each robot can estimate its trajectory and a model of its surroundings using Simultaneous Localization And Mapping (SLAM) algorithms, aligning these observations in the same reference frame requires inter-robot data association. In this work, we introduce T-CoLoc, a new method for aligning model frames that leverages an estimation of the tether shape to align individual robot observations. An experimental validation in a pool demonstrates that T-CoLoc can align the trajectories of the two robots in the same reference frame with an error lower than 20 cm using the noisy shape estimation of a 3 m long tether.

Abstract:
Autonomous Valet Parking (AVP) requires planning under partial observability, where parking spot availability evolves as dynamic agents enter and exit spots. Existing approaches either rely only on instantaneous spot availability or make static assumptions, thereby limiting foresight and adaptability. We propose an approach that estimates the probability of future spot occupancy by distinguishing initially vacant and occupied spots while leveraging nearby dynamic agent motion. We propose a probabilistic estimator that integrates partial, noisy observations from a limited Field-of-View with the evolving uncertainty of unobserved spots. Coupled with the estimator, we design a strategy planner that balances goal-directed parking maneuvers with exploratory navigation based on information gain, and incorporates wait-and-go behaviors at promising spots. Through randomized simulations emulating large parking lots, we demonstrate that our framework significantly improves parking efficiency and trajectory smoothness over existing approaches, while maintaining safety margins. Simulation videos: https://sites.google.com/view/avp-hri/home.

Abstract:
We present Actron3D, a framework that enables robots to acquire transferable 6-DoF manipulation skills from monocular, uncalibrated, RGB-only human demonstration videos. Our key idea is to represent manipulation knowledge within a video as a continuous neural function over object space. At the core of Actron3D lies the Neural Affordance Function, which distills geometry, visual features, contact priors, and action flows from diverse demonstration videos into a compact 3D neural representation. During deployment, we adopt a hierarchical pipeline that retrieves the matched affordance function and transfers encoded manipulation knowledge to novel objects through coarse-to-fine differentiable optimization. Leveraging the continuous nature of the Neural Affordance Function, the framework performs spatial queries over multimodal features to align demonstrations with observations and generates a precise 6-DoF manipulation policy. Experiments in both simulation and the real world demonstrate that Actron3D significantly outperforms prior methods, achieving a 14.9 percentage point improvement in the average success rate across 13 tasks while requiring only 23 demonstration videos per task.

Abstract:
Friction in compact, geared actuators remains a primary barrier to transparency in upper-limb exoskeletons, especially near zero velocity and during frequent reversals. A momentum-based estimation framework is developed and evaluated on a two-DoF active device (modified EDUExo), where joint friction is recovered from on-board joint measurements and fitted to Coulomb-viscous and Stribeck laws. Two estimators are compared under identical conditions: a first-order momentum observer (FO) and a second-order sliding-mode momentum observer (SOSML). Three velocity trajectories are designed to probe complementary behaviors. In simulation, SOSML adheres more closely to the S-shaped friction law and preserves loop symmetry under encoder noise; parameter variance and robustness under structured model mismatch are likewise improved relative to FO. The results indicate that SOSML delivers lower lag, cleaner noise profiles, and reduced parameter drift without changing the signal set or adding sensors, thereby strengthening friction identification and compensation on compact, gear-reduced actuators.
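
The two friction laws named above can be written down compactly. The sketch below uses common textbook forms; the squared-exponential Stribeck term and all parameter names (`f_c`, `f_s`, `v_s`, `f_v`) are assumptions, since the abstract does not state the exact parameterization. The least-squares fit covers only the Coulomb-viscous law, which is linear in its parameters.

```python
import numpy as np


def coulomb_viscous(v, f_c, f_v):
    """Coulomb level f_c plus viscous slope f_v, as a function of joint velocity v."""
    return f_c * np.sign(v) + f_v * v


def stribeck(v, f_c, f_s, v_s, f_v):
    """Stribeck law: static level f_s decays to Coulomb level f_c over velocity scale v_s."""
    return (f_c + (f_s - f_c) * np.exp(-(v / v_s) ** 2)) * np.sign(v) + f_v * v


def fit_coulomb_viscous(v, friction):
    """Least-squares estimate of [f_c, f_v] from velocity and friction samples."""
    A = np.column_stack([np.sign(v), v])
    params, *_ = np.linalg.lstsq(A, friction, rcond=None)
    return params
```

Fitting the Stribeck law would need a nonlinear solver because `v_s` enters through the exponential; that step is left out here.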

Abstract:
While Unmanned Aerial Vehicles (UAVs) have gained significant traction across various fields, path planning in 3D environments remains a critical challenge, particularly under size, weight, and power (SWAP) constraints. Traditional modular planning systems often introduce latency and suboptimal performance due to limited information sharing and local minima issues. End-to-end learning approaches streamline the pipeline by mapping sensory observations directly to actions but require large-scale datasets, face significant sim-to-real gaps, or lack dynamical feasibility. In this paper, we propose a self-supervised UAV trajectory planning pipeline that integrates a learning-based depth perception with differentiable trajectory optimization. A 3D cost map guides UAV behavior without expert demonstrations or human labels. Additionally, we incorporate a neural network-based time allocation strategy to improve the efficiency and optimality. The system thus combines robust learning-based perception with reliable physics-based optimization for improved generalizability and interpretability. Both simulation and real-world experiments validate our approach across various environments, demonstrating its effectiveness and robustness. Our method achieves a 30.90% reduction in control effort while maintaining competitive tracking performance compared with state-of-the-art.

Abstract:
Solving complex, long-horizon robotic manipulation tasks requires a deep understanding of physical interactions, reasoning about their long-term consequences, and precise high-level planning. Vision-Language Models (VLMs) offer a general perceive-reason-act framework for this goal. However, previous approaches using reflective planning to guide VLMs in correcting actions encounter significant limitations. These methods rely on inefficient and often inaccurate implicit learning of state-values from noisy foresight predictions, evaluate only a single greedy future, and suffer from substantial inference latency. To address these limitations, we propose a novel test-time computation framework that decouples state evaluation from action generation. This provides a more direct and fine-grained supervisory signal for robust decision-making. Our method explicitly models the advantage of an action plan, quantified by its reduction in distance to the goal, and uses a scalable critic to estimate it. To address the stochastic nature of single-trajectory evaluation, we employ beam search to explore multiple future paths and aggregate them during decoding to model their expected long-term returns, leading to more robust action generation. Additionally, we introduce a lightweight, confidence-based trigger that allows for early exit when direct predictions are reliable, invoking reflection only when necessary. Extensive experiments on diverse, unseen multi-stage robotic manipulation tasks demonstrate a 24.6% improvement in success rate over state-of-the-art baselines, while significantly reducing inference time by 56.5%.

Abstract:
This paper presents a high-efficiency, CPU-only volumetric mapping framework based on a Truncated Signed Distance Field (TSDF). The system incrementally fuses raw LiDAR point-cloud data into a voxel grid using a directional bitmask-based integration scheme, producing dense and consistent TSDF representations suitable for real-time 3D reconstruction. A key feature of the approach is that the processing time per point cloud remains constant, regardless of the voxel grid resolution, enabling high-resolution mapping without sacrificing runtime performance. In contrast to most recent TSDF/ESDF methods that rely on GPU acceleration, our method operates entirely on CPU, achieving competitive results in speed. Experiments on real-world open datasets demonstrate that the generated maps attain accuracy on par with contemporary mapping techniques.

Abstract:
Robotic manipulation requires policies that are smooth and responsive to evolving observations. However, synchronous inference in the raw action space introduces several challenges, including intra-chunk jitter, inter-chunk discontinuities, and stop-and-go execution. These issues undermine a policy's smoothness and its responsiveness to environmental changes. We propose ABPolicy, an asynchronous flow-matching policy that operates in a B-spline control-point action space. First, the B-spline representation ensures intra-chunk smoothness. Second, we introduce bidirectional action prediction coupled with refitting optimization to enforce inter-chunk continuity. Finally, by leveraging asynchronous inference, ABPolicy delivers real-time, continuous updates. We evaluate ABPolicy across seven tasks encompassing both static settings and dynamic settings with moving objects. Empirical results indicate that ABPolicy reduces trajectory jerk, leading to smoother motion and improved performance. Project website: https://teee000.github.io/ABPolicy/.
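
To make the idea of a B-spline control-point action space concrete, here is a minimal sketch that turns a short sequence of control points into a dense, smooth action trajectory. It assumes a uniform cubic B-spline; the abstract does not state the spline degree or knot layout, and `eval_bspline` is a hypothetical helper, not part of ABPolicy.

```python
import numpy as np

# Basis matrix of the uniform cubic B-spline (standard form, scaled by 1/6).
BASIS = np.array([[-1.0, 3.0, -3.0, 1.0],
                  [3.0, -6.0, 3.0, 0.0],
                  [-3.0, 0.0, 3.0, 0.0],
                  [1.0, 4.0, 1.0, 0.0]]) / 6.0


def eval_bspline(control_points, samples_per_segment=10):
    """Sample a uniform cubic B-spline defined by control points of shape (n, dim)."""
    ctrl = np.asarray(control_points, dtype=float)
    if ctrl.ndim != 2 or ctrl.shape[0] < 4:
        raise ValueError("need at least 4 control points of shape (n, dim)")
    u = np.linspace(0.0, 1.0, samples_per_segment, endpoint=False)
    powers = np.stack([u ** 3, u ** 2, u, np.ones_like(u)], axis=1)
    # Each consecutive window of 4 control points defines one curve segment.
    segments = [powers @ BASIS @ ctrl[i:i + 4] for i in range(ctrl.shape[0] - 3)]
    return np.concatenate(segments, axis=0)
```

Because neighbouring segments share three of their four control points, position, velocity, and acceleration are continuous across segment boundaries, which is the property that makes a control-point parameterization attractive for reducing within-chunk jitter.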

Abstract:
In lunar and planetary exploration, legged robots have attracted significant attention as an alternative to conventional wheeled robots, which struggle to traverse rough and uneven terrain. To enable locomotion over highly irregular and steeply inclined surfaces, limbed climbing robots equipped with grippers on their feet have emerged as a promising solution. In this paper, we present LIMBERO, a 10 kg-class quadrupedal climbing robot that employs spine-type grippers for stable locomotion and climbing on rugged and steep terrain. We first introduce a novel gripper design featuring coupled finger-closing and spine-hooking motions, tightly actuated by a single motor, which achieves exceptional grasping performance (>150 N) despite its lightweight design (525 g). Furthermore, we develop an efficient algorithm to visualize a geometry-based graspability index on continuous rough terrain. Finally, we integrate these components into LIMBERO and demonstrate its ability to ascend steep rocky surfaces under a 1 G gravity condition, a performance not previously achieved for limbed climbing robots of this scale.

Abstract:
Perception systems provide a rich understanding of the environment for autonomous systems, shaping decisions in all downstream modules. Hence, accurate detection and isolation of faults in perception systems is important. Faults in perception systems pose particular challenges: faults are often tied to the perceptual context of the environment, and errors in their multi-stage pipelines can propagate across modules. To address this, we adopt a counterfactual reasoning approach to propose a framework for fault detection and isolation (FDI) in perception systems. As opposed to relying on physical redundancy (i.e., having extra sensors), our approach utilizes analytical redundancy with counterfactual reasoning to construct perception reliability tests as causal outcomes influenced by system states and fault scenarios. Counterfactual reasoning generates reliability test results under hypothesized faults to update the belief over fault hypotheses. We derive both passive and active FDI methods. While the passive FDI can be achieved by belief updates, the active FDI approach is defined as a causal bandit problem, where we utilize Monte Carlo Tree Search (MCTS) with upper confidence bound (UCB) to find control inputs that maximize a detection and isolation metric, designated as Effective Information (EI). The mentioned metric quantifies the informativeness of control inputs for FDI. We demonstrate the approach in a robot exploration scenario, where a space robot performing vision-based navigation actively adjusts its attitude to increase EI and correctly isolate faults caused by sensor damage, dynamic scenes, and perceptual degradation.

Abstract:
Robotic inspection tasks often require constructing high-quality 3D models of objects from a minimal number of views. Traditional next-best view planning (NBVP) approaches incrementally select view poses but fail to account for global optimality of the inspection trajectory, thus leading to inefficient inspection paths. Recent one-shot view planning (OSVP) methods address this challenge by predicting informative view poses from an initial observation. While subsequent improvements on the pioneering OSVP approach attempt to improve prediction accuracy, they can still fail when faced with out-of-distribution (OoD) examples. With recent advances in generative modeling, OSVP methods can infer a plausible object shape from one observation and then derive the corresponding solution set of view poses. However, because the predicted shape may deviate from the true geometry, these methods can still generate infeasible views. To overcome these limitations, we propose a novel OSVP framework that leverages RGB-D data to generate geometric priors and incorporates online video-based reconstruction. Our method formulates viewpoint selection and path optimization, so that both the calculated poses and the connecting trajectories satisfy visibility constraints, maintain smoothness, and can be locally replanned to compensate for discrepancies between predicted and real object geometries. We validate our OSVP approach through simulation benchmarks against state-of-the-art OSVP techniques and demonstrate its effectiveness on a real Franka Emika manipulator.

Abstract:
We present a unified framework for solving trajectory optimization problems in a derivative-free manner through the use of sequential convex programming. Traditionally, nonconvex optimization problems are solved by forming and solving a sequence of convex optimization problems, where the cost and constraint functions are approximated locally through Taylor series expansions. This presents a challenge for functions where differentiation is expensive or unavailable. In this work, we present a derivative-free approach to form these convex approximations by computing samples of the dynamics, cost, and constraint functions and letting the solver interpolate between them. Our framework includes sample-based trajectory optimization techniques like model-predictive path integral (MPPI) control as a special case and generalizes them to enable features like multiple shooting and general equality and inequality constraints that are traditionally associated with derivative-based sequential convex programming methods. The resulting framework is simple, flexible, and capable of solving a wide variety of practical motion planning and control problems.
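
The abstract names model-predictive path integral (MPPI) control as a special case of the framework. As a point of reference, a minimal sketch of a basic MPPI-style update is given below; it is not the paper's sequential convex programming method. The toy single-integrator model, the cost, and every name and default value (`rollout_cost`, `mppi_step`, `optimize`, `sigma`, `temperature`) are assumptions chosen for illustration.

```python
import numpy as np


def rollout_cost(controls, dt=0.1, goal=1.0):
    """Tracking cost of a toy 1-D single integrator x[t+1] = x[t] + dt * u[t]."""
    positions = dt * np.cumsum(controls)
    return float(np.sum((positions - goal) ** 2))


def mppi_step(nominal, cost_fn, rng, n_samples=256, sigma=0.5, temperature=1.0):
    """One derivative-free update: perturb, evaluate, and softmin-average the noise."""
    noise = rng.normal(0.0, sigma, size=(n_samples,) + nominal.shape)
    costs = np.array([cost_fn(nominal + eps) for eps in noise])
    # Subtract the minimum cost before exponentiating for numerical stability.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return nominal + np.tensordot(weights, noise, axes=1)


def optimize(horizon=20, n_iters=30, seed=0):
    """Iterate the update from an all-zero control sequence."""
    rng = np.random.default_rng(seed)
    controls = np.zeros(horizon)
    for _ in range(n_iters):
        controls = mppi_step(controls, rollout_cost, rng)
    return controls
```

Only cost evaluations of sampled control sequences are used; no gradient of the dynamics or cost is ever computed, which is the property the abstract generalizes to constrained, multiple-shooting formulations.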

Abstract:
Bimanual manipulation is a fundamental robotic skill that requires continuous and precise coordination between two arms. While imitation learning (IL) is the dominant paradigm for acquiring this capability, existing approaches, whether robot-centric or object-centric, often overlook the dynamic geometric relationship among the two arms and the manipulated object. This limitation frequently leads to inter-arm collisions, unstable grasps, and degraded performance in complex tasks. To address this, in this paper we explicitly model the Robot-Object Triadic Interaction (RoTri) representation in bimanual systems, by encoding the relative 6D poses between the two arms and the object to capture their spatial triadic relationship and establish continuous triangular geometric constraints. Building on this, we further introduce RoTri-Diff, a diffusion-based imitation learning framework that combines RoTri constraints with robot keyposes and object motion in a hierarchical diffusion process. This enables the generation of stable, coordinated trajectories and robust execution across different modes of bimanual manipulation. Extensive experiments show that our approach outperforms state-of-the-art baselines by 10.2% on 11 representative RLBench2 tasks and achieves stable performance on 4 challenging real-world bimanual tasks. Project website: https://rotri-diff.github.io/.
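
The relative 6D poses among the two arms and the object can be sketched with homogeneous transforms. This is an illustrative reading of the triadic relationship only; the actual RoTri encoding is not specified in the abstract, and `make_pose`, `relative_pose`, and `triadic_relations` are hypothetical helpers.

```python
import numpy as np


def make_pose(yaw, translation):
    """Build a 4x4 homogeneous transform from a yaw angle and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def relative_pose(T_world_a, T_world_b):
    """Pose of frame b expressed in frame a."""
    return np.linalg.inv(T_world_a) @ T_world_b


def triadic_relations(T_left, T_right, T_obj):
    """The three pairwise relative poses among left arm, right arm, and object."""
    return {
        "left_to_right": relative_pose(T_left, T_right),
        "left_to_object": relative_pose(T_left, T_obj),
        "right_to_object": relative_pose(T_right, T_obj),
    }
```

The three relative poses are not independent: composing left-to-right with right-to-object must reproduce left-to-object, which is one way to read the triangular geometric constraint the abstract refers to.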

Abstract:
Large behaviour models have transformed the field of robotic manipulation, but prohibitive data requirements have thus far prevented a revolution similar to vision language models. We believe that instrumentation, i.e., sensor integration in objects, can provide invaluable state information and enable efficient learning for robotic manipulation. In this paper, we present instrumented imitation learning of clothes hanger insertion. Using 180 teleoperated demonstrations, we train diffusion policies with and without access to instrumentation data. Results show that policies leveraging instrumentation outperform vision-only counterparts by 14-25 %pt and exhibit greater task awareness. Crucially, a black-box imitation learning policy learns to prioritise instrumentation signals without explicit guidance. In addition, enhancing the teleoperation dataset with rollouts from an instrumented expert policy enables a vision-only student policy to achieve performance comparable to the instrumented expert, thereby surpassing the original vision-only policy. These findings establish instrumentation as a promising strategy to enhance imitation learning for robotic manipulation. Datasets are available on Zenodo [10.5281/zenodo.17122216].

Abstract:
Large Vision-Language Models (VLMs) have demonstrated potential in enhancing mobile robot navigation in human-centric environments by understanding contextual cues, human intentions, and social dynamics while exhibiting reasoning capabilities. However, their computational complexity and limited sensitivity to continuous numerical data impede real-time performance and precise motion control. To this end, we propose Narrate2Nav, a real-time vision-action model that leverages a self-supervised learning framework based on the Barlow Twins redundancy reduction loss to embed implicit natural language reasoning, social cues, and human intentions within a visual encoder. The model combines RGB inputs, motion commands, and textual signals of scene context during training to bridge from robot observations to low-level motion commands for short-horizon point-goal navigation during deployment. Extensive evaluation of Narrate2Nav across diverse and challenging scenarios in an unseen offline dataset, complemented by a small-scale real-world experiment, demonstrates a 52.94% improvement over the next best baseline in offline testing, with consistent gains observed in real-world evaluations.
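
The Barlow Twins redundancy reduction loss mentioned above has a simple closed form: it pushes the cross-correlation matrix of two embedding batches toward the identity. The sketch below is a generic NumPy version of that loss, not Narrate2Nav's training code; which two embeddings are paired (for example a visual embedding and a text embedding) and the weight `lambda_offdiag` are assumptions.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-9):
    """Barlow Twins loss for two embedding batches of shape (batch, dim)."""
    # Standardize each feature dimension over the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (dim, dim).
    c = z_a.T @ z_b / z_a.shape[0]
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)          # invariance term
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # redundancy reduction term
    return on_diag + lambda_offdiag * off_diag
```

The diagonal term rewards the two views for agreeing feature by feature, while the off-diagonal term discourages different feature dimensions from carrying the same information.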

Abstract:
We present TreeIRL, a novel planner for autonomous driving that combines Monte Carlo tree search (MCTS) and inverse reinforcement learning (IRL) to achieve state-of-the-art performance in simulation and in real-world driving. The key idea is to use MCTS to find a promising set of safe candidate trajectories and a deep scoring function trained with IRL to select the most human-like among them. We evaluate TreeIRL against classical and state-of-the-art planners on large-scale simulations and on 500+ miles of real-world autonomous driving in the Las Vegas metropolitan area. Scenarios include navigating heavy urban traffic, adaptive cruise control, cut-ins, and traffic lights. TreeIRL achieves the best overall performance, striking a balance between safety, progress, comfort, and human-likeness. To the best of our knowledge, our work is the first public-road demonstration of MCTS-based planning and underscores the importance of evaluating planners across a diverse set of metrics and in real-world environments. TreeIRL is highly extensible and could be further improved with reinforcement learning and imitation learning, providing a framework for exploring different combinations of classical and learning-based approaches to solve the planning bottleneck in autonomous driving.

Abstract:
Shared micromobility services such as e-scooters and bikes have become an integral part of urban transportation, yet their efficiency critically depends on effective vehicle rebalancing. Existing methods either optimize for average demand patterns or employ robust optimization and reinforcement learning to handle predefined uncertainties. However, these approaches overlook emergent events (e.g., demand surges, vehicle outages, regulatory interventions) or sacrifice performance in normal conditions. We introduce AMPLIFY, an LLM-augmented policy adaptation framework for shared micromobility rebalancing. The framework combines a baseline rebalancing module with an LLM-based adaptation module that adjusts strategies in real time under emergent scenarios. The adaptation module ingests system context, demand predictions, and baseline strategies, and refines adjustments through self-reflection. Evaluations on real-world e-scooter data from Chicago show that our approach improves demand satisfaction and system revenue compared to baseline policies, highlighting the potential of LLM-driven adaptation as a flexible solution for managing uncertainty in micromobility systems.

Abstract:
We introduce HoMeR, an imitation learning framework for mobile manipulation that combines whole-body control with hybrid action modes that handle both long-range and fine-grained motion, enabling effective performance on realistic in-the-wild tasks. At its core is a fast, kinematics-based whole-body controller that maps desired end-effector poses to coordinated motion across the mobile base and arm. Within this reduced end-effector action space, HoMeR learns to switch between absolute pose predictions for long-range movement and relative pose predictions for fine-grained manipulation, offloading low-level coordination to the controller and focusing learning on task-level decisions. We deploy HoMeR on a holonomic mobile manipulator with a 7-DoF arm in a real home. We compare HoMeR to baselines without hybrid actions or whole-body control across 3 simulated and 3 real household tasks such as opening cabinets, sweeping trash, and rearranging pillows. Across tasks, HoMeR achieves an overall success rate of 79.17% using just 20 demonstrations per task, outperforming the next best baseline by 29.17 percentage points on average. HoMeR is also compatible with vision-language models and can leverage their internet-scale priors to better generalize to novel object appearances, layouts, and cluttered scenes. In summary, HoMeR moves beyond tabletop settings and demonstrates a scalable path toward sample-efficient, generalizable manipulation in everyday indoor spaces. Code, videos, and supplementary material are available at: https://homer-manip.github.io/.

Abstract:
Modularity in robots enhances versatility, enabling shape morphing and reconfiguration. In modular soft robots, the use of soft materials allows dimensional transformations across different architectures - from chains (1D) to lattices (2D) and spheres (3D). All this enables a swarm of robots to exhibit multi-modal locomotion - such as millipede-like, starfish-like, and soccer-ball-like movement patterns. However, achieving such reconfiguration remains challenging, especially in soft robots, where docking is difficult to realize without compromising compliance. Conventional approaches - such as rigid inserts, magnetic actuators, and adhesives - face challenges due to rigid-soft fabrication mismatch, interference with body compliance, and limited holding strength. To address these challenges, this work proposes a geometric, active shape-morphing docking mechanism for spherically reconfigurable soft robots that combines concepts of topology design and mechanical metamaterials. The robot module edges are designed to create geometric interlocks between adjacent edges (similar to jigsaw puzzle pieces) with an internal structure that deforms under actuation by inlaid shape memory alloy (SMA) wires. The metamaterial internal structure is obtained through inverse design optimization of a computational deformation model created in Abaqus CAE. The constraint-aware optimization strategy blends random search and genetic algorithm features to handle a large number of bounded variables and a nonlinear objective function, driving convergence toward a global minimum via geometric decay of the search space. The resulting optimal geometry is designed to buckle under high localized forces, enabling docking and undocking, while remaining minimally deformed under distributed forces, thereby passively maintaining coupling during operation. The docking mechanism is experimentally validated by confirming that the deformation achieved under actuation can facilitate the docking operation and th

Abstract:
Vision-Language-Action (VLA) models have shown a strong capability in enabling robots to execute general instructions, yet they struggle with contact-rich manipulation tasks, where success requires precise alignment, stable contact maintenance, and effective handling of deformable objects. A fundamental challenge arises from the imbalance between high-entropy vision and language inputs and low-entropy but critical force signals, which often leads to over-reliance on perception and unstable control. To address this, we introduce CRAFT, a force-aware curriculum fine-tuning framework that integrates a variational information bottleneck module to regulate vision and language embeddings during early training. This curriculum strategy encourages the model to prioritize force signals initially, before progressively restoring access to the full multimodal information. To enable force-aware learning, we further design a homologous leader-follower teleoperation system that collects synchronized vision, language, and force data across diverse contact-rich tasks. Real-world experiments demonstrate that CRAFT consistently improves task success, generalizes to unseen objects and novel task variations, and adapts effectively across diverse VLA architectures, enabling robust and generalizable contact-rich manipulation.
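
One way to picture a bottleneck that is strong early in training and relaxed later is the sketch below. It pairs the standard closed-form KL term for a diagonal Gaussian against a unit Gaussian (the usual variational information bottleneck penalty) with a linearly decaying weight. The linear schedule, the `beta_max` parameter, and all function names are hypothetical; the abstract does not specify CRAFT's actual schedule or loss.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def bottleneck_weight(step, total_steps, beta_max=1.0):
    """Hypothetical linear curriculum: full bottleneck at step 0, none at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return beta_max * (1.0 - progress)


def regularized_loss(task_loss, mu, log_var, step, total_steps):
    """Task loss plus a KL penalty on the embedding whose weight decays over training."""
    beta = bottleneck_weight(step, total_steps)
    return task_loss + beta * kl_to_standard_normal(mu, log_var)
```

A large early weight compresses the regulated embedding toward the prior, so it carries little information; as the weight decays, the embedding is free to carry more.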

Abstract:
Gaussian Splatting has significantly improved the quality of novel view synthesis with explicit Gaussian representation. However, we observed that existing 3D Gaussian Splatting methods (3DGS) often suffer from surface collapse issues on reflective regions, and thus produce inferior geometry and low-quality specular. In this work, we propose a physically-based deferred rendering framework, named Reflection-aware Gaussian Splatting (RGS), that can accurately model specular regions and improve novel view synthesis performance. Specifically, we found that a powerful 3D foundation model can provide a strong 3D geometric prior to foster correct geometric modeling. Based on this, we propose a cross-view shape consistency regularization to regularize the geometry surface with the large model prior and cross-view constraints. In this manner, our RGS can produce smoother geometric surfaces on reflective regions while reducing geometric hollows. To further improve rendering results on reflective regions, we present a reflection-aware densification strategy that is designed to capture specular variations across various views. With this strategy, our RGS is able to render novel views of objects in higher quality. Extensive experiments demonstrate our method consistently renders high-quality reflective objects, achieving state-of-the-art performance.

Abstract:
Urban Air Mobility emerges as a transformative mode of transportation, but its integration into complex low-altitude urban environments requires systematic consideration of safety and efficiency. This study aims to develop a computational framework that enables structured traffic organization while accounting for spatially variant risks. The framework introduces a dual-field environmental model that couples a traversability field, which quantifies continuous anisotropic risk, with a scalar potential field, which encodes macroscopic traffic flow. The path planning formulation computes geodesics under an anisotropic metric derived from the dual-field, and the centralized coordination mechanism updates the fields to maintain real-time deconfliction. Simulation results demonstrate that the proposed framework generates paths that reduce exposure to high-risk regions to a negligible level and achieve a substantial reduction in average curvature compared to a baseline planner. Furthermore, the local update mechanism provides significant computational speedup for dynamic real-time scenarios. These results validate the capability of the dual-field framework to unify safety and efficiency in urban airspace management, providing a scalable foundation for future unmanned traffic management systems.

Abstract:
While recent advancements in reinforcement learning have enabled quadrupedal robots to perform non-prehensile manipulation tasks like pushing, existing methods have largely overlooked the critical challenge of obstacle avoidance. In this paper, we address this significant limitation by introducing a novel reinforcement learning (RL) framework that controls a quadrupedal robot to push large objects in cluttered, real-world environments. In particular, obstacle avoidance is integrated as a primary objective directly into the policy training process. To achieve this, we propose to represent the traversable space with a low-dimensional safe corridor, a method that is both computationally efficient and highly effective. This approach avoids the need for complex and resource-intensive training pipelines typically required for processing high-dimensional sensor data. We validate our policy through extensive experiments in both simulation and the real world. The implementation code will be released to benefit the research community.

Abstract:
Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to provide dense spatial representations by assigning semantic labels to individual voxels in 3D space. However, directly applying 3D semantic occupancy prediction to unstructured scenes remains challenging because scene sparsity hinders effective cross-modal fusion and the more severe long-tail distribution in these scenarios further degrades prediction performance. To validate the effectiveness of our approach, we construct a dedicated dataset of unstructured scenes collected from open-pit mines. Based on this, we propose UnsOcc, a multi-modal 3D semantic occupancy prediction framework that improves robustness in unstructured environments. At its core, we introduce a rendering-based fusion module, RenderFusion, which enhances cross-modal feature alignment through bidirectional rendering supervision. Furthermore, we propose GSRefinement, a detail-aware auxiliary supervision method based on Gaussian Splatting that projects sparse 3D occupancy predictions into dense 2D semantic segmentation maps, enabling effective supervision for long-tail categories. Extensive experiments on both the open-pit mine dataset and the nuScenes dataset demonstrate that our method significantly outperforms existing state-of-the-art approaches.

Abstract:
We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river as a representative use case. We propose an energy- and communication-efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, and that increasing the number of agents improves both mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.

Abstract:
Robotic manipulation policies are advancing rapidly, but their direct evaluation in the real world remains costly, time-consuming, and difficult to reproduce, particularly for tasks involving deformable objects. Simulation provides a scalable and systematic alternative, yet existing simulators often fail to capture the coupled visual and physical complexity of soft-body interactions. We present a real-to-sim policy evaluation framework that constructs soft-body digital twins from real-world videos and renders robots, objects, and environments with photorealistic fidelity using 3D Gaussian Splatting. We validate our approach on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing, demonstrating that simulated rollouts correlate strongly with real-world execution performance and reveal key behavioral patterns of learned policies. Our results suggest that combining physics-informed reconstruction with high-quality rendering enables reproducible, scalable, and accurate evaluation of robotic manipulation policies. Website: https://real2sim-eval.github.io/

Abstract:
Household environments present one of the most common, impactful yet challenging application domains for robotics. Within household scenarios, manipulating deformable objects is particularly difficult, both in simulation and real-world execution, due to varied categories and shapes, complex dynamics, and diverse material properties, as well as the lack of reliable deformable-object support in existing simulations. We introduce LeHome, a comprehensive simulation environment designed for deformable object manipulation in household scenarios. LeHome covers a wide spectrum of deformable objects, such as garments and food items, offering high-fidelity dynamics and realistic interactions that existing simulators struggle to simulate accurately. Moreover, LeHome supports multiple robotic embodiments and emphasizes low-cost robots as a core focus, enabling end-to-end evaluation of household tasks on resource-constrained hardware. By bridging the gap between realistic deformable object simulation and practical robotic platforms, LeHome provides a scalable testbed for advancing household robotics. Webpage: lehome-web.github.io/.

Abstract:
In this work, we present a hierarchical framework designed to support robotic inspection under environment uncertainty. By leveraging a known environment model, existing methods plan and safely track inspection routes to visit points of interest. However, discrepancies between the model and actual site conditions, caused by either natural or human activities, can alter the surface morphology or introduce path obstructions. To address this challenge, the proposed framework divides the inspection task into: (a) generating the initial global view-plan for regions of interest based on a historical map and (b) local view replanning to adapt to the current morphology of the inspection scene. The proposed hierarchy preserves global coverage objectives while enabling reactive adaptation to the local surface morphology. This enables the local autonomy to remain robust against environment uncertainty and complete the inspection tasks. We validate the approach through deployments in real-world subterranean mines using a quadrupedal robot. A supplementary video highlighting the proposed method can be found at https://youtu.be/6TxK8S_83Lw.

Abstract:
Achieving efficient and uniform coverage in obstacle-laden unknown environments is essential for autonomous robots in cleaning, inspection and agricultural operations. Unlike most existing approaches that prioritize path length and time optimality, we propose the SHIFT planner framework, which integrates semantic mapping, adaptive coverage planning, and real-time obstacle avoidance to ensure comprehensive coverage across diverse terrains and semantic features. We first develop an innovative Radiant-Field-Informed Coverage Planning (RFICP) algorithm, which generates trajectories that adapt to terrain variations. A Gaussian diffusion field is employed to adaptively adjust the robot's speed, ensuring efficient coverage under varying environmental conditions influenced by semantic attributes. Next, we present a novel incremental KD-tree sliding window optimization (IKD-SWOpt) method to effectively handle unforeseen obstacles. IKD-SWOpt leverages an enhanced A* algorithm guided by the IKD-tree distance field to generate initial local avoidance trajectories. Subsequently, it optimizes trajectory segments within and outside waypoint safety zones by evaluating and refining non-compliant segments using an adaptive sliding window. This method not only reduces computational overhead but also guarantees high-quality real-time obstacle avoidance. Extensive experiments were conducted using drones in simulated environments and robotic vacuum cleaners in real-world settings.

Abstract:
Learning policies for complex humanoid tasks remains both challenging and compelling. Inspired by how infants and athletes rely on external support, such as parental walkers or coach-applied guidance, to acquire skills like walking, dancing, and performing acrobatic flips, we propose A2CF: Adaptive Assistive Curriculum Force for humanoid motion learning. A2CF trains a dual-agent system, in which a dedicated assistive force agent applies state-dependent forces to guide the robot through difficult initial motions and gradually reduces assistance as the robot's proficiency improves. Across three benchmarks (bipedal walking, choreographed dancing, and backflips), A2CF achieves convergence 30% faster than baseline methods, lowers failure rates by over 40%, and ultimately produces robust, support-free policies. Real-world experiments further demonstrate that adaptively applied assistive forces significantly accelerate the acquisition of complex skills in high-dimensional robotic control.

Abstract:
Autonomous mobile robots increasingly rely on LiDAR-IMU odometry for navigation and mapping, yet horizontally mounted LiDARs (e.g., MID360) capture limited near-ground returns, reducing terrain awareness and degrading performance in feature-scarce environments. Prior solutions, such as static tilt, active rotation, or higher-density sensors, either compromise horizontal perception or introduce extra actuation, cost, weight, and power. We introduce PERAL, a perception-aware motion control framework for spherical robots that provides passive LiDAR excitation without dedicated hardware. By modeling the coupling between the internal differential-drive actuation and sensor attitude, PERAL superimposes bounded, non-periodic oscillations onto nominal goal- or trajectory-tracking commands to increase vertical scan diversity while preserving navigation accuracy. Implemented on a compact spherical robot, PERAL is validated in laboratory, corridor, and tactical environments. Experiments show up to 96% map completeness and a 27% reduction in trajectory tracking error (relative to fixed-horizontal baselines), while improving the observability of near-ground targets in the reconstructed map, at lower weight, power, and cost than static tilt and active rotation. Design and code are available at https://github.com/snakehaihai/PERAL_robot_design.

Abstract:
Tool-mediated interactions enable robots to manipulate and explore granular objects, producing informative auditory signals. A central challenge is transferring this perceptual knowledge across different tools and behaviors without costly data collection for each new context. We address this problem in the domain of audio-based recognition of granular and liquid-like objects. In this work, we leverage audio signals from tool-mediated interactions and learn context-agnostic representations for object recognition. We propose two contrastive learning approaches: a shared-object transfer method that performs supervised contrastive learning using audio data, and a zero-shot transfer method that integrates both audio and natural language descriptions of interaction contexts. Experiments on real-world data show that both methods achieve strong object recognition performance in unseen contexts, sometimes matching or exceeding a supervised baseline despite limited target context data. Furthermore, the learned latent spaces exhibit clearly separable clusters by object identity, and the zero-shot method successfully recognizes novel objects, offering a practical solution for robot perception in data-scarce scenarios. The code for this paper is available at: https://github.com/siliu6487/AuditoryKnowledgeTransfer.

Abstract:
This letter introduces SVN-ICP, a novel Iterative Closest Point (ICP) algorithm with uncertainty estimation that leverages Stein Variational Newton (SVN) on manifold. Designed specifically for fusing LiDAR odometry in multisensor systems, the proposed method ensures accurate pose estimation and consistent noise parameter inference, even in LiDAR-degraded environments. By approximating the posterior distribution using particles within the Stein Variational Inference framework, SVN-ICP eliminates the need for explicit noise modeling or manual parameter tuning. To evaluate its effectiveness, we integrate SVN-ICP into a simple error-state Kalman filter alongside an IMU and test it across multiple datasets spanning diverse environments and vehicle platforms. Extensive experimental results demonstrate that our approach outperforms best-in-class methods on challenging scenarios while providing reliable uncertainty estimates. We release our code at https://anonymous.4open.science/r/SVN-ICP-5B77.

Abstract:
This study addresses the challenge of implementing proprioceptive and kinesthetic (PK) feedback in robotic hands, essential for grasping and manipulation tasks in unstructured environments. We developed a compact modular actuator featuring a low-module, high-transmission-ratio multistage gear mechanism that measures 25×10×24 mm, weighs only 10 grams, and maintains moderate backdrivability. The actuator provides multimodal PK feedback, capturing position, velocity, current, and torque data, which are critical for performing various grasping and manipulation tasks. To enable precise motion and force control, we introduced a new adaptive velocity estimator and a simplified Reaction Torque Observer (RTOB). Comprehensive experiments demonstrated the actuator's ability to accurately detect surface shape, roughness, and stiffness of target objects, eliminating the need for additional sensors or space. Experimental results confirmed the actuator's precision, achieving measurement errors of 5.8 mrad for position, 0.19 rad/s for velocity, and 0.011 N·m for torque. These findings highlight the actuator's ability to leverage proprioceptive information, significantly enhancing the functionality and adaptability of robotic hands in diverse and dynamic scenarios.

Abstract:
Active delivery of food to a human mouth in a controlled and safe manner remains a key challenge for robot‑assisted feeding systems (RAFSs). Existing RAFS designs struggle to simultaneously achieve efficiency and safety: rigid manipulators offer fast and accurate motion but risk hazardous contact, while soft robots provide passive compliance at the cost of limited speed or workspace. To meet the specific demands of feeding tasks, we design a tendon-driven continuum robot that allows precise orientation control of the utensil while exhibiting strong passive compliance in position. Integrating it with a 6-DoF rigid robot for fast and long-range positioning, we propose a hybrid RAFS architecture that achieves safe, efficient, and accurate food delivery. Controlling a passive-compliant RAFS to acquire various foods is non‑trivial: physical modeling struggles with complex interactions between the soft robot and food, while typical imitation learning methods lead to discontinuous or distorted movements arising from the passive deformation. To handle this, we design a pose-torque learning policy that enables the soft and rigid robots to generate coherent and synchronized movements, offering a case-specific solution to the long-standing challenge of soft robot imitation learning. Experiments show that our method achieves a food acquisition success rate of 76.7%, while user tests with 14 volunteers confirm user preference, marking our RAFS as a practical step toward safe and efficient robotic feeding.

Abstract:
We present Rad-GS, a 4D radar-camera SLAM system designed for kilometer-scale outdoor environments, utilizing 3D Gaussian as a differentiable spatial representation. Rad-GS combines the advantages of raw radar point cloud with Doppler information and geometrically enhanced point cloud to guide dynamic object masking in synchronized images, thereby alleviating rendering artifacts and improving localization accuracy. Additionally, unsynchronized image frames are leveraged to globally refine the 3D Gaussian representation, enhancing texture consistency and novel view synthesis fidelity. Furthermore, the global octree structure coupled with a targeted Gaussian primitive management strategy further suppresses noise and significantly reduces memory consumption in large-scale environments. Extensive experiments and ablation studies demonstrate that Rad-GS achieves performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs, highlighting the feasibility of robust outdoor mapping using 4D mmWave radar. Real-world reconstruction at kilometer scale validates the potential of Rad-GS for large-scale scene reconstruction.

Abstract:
The human hand plays a vital role in daily life and industrial applications, yet replicating its multifunctional capabilities (including motion, sensing, and coordinated manipulation) with robotic systems remains a formidable challenge. Developing a dexterous robotic hand requires balancing human-like agility with engineering constraints such as complexity, size-to-weight ratio, durability, and force-sensing performance. This letter presents Dex-Hand 021, a high-performance, cable-driven five-finger robotic hand with 12 active and 7 passive degrees of freedom (DoFs), achieving 19-DoF dexterity in a lightweight 1 kg design. We propose a proprioceptive force-sensing-based admittance control method to enhance manipulation. Experimental results demonstrate its superior performance: a single-finger load capacity exceeding 10 N, fingertip repeatability under 0.001 m, and force estimation errors below 0.2 N. Compared to PID control, joint torques in multi-object grasping are reduced by 31.19%, significantly improving force-sensing capability while preventing overload during collisions. The hand excels in both power and precision grasps, successfully executing 33 GRASP taxonomy motions and complex manipulation tasks. This work advances the design of lightweight, industrial-grade dexterous hands and enhances proprioceptive control, contributing to robotic manipulation and intelligent manufacturing.
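
For readers unfamiliar with admittance control, the sketch below shows the textbook single-axis form: an estimated external force drives a virtual mass-spring-damper, and the resulting displacement is used as the position command. It is a generic illustration, not the Dex-Hand 021 controller; the virtual mass, damping, and stiffness values, the time step, and the function names are all assumptions.

```python
def admittance_step(x, v, f_ext, dt, mass=1.0, damping=20.0, stiffness=100.0):
    """One semi-implicit Euler step of the virtual dynamics M*a + D*v + K*x = f_ext.

    x, v   : current commanded displacement (m) and velocity (m/s)
    f_ext  : estimated external force (N)
    The mass, damping, and stiffness defaults are assumed illustrative values.
    """
    a = (f_ext - damping * v - stiffness * x) / mass
    v_next = v + a * dt
    x_next = x + v_next * dt
    return x_next, v_next


def simulate(f_ext, steps=5000, dt=0.001):
    """Apply a constant external force and return the final commanded displacement."""
    x, v = 0.0, 0.0
    for _ in range(steps):
        x, v = admittance_step(x, v, f_ext, dt)
    return x
```

Under a constant force the command settles where the virtual spring balances it, at a displacement of `f_ext / stiffness`, so a lower virtual stiffness makes the finger yield more to contact.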

Abstract:
Robots built from soft materials will inherently apply lower environmental forces than their rigid counterparts, and therefore may be more suitable in sensitive settings with unintended contact. However, these robots' applied forces result from both their design and their control system in closed-loop, and therefore, ensuring bounds on these forces requires controller synthesis for safety as well. This article introduces the first feedback controller for a soft manipulator that formally meets a safety specification with respect to environmental contact. In our proof-of-concept setting, the robot's environment has known geometry and is deformable with a known elastic modulus. Our approach maps a bound on applied forces to a safe set of positions of the robot's tip via predicted deformations of the environment. Then, a quadratic program with Control Barrier Functions in its constraints is used to supervise a nominal feedback signal, verifiably maintaining the robot's tip within this safe set. Hardware experiments on a multi-segment soft pneumatic robot demonstrate that the proposed framework successfully maintains a positive safety margin. This framework represents a fundamental shift in perspective on control and safety for soft robots, implementing a formally verifiable logic specification on their pose and contact forces.
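
The idea of a quadratic program supervising a nominal command can be sketched in its simplest form. With a single barrier constraint, the QP that stays closest to the nominal input has a closed-form solution (a projection onto a half-space). The sketch assumes single-integrator tip dynamics `x_dot = u` and a flat wall as the safe-set boundary; both are simplifications made for illustration, and the function names and the gain `alpha` are hypothetical rather than taken from the article.

```python
def cbf_qp_filter(u_nom, grad_h, h, alpha=5.0):
    """Closed-form solution of: min ||u - u_nom||^2  s.t.  grad_h . u + alpha * h >= 0.

    Assumes x_dot = u, one barrier function h(x) >= 0, and a nonzero grad_h.
    """
    slack = sum(g * u for g, u in zip(grad_h, u_nom)) + alpha * h
    if slack >= 0.0:
        return list(u_nom)  # nominal command already satisfies the constraint
    norm_sq = sum(g * g for g in grad_h)
    scale = -slack / norm_sq  # smallest correction along grad_h
    return [u + scale * g for u, g in zip(u_nom, grad_h)]


def simulate_tip(x_max=1.0, steps=2000, dt=0.001):
    """Drive the tip toward a wall at x = x_max with a filtered constant command."""
    x = [0.0, 0.0]
    for _ in range(steps):
        h = x_max - x[0]  # barrier: distance remaining to the wall
        u = cbf_qp_filter([2.0, 0.5], [-1.0, 0.0], h)
        x = [x[0] + u[0] * dt, x[1] + u[1] * dt]
    return x
```

In this toy setup the filter leaves the command untouched far from the wall and scales back only the component pointing into it, so the tip approaches `x_max` without crossing it.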

Abstract:
Most existing bearing-only formation control methods require that the relative bearings among neighboring agents be measured in a global reference frame known to each individual agent. To remove this constraint, this paper introduces a distributed formation control scheme for quadrotors that uses only bearing measurements in each vehicle's local reference frame. To this end, first, a prescribed-time quaternion-based orientation estimator is proposed for each follower to estimate the leader's orientation without knowledge of the global reference frame. Second, a bearing-only formation control law is developed to achieve the desired maneuvering formation using relative bearings in the local reference frame, wherein a finite-time differentiator is incorporated to remove the need for bearing-rate measurements. The convergence is rigorously proven through mathematical derivations. Both comparative simulations and real-world experiments are conducted to validate the effectiveness of the proposed control scheme.

Abstract:
While soft robots offer advantages in adaptability and safe interaction, their modeling remains challenging. This letter presents a novel, data-driven approach for model order reduction of slender soft robots using autoencoder-parameterized strain within the Geometric Variable Strain (GVS) framework. We employ autoencoders (AEs) to learn low-dimensional strain parameterizations from data to construct reduced-order models (ROMs), preserving the Lagrangian structure of the system while significantly reducing the degrees of freedom. Our comparative analysis demonstrates that AE-based ROMs consistently outperform proper orthogonal decomposition (POD) approaches, achieving lower errors for equivalent degrees of freedom across multiple test cases. Additionally, we demonstrate that our proposed approach achieves computational speed-ups over the high-order models (HOMs) in all cases, and outperforms the POD-based ROM in scenarios where accuracy is matched. We highlight the intrinsic dimensionality discovery capabilities of autoencoders, revealing that HOMs often operate in lower-dimensional nonlinear manifolds. Through both simulation and experimental validation on a cable-actuated soft manipulator, we demonstrate the effectiveness of our approach, achieving near-identical behavior with just a single degree of freedom. This structure-preserving method offers significant reductions in the system degrees of freedom and computational effort while maintaining physical model interpretability, offering a promising direction for soft robot modeling and control.

Abstract:
Haptic devices (HDs) play a vital role in simulating the sense of touch in various virtual environments (VEs). Ensuring stable interaction between the HD and the VE is critical, particularly when simulating stiff virtual objects. One approach to enhancing stability is increasing the sampling rate; however, excessively high rates can compromise velocity information, thereby reducing damping stability. Dual-rate haptic devices address this issue by sampling position at higher rates and velocity at lower rates. This paper presents a novel closed-form equation for predicting the stability boundary of a dual-rate HD without restrictions on time delay or virtual damping. The proposed equation, which depends on the physical parameters of the HD and VE, sampling times, and time delay, is validated through simulations and experiments.

Abstract:
We consider a sequential task and motion planning (TAMP) setting in which a robot is assigned continuous-space rearrangement-style tasks one-at-a-time in an environment that persists between each. Lacking advance knowledge of future tasks, existing (myopic) planning strategies unwittingly introduce side effects that impede completion of subsequent tasks: e.g., by blocking future access or manipulation. We present anticipatory task and motion planning, in which estimates of expected future cost from a learned model inform selection of plans generated by a model-based TAMP planner so as to avoid such side effects, choosing configurations of the environment that both complete the task and reduce overall cost. Simulated many-task deployments in navigation-among-movable-obstacles and cabinet-loading domains yield improvements of 32.7% and 16.7% average per-task cost respectively. When given time in advance to prepare the environment, our learning-augmented planning approach yields improvements of 83.1% and 22.3%. Finally, we also demonstrate anticipatory TAMP on a real-world Fetch mobile manipulator.

Abstract:
Synchrony is a cornerstone for the successful physical interaction between humans while cooperating or competing towards a goal and is achieved by correct and smooth information exchange between subjects. Recently, Human-Robot-Human (HRH) interaction arose as an emerging paradigm for improving motor control in collaborative and dyadic movement tasks. Among the robotic solutions explored for agent coupling, exoskeletons represent powerful tools for exerting torque and force feedback at the joint level. In this work, two identical torque-controlled elbow exoskeletons were used in the context of dyadic interaction, to provide haptic feedback and improve synchrony between two individuals performing a tapping task. Each exoskeleton is lightweight and compact, with a total weight of 0.8 kg on the arm and a volume of 360 × 80 × 80 mm³. Bench tests to verify the performance of closed-loop torque control showed a residual torque below 0.2 Nm when the reference torque was set to null, and a bandwidth higher than 6 Hz, thus achieving adequate performance for applications in HRH scenarios. During human subjects experiments, the root-mean-squared error between the two users' joint trajectories was 50% lower when users received haptic feedback compared to the condition without feedback; similarly, the relative phase error was lower than 60%. The results of this study suggest that exoskeletons can be used to enhance synchrony in HRH interactions, which could potentially be useful in rehabilitation training, collaborative industrial operations or sport and music learning.

Abstract:
This paper considers the problem of designing motion planning algorithms for control-affine systems that generate collision-free paths from an initial to a final destination and can be executed using safe and dynamically-feasible controllers. We introduce the C-CLF-CBF-RRT algorithm, which produces paths with such properties and leverages rapidly exploring random trees (RRTs), control Lyapunov functions (CLFs) and control barrier functions (CBFs). For linear systems with polytopic and ellipsoidal constraints, C-CLF-CBF-RRT requires solving a quadratically constrained quadratic program (QCQP) at every iteration of the algorithm, which can be done efficiently. We prove the probabilistic completeness of C-CLF-CBF-RRT and showcase its performance in simulation and hardware experiments.

Abstract:
Collapsing terrains, often present in search and rescue missions or planetary exploration, pose significant challenges for quadruped robots. This paper introduces a robust locomotion framework for safe navigation over unstable surfaces by integrating terrain probing, load-bearing analysis, motion planning, and control strategies. Unlike traditional methods that rely on specialized sensors or external terrain mapping alone, our approach leverages joint measurements to assess terrain stability without hardware modifications. A Model Predictive Control (MPC) system optimizes robot motion, balancing stability and probing constraints, while a state machine coordinates terrain probing actions, enabling the robot to detect collapsible regions and dynamically adjust its footholds. Experimental results on custom-made collapsing platforms and rocky terrains demonstrate the framework's ability to traverse collapsing terrain while maintaining stability and prioritizing safety.

Abstract:
Existing place recognition descriptors developed for single-agent SLAM struggle with multi-modal LiDAR differences in collaborative SLAM. To overcome this, we propose an online place recognition method for multi-modal LiDARs. This method introduces a dual-view combination descriptor, termed DVMM, by separately encoding azimuthal and vertical scene information. The place recognition process consists of two stages: loop closure detection and verification. In the detection stage, point clouds are projected onto an adaptive grid and a 1D azimuthal descriptor is generated via Gaussian-weighted column summation. The azimuthal descriptor is utilized to retrieve loop candidates through vector matching. In the verification stage, point clouds within a fixed height range are encoded as a binary occupancy image, which serves as the cross-section descriptor. Accurate loop closures are determined by performing image matching on the cross-section descriptors. We evaluate the proposed method on both public and real-world datasets encompassing a total of seven LiDAR sensors. The results demonstrate that DVMM significantly outperforms state-of-the-art descriptors in handling multi-modal LiDAR data and is compatible with collaborative SLAM systems. The code will be open-sourced upon acceptance.
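
As a rough illustration of the detection-stage idea (a 1D azimuthal descriptor from Gaussian-weighted column summation), the sketch below projects points onto a fixed polar grid and sums each azimuth column with Gaussian weights over the ring index. The fixed (non-adaptive) grid, the function name `azimuthal_descriptor`, all parameter values, and the choice to centre the Gaussian on the middle ring are assumptions made for illustration; they are not taken from the DVMM paper.

```python
import math

def azimuthal_descriptor(points, n_sectors=60, n_rings=20,
                         max_range=50.0, sigma=5.0):
    """Hypothetical 1D azimuthal descriptor of length n_sectors.

    points: iterable of (x, y, z) in the sensor frame (metres).
    """
    # Binary occupancy on a polar grid: rows are rings, columns are sectors.
    grid = [[0.0] * n_sectors for _ in range(n_rings)]
    for x, y, _z in points:
        r = math.hypot(x, y)
        if r >= max_range:
            continue  # ignore points outside the grid
        ring = int(r / max_range * n_rings)
        theta = math.atan2(y, x) % (2.0 * math.pi)
        sector = min(int(theta / (2.0 * math.pi) * n_sectors), n_sectors - 1)
        grid[ring][sector] = 1.0

    # Gaussian weights over the ring index (centre ring is an assumption).
    centre = (n_rings - 1) / 2.0
    weights = [math.exp(-0.5 * ((i - centre) / sigma) ** 2)
               for i in range(n_rings)]

    # Weighted sum down each azimuth column gives one value per sector.
    return [sum(weights[i] * grid[i][j] for i in range(n_rings))
            for j in range(n_sectors)]
```

Two such vectors could then be compared with any vector distance to shortlist loop candidates; the matching scheme the authors actually use is not specified in the abstract.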

Abstract:
Avoiding collisions between the fabric and obstacles is critical when transporting fabric pieces in a garment factory. If the fabric collides with a sharp-edged obstacle, it can be scratched or contaminated, resulting in poor product quality and increased waste. However, current fabric models are not accurate enough for this real-world application. It is almost impossible to model the dynamic motion of the fabric with high accuracy, because its motion and deformation are affected by many hard-to-estimate factors. In this paper, instead of using an accurate fabric model, we propose a new fabric motion modeling method using the proposed OBB-Transformer, which models the fabric motion as a time series of oriented bounding boxes (OBBs). Using OBB-Transformer, a dynamic collision avoidance method is designed to plan a robot trajectory connecting the start point and the goal point without collision between the fabric and the obstacle. The performance of the fabric dynamic motion modeling is compared between the proposed and conventional methods. Then, the collision avoidance of a piece of fabric using the proposed method is demonstrated on a real robot system in both 2D and 3D scenarios.

Abstract:
In this article, we address the shape formation problem for massive robot swarms in environments where external localization systems are unavailable. Achieving this task effectively with solely onboard measurements is still scarcely explored and faces some practical challenges. To solve this challenging problem, we propose the following novel results. Firstly, to estimate the relative positions among neighboring robots, a concurrent-learning based estimator is proposed. It relaxes the persistent excitation condition required in the classical ones such as the least-square estimator. Secondly, we introduce a finite-time agreement protocol to determine the shape location. This is achieved by estimating the relative position between each robot and a randomly assigned seed robot. The initial position of the seed one marks the shape location. Thirdly, based on the theoretical results of the relative localization, a novel behavior-based control strategy is devised. This strategy not only enables the adaptive shape formation of large groups of robots but also enhances the observability of inter-robot relative localization. Numerical simulation results are provided to verify the performance of our proposed strategy compared to the state-of-the-art ones. Additionally, outdoor experiments on real robots further demonstrate the practical effectiveness and robustness of our methods.

Abstract:
Event cameras report per-pixel brightness changes asynchronously with microsecond latency, but their output is incompatible with vision foundation models trained on conventional images. We propose State-Space Time Surfaces (S3TS), a training-free representation that recasts exponential-decay time surfaces as a diagonal state-space model with multi-scale temporal channels and Mamba-inspired selective decay. The resulting pseudo-RGB image is fed directly to a frozen OWLv2 detector for zero-shot, text-prompted object detection from events alone. We demonstrate two applications on a 6-DOF manipulator: event-only grasping with near-nadir refinement, and dense 3D scene reconstruction via multi-view TSDF fusion with neuromorphic surface descriptors. S3TS detects over twice as many objects as single-channel event representations and produces faithful 3D workspace meshes.
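
To make the state-space view concrete, the following minimal sketch treats an exponential-decay time surface as a per-pixel diagonal linear recurrence, h ← exp(−Δt/τ_k)·h + 1, with one channel per time constant τ_k. The function name `time_surface`, the three default time constants, and the unit input gain are illustrative assumptions; the Mamba-inspired selective decay and the mapping to a pseudo-RGB image described in the abstract are not reproduced here.

```python
import math

def time_surface(events, width, height, taus=(0.01, 0.05, 0.25), t_read=None):
    """Multi-scale exponential-decay time surface (hypothetical sketch).

    events: iterable of (t, x, y), sorted by time t in seconds (t >= 0).
    Returns nested lists indexed as state[y][x][k], one channel per tau.
    t_read, if given, is assumed to be no earlier than the last event.
    """
    n = len(taus)
    state = [[[0.0] * n for _ in range(width)] for _ in range(height)]
    last_t = [[0.0] * width for _ in range(height)]
    t_end = 0.0

    for t, x, y in events:
        dt = t - last_t[y][x]
        for k, tau in enumerate(taus):
            # Diagonal state update: decay the old state, then add the event.
            state[y][x][k] = math.exp(-dt / tau) * state[y][x][k] + 1.0
        last_t[y][x] = t
        t_end = t

    # Decay every pixel forward to a common read-out time.
    if t_read is None:
        t_read = t_end
    for y in range(height):
        for x in range(width):
            dt = t_read - last_t[y][x]
            for k, tau in enumerate(taus):
                state[y][x][k] *= math.exp(-dt / tau)
    return state
```

With three time constants, the three channels could be normalised and stacked as the colour planes of an image; how S3TS actually performs that mapping is not stated in the abstract.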

Abstract:
This paper presents a heterogeneous skill learning framework for asynchronous multi-robot relay pushing in complex and cluttered environments. To support cooperative relay transportation, we construct a skill library comprising room robot pushing, corridor helper pushing, and standby behaviors. We further propose a geometry-aware pushing strategy that enables contact-rich manipulation without relying on external force sensors. For the room robot, curriculum learning is adopted to decompose training into an approach-to-parcel phase and a parcel-to-target pushing phase, thereby improving training stability and task progression. For long-horizon transportation in constrained corridors, an affordance network is introduced to model the local feasibility of pushing actions, providing structured guidance that improves policy learning efficiency. The overall framework combines Soft Actor-Critic (SAC) with Dijkstra-based reachability maps to coordinate the "Room Robot Pushing" and "Corridor Helper Pushing" skills. Experimental results demonstrate high success rates across progressive curriculum lessons, suggesting that the proposed framework provides an effective skill primitive for cooperative multi-robot transportation.

Abstract:
The curved-spoke tri-wheel (CSTW) has been proposed as a simple mechanism for overcoming stair-like unstructured obstacles with fast, pure rolling motion. However, when the contact point transitions from one spoke to the next during climbing, the discontinuity in the radius of curvature between adjacent spokes causes a sudden drop in the effective rotational radius. As a result, the robot's linear speed drops abruptly, which can induce dynamic instability such as payload disturbance and slip, leading to difficulty in maintaining stable climbing. To mitigate these issues, we propose a passive Compliant Spiral Torsional Suspension (C-STS) placed between the motor and the wheel's drive axis to reduce the transition-induced deceleration. Using camera-based marker tracking, we obtain wheel COR velocities under low, medium, and high torsional stiffness and speed conditions, and quantify dynamic stability using the deceleration associated with the velocity drop at each contact transition. Comparing the cases with and without C-STS, the results reveal that the proposed C-STS effectively reduces deceleration under appropriate stiffness-speed combinations, owing to the torsional compliance and the release of spring energy during transition. Although the proposed C-STS was effective only under a limited range of conditions, it shows potential for further refinement through hybrid dynamic modeling and for application to other reconfigurable wheel systems with similar instabilities.

Abstract:
A quadruped robot is a promising system that can offer assistance comparable to that of guide dogs due to its similar form factor. However, various challenges remain in making these robots a reliable option for blind and low-vision (BLV) individuals. Among these challenges, noise and jerky motion during walking are critical drawbacks of existing quadruped robots. While these issues have largely been overlooked in guide dog robot research, our interviews with guide dog handlers and trainers revealed that acoustic and physical disturbances can be particularly disruptive for BLV individuals, who rely heavily on environmental sounds for navigation. To address these issues, we developed a novel walking controller for slow stepping and smooth foot swing/contact while maintaining human walking speed, as well as robust and stable balance control. The controller integrates with a perception system to facilitate locomotion over non-flat terrains, such as stairs. Our controller was extensively tested on the Unitree Go1 robot and, when compared with other control methods, demonstrated significant noise reduction, to half that of the default locomotion controller. To evaluate the usability, workload, and perceived noise of the developed system from a user's perspective, we conducted indoor walking experiments. In these tests, participants compared our controller with the robot's default controller. The results demonstrated higher user acceptance of our controller, highlighting its potential to improve the overall user experience of robotic guide dogs.

Abstract:
3D Gaussian Splatting (3DGS) enables fast, high-quality novel view synthesis but relies on densification followed by pruning to optimize the number of Gaussians. Existing mask-based pruning, such as MaskGS, regularizes the global mean of the mask, which is misaligned with the local per-pixel (per-ray) reconstruction loss that determines image quality along individual camera rays. This paper introduces SVR-GS, a spatially variant regularizer that renders a per-pixel spatial mask from each Gaussian's effective contribution along the ray, thereby applying sparsity pressure where it matters: on low-importance Gaussians. We explore three spatial-mask aggregation strategies, implement them in CUDA, and conduct a gradient analysis to motivate our final design. Extensive experiments on Tanks&Temples, Deep Blending, and Mip-NeRF360 datasets demonstrate that, on average across the three datasets, the proposed SVR-GS reduces the number of Gaussians by 1.79× compared to MaskGS and 5.63× compared to 3DGS, while incurring only 0.50 dB and 0.40 dB PSNR drops, respectively. These gains translate into significantly smaller, faster, and more memory-efficient models, making them well-suited for real-time applications such as robotics, AR/VR, and mobile perception. Additional materials are available on our project page: https://ashkantaghipour.github.io/svrgs/.

Abstract:
Magnetic levitation is poised to revolutionize industrial automation by integrating flexible in-machine product transport and seamless manipulation. It is expected to become the standard drive technology for automated manufacturing. However, controlling such systems is inherently challenging due to their complex, unstable dynamics. Traditional control approaches, which rely on hand-crafted control engineering, typically yield robust but conservative solutions, with their performance closely tied to the expertise of the engineering team. In contrast, learning-based neural control presents a promising alternative. This paper presents the first neural controller for 6D magnetic levitation. Trained end-to-end on interaction data from a proprietary controller, it directly maps raw sensor data and 6D reference poses to coil current commands. The neural controller can effectively generalize to previously unseen situations while maintaining accurate and robust control. These results underscore the practical feasibility of learning-based neural control in complex physical systems and suggest a future where such a paradigm could enhance or even substitute traditional engineering approaches in demanding real-world applications. The trained neural controller, source code, and demonstration videos are publicly available at https://sites.google.com/view/neural-maglev.

Abstract:
Conventional mobile tensegrity robots constructed with straight links offer mobility at the cost of locomotion speed. While spherical robots provide highly effective rolling behavior, they often lack the stability required for navigating unstructured terrain common in many space exploration environments. This research presents a solution with a semi-circular, curved-link tensegrity robot that strikes a balance between efficient rolling locomotion and controlled stability, enabled by discontinuities present at the arc endpoints. Building upon an existing geometric static modeling framework [1], this work presents the system design of an improved Tensegrity eXploratory Robot 2 (TeXploR2). Internal shifting masses instantaneously roll along each curved-link, dynamically altering the two points of contact with the ground plane. Simulations of quasistatic, piecewise continuous locomotion sequences reveal new insights into the positional displacement between inertial and body frames. Non-intuitive rolling behaviors are identified and experimentally validated using a tetherless prototype, demonstrating successful dynamic locomotion. A preliminary impact test highlights the tensegrity structure's inherent shock absorption capabilities and conformability. Future work will focus on finalizing a dynamic model that is experimentally validated with extended testing in real-world environments as well as further refinement of the prototype to incorporate additional curved-links and subsequent ground contact points for increased controllability.

Abstract:
Friction is the essential mediator of terrestrial locomotion, yet in robotic systems it is almost always treated as a passive property fixed by surface materials and conditions. Here, we introduce ultrasonic lubrication as a method to actively control friction in robotic locomotion. By exciting resonant structures at ultrasonic frequencies, contact interfaces can dynamically switch between "grip" and "glide" states, enabling locomotion. We developed two friction control modules: a cylindrical design for lumen-like environments and a flat-plate design for external surfaces, and integrated them into bio-inspired systems modeled after inchworm and wasp ovipositor locomotion. Both systems achieved bidirectional locomotion with nearly perfect locomotion efficiencies that exceeded 90%. Friction characterization experiments further demonstrated substantial friction reduction across various surfaces, including rigid, soft, granular, and biological tissue interfaces, under dry and wet conditions, and on surfaces with different levels of roughness, confirming the versatility of ultrasonic lubrication for locomotion applications. These findings establish ultrasonic lubrication as a viable active friction control mechanism for robotic locomotion, with the potential to reduce design complexity and improve the efficiency of robotic locomotion systems.

Abstract:
Recent advances in neural rendering have enabled the 3D reconstruction of dynamic humans from monocular videos, with applications in robotics. However, it is still challenging to reconstruct clear humans from in-the-wild video encountering motion blur, causing shape and appearance inconsistencies, especially in blurry regions like hands and legs. In this paper, we propose ExFMan, the first neural rendering framework that unveils the possibility of rendering high-quality humans in rapid motion with a hybrid frame-based RGB and bio-inspired event camera. The "out-of-the-box" insight is to leverage the high temporal information of event data in a complementary manner and adaptively reweight the effect of losses for both RGB frames and events in the local regions, according to the velocity of the rendered human. This significantly mitigates the inconsistency associated with motion blur in the RGB frames. Specifically, we first formulate a velocity field of the 3D body in the canonical space and render it to image space to identify the body parts with motion blur. We then propose two novel losses, i.e., velocity-aware photometric loss and velocity-relative event loss, to optimize the neural human for both modalities under the guidance of the estimated velocity. In addition, we incorporate novel pose regularization and alpha losses to facilitate continuous pose and clear boundary. Extensive experiments on synthetic and real-world datasets demonstrate that ExFMan can reconstruct sharper and higher quality humans over the compared baselines and the state-of-the-art methods for diverse blurry subjects.

Abstract:
We present an integrated (or end-to-end) framework for the Real2Sim2Real problem of manipulating deformable linear objects (DLOs) based on visual perception. Working with a parameterised set of DLOs, we use likelihood-free inference (LFI) to compute the posterior distributions for the physical parameters, with which we can approximately simulate the behaviour of each specific DLO. We use these posteriors for domain randomisation while training, in simulation, object-specific visuomotor policies (i.e. assuming only visual and proprioceptive sensing) for a DLO reaching task, using model-free reinforcement learning. We demonstrate the utility of this approach by deploying sim-trained DLO manipulation policies in the real world in a zero-shot manner, i.e. without any further fine-tuning. In this context, we evaluate the capacity of a prominent LFI method to perform fine classification over the parametric set of DLOs, using only visual and proprioceptive data obtained in a dynamic manipulation trajectory. We then study the implications of the resulting domain distributions in sim-based policy learning and real-world performance.

Abstract:
The joint optimization of sensor poses and 3D structure is fundamental for state estimation in robotics and related fields. Current LiDAR systems often prioritize pose optimization, with structure refinement either omitted or treated separately using implicit representations. This paper introduces a framework for simultaneous optimization of sensor poses and 3D map, represented as surfels. A generalized LiDAR uncertainty model is proposed to address less reliable measurements in varying scenarios. Experimental results on public datasets demonstrate improved performance over most comparable state-of-the-art methods. The system is provided as open-source software to support further research.

Abstract:
In this letter, we present tightly coupled LiDAR-IMU-leg odometry, which is robust to challenging conditions such as featureless environments and deformable terrains. We developed an online learning-based leg kinematics model named the neural leg kinematics model, which incorporates tactile information (foot reaction force) to implicitly express the nonlinear dynamics between robot feet and the ground. Online training of this model enhances its adaptability to weight load changes of a robot (e.g., assuming delivery or transportation tasks) and terrain conditions. According to the neural adaptive leg odometry factor and online uncertainty estimation of the leg kinematics model-based motion predictions, we jointly solve online training of this kinematics model and odometry estimation on a unified factor graph to retain the consistency of both. The proposed method was verified through real experiments using a quadruped robot in two challenging situations: 1) a sandy beach, representing an extremely featureless area with a deformable terrain, and 2) a campus, including multiple featureless areas and terrain types of asphalt, gravel (deformable terrain), and grass. Experimental results showed that our odometry estimation incorporating the neural leg kinematics model outperforms state-of-the-art works. Our project page is available for further details: https://takuokawara.github.io/RAL2025_project_page/

Abstract:
This paper presents a novel shared autonomy and baseline policy adapting framework for human-robot interactions in high-level context-aware robotic tasks. With a unique methodology that leverages hierarchies in decision-making as well as variational analysis of human policy, we propose a mathematical model of shared autonomy policy. The framework aims at interpretable high-level decision-making for efficient robot operation with human in the loop. We modeled the decision-making process using hierarchical Markov decision processes (MDPs) in an algorithm we called policy adapting, where the autonomous system policy is adapted, and hence, shaped by incorporating design variables contextual to the robot, human, task, and pre-training. By integrating deep reinforcement learning within a multi-agent hierarchical context, we present an end-to-end algorithm to train a baseline policy designed for shared autonomy. We showcase the effectiveness of our framework, and particularly the interplay between different design elements and human's skill level, in a pilot study with a human user in a simulated sequence of high-level pick-and-place tasks. The proposed framework advances the state-of-the-art in shared autonomy for robotic tasks, but can also be applied to other domains of autonomous operation.

Abstract:
We propose PRED-MPPI, the first MPPI variant that seamlessly integrates real-time disturbance preview and adaptive discretization for quadrotor tracking control under significant model inaccuracies and time-varying disturbances. Unlike prior MPPI variants (e.g., L1-MPPI, DA-MPPI), which assume constant or matched disturbances, PRED-MPPI leverages a high-order Generalized Extended State Observer for disturbance preview and a Variable Discretization Grid (VDG) to reduce computation and control variance. The synergy enables real-time (50 Hz) quadrotor control under time-varying and mismatched disturbances. Extensive comparative simulation and real-world Crazyflie experiments demonstrate substantial performance gains. In AirSim simulation, PRED-MPPI reduces computation time by over 30%, and mean RMSE by 10.3%, 13.5%, and 14.6% compared to baseline MPPI, and by 2.59%, 3.62%, and 5.80% compared to DA-MPPI across three representative scenarios. In real-world Crazyflie experiments, for ground-effect-disturbed hovering, PRED-MPPI reduces mean and standard deviation (Std) of XY plane error by 14.2%/17.9% and 6.03%/21.6% compared to MPPI and DA-MPPI; for fan-induced wind experiments, PRED-MPPI yields improvements of 23.4%/36.8% and 13.8%/25.0% in RMSE and tracking error Std. These results establish PRED-MPPI as the first disturbance-preview MPPI achieving real-world UAV robustness and efficiency, paving the way for deployment on resource-limited robotic platforms. GitHub page with videos is at https://pred-mppi.github.io/

Abstract:
Handling fragile objects at high speeds presents a significant challenge. To address this challenge, researchers explored hybrid grippers and grippers with adjustable stiffness. However, grippers exhibiting anisotropic stiffness have received little attention. This paper presents an extreme case of an anisotropic stiffness gripper designed for high-speed handling of delicate foods. The proposed gripper incorporates linear motors and low-friction linear guides. Its extremely low friction in the grasping direction ensures minimal grasping force (less than 0.1 N) and high compliance, enabling the secure handling of fragile objects. Simultaneously, its rigid structure provides sufficient stiffness in the translational direction, ensuring stability during high-speed motion. Leveraging anisotropic stiffness, the gripper achieves both gentle grasp and stable high-speed translation, two requirements that typically necessitate a trade-off. Theoretical analysis was conducted to determine the maximum permissible acceleration under two conditions: with and without stiffness anisotropy. Results indicate that stiffness anisotropy enables significantly higher acceleration during translational motion, thereby reducing task time. Pick-and-place experiments on a 3D-printed object and on delicate foods (a potato chip, a block of tofu, and a piece of dried seaweed) validated the theoretical findings and demonstrated the gripper's capability to handle fragile objects at high speeds effectively.

Abstract:
Collaborative perception enhances the reliability and spatial coverage of autonomous vehicles by sharing complementary information across vehicles, offering a promising solution to long-tail scenarios that challenge single-vehicle perception. However, the bandwidth constraints of vehicular networks make transmitting the entire feature map infeasible. Recent methods therefore adopt a foreground-centric paradigm, transmitting only predicted foreground region features but discarding background which encodes essential context. We propose FadeLead, a foreground-centric framework that overcomes this limitation by learning to encapsulate background context into compact foreground features during training. At the core of our design is a curricular learning strategy that leverages background cues early on but progressively prunes them away, forcing the model to internalize context into foreground representations without transmitting background itself. Extensive experiments on both simulated and real-world scenarios show that FadeLead outperforms prior methods under different bandwidth settings, underscoring the effectiveness of context-enriched foreground sharing. Code, demos, and checkpoints are available at https://wyhallenwu.github.io/FadeLead/.

Abstract:
Visuotactile sensors typically employ sparse marker arrays that limit spatial resolution and lack clear analytical force-to-image relationships. To solve this problem, we present MoiréTac, a dual-mode sensor that generates dense interference patterns via overlapping micro-gratings within a transparent architecture. When two gratings overlap with misalignment, they create moiré patterns that amplify microscopic deformations. The design preserves optical clarity for vision tasks while producing continuous moiré fields for tactile sensing, enabling simultaneous 6-axis force/torque measurement, contact localization, and visual perception. We combine physics-based features (brightness, phase gradient, orientation, and period) from moiré patterns with deep spatial features. These are mapped to 6-axis force/torque measurements, enabling interpretable regression through end-to-end learning. Experimental results demonstrate three capabilities: force/torque measurement with R²>0.98 across tested axes; sensitivity tuning through geometric parameters (threefold gain adjustment); and vision functionality for object classification despite moiré overlay. Finally, we integrate the sensor into a robotic arm for cap removal with coordinated force and torque control, validating its potential for dexterous manipulation.

Abstract:
Due to the limited online computational resources and the inherent probability of hardware and software failures of real-world robots, large-scale formation planning faces two common challenges: computational intractability and agent failures. Based on the theory of sparse graphs and the maximum clique, we achieve a resilient and efficient formation planning (RE-Formation) to address these issues. To improve the computational efficiency of trajectory planning while ensuring flexible formation maneuvers, we introduce sparse graphs to describe connection relationships and present a sparse graph construction method with closed-form solutions. The sparse graphs ensure the Global Rigidity for uniquely corresponding to a geometric shape and Preserve the main Features of complete graphs, denoted as the GRPF sparse graph. To prevent the impact of abnormal agents, the problem of eliminating abnormal agents is transformed into an outlier rejection problem that can be solved by computing the maximum clique. We approximate the maximum clique by periodically triggering the calculation of the maximum k-core to meet the real-time computational demands of large-scale swarms. We validate the performance through real-world experiments and implement formation planning with 100 drones in simulation. Benchmark comparisons and ablation experiments demonstrate the effectiveness of
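
The idea of using the maximum k-core as a cheap stand-in for the maximum clique can be sketched with the standard peeling procedure below. It assumes a hypothetical "consistency graph" whose nodes are agents and whose edges join pairs of agents that agree with each other; how RE-Formation builds that graph and when it triggers the computation are not given in the abstract, and the name `max_k_core` is illustrative.

```python
def max_k_core(adj):
    """Return (k, nodes) for the maximum k-core of an undirected graph.

    adj: dict mapping each node to the set of its neighbours.
    Uses simple min-degree peeling (quadratic time, fine for a sketch).
    """
    if not adj:
        return 0, set()

    deg = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    core = {}  # core number of each node
    k = 0
    while remaining:
        # Remove the node of smallest current degree.
        v = min(remaining, key=deg.get)
        k = max(k, deg[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1

    k_max = max(core.values())
    return k_max, {v for v, c in core.items() if c == k_max}
```

Any clique of size m lies inside the (m-1)-core, so the maximum k-core always contains the maximum clique; nodes outside it are candidates for rejection as outliers. The k-core may be larger than the clique, which is why it is only an approximation.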

Abstract:
Single-domain generalization is essential for object detection, particularly when training models on a single source domain and evaluating them on unseen target domains. Domain shifts, such as changes in weather, lighting, or scene conditions, pose significant challenges to the generalization ability of existing models. To address this, we propose Cross-Domain Feature Knowledge Distillation (CD-FKD), which enhances the generalization capability of the student network by leveraging both global and instance-wise feature distillation. The proposed method uses diversified data through downscaling and corruption to train the student network, whereas the teacher network receives the original source domain data. The student network mimics the features of the teacher through both global and instance-wise distillation, enabling it to extract object-centric features effectively, even for objects that are difficult to detect owing to corruption. Extensive experiments on challenging scenes demonstrate that CD-FKD outperforms state-of-the-art methods in both target domain generalization and source domain performance, validating its effectiveness in improving object detection robustness to domain shifts. This approach is valuable in real-world applications, like autonomous driving and surveillance, where robust object detection in diverse environments is crucial.

Abstract:
Bimanual robot manipulation for long-horizon (LH) tasks is crucial for the practical use of humanoids, but it struggles with robust planning and generalization. Approaches based on Task and Motion Planning (TAMP), transformers, and Large Language Models (LLMs) suffer from critical limitations, including costly human demonstrations, task planner hallucination, and unsatisfactory generalization performance. To address these challenges, this paper introduces the Multi-modal Affordance Planner with Temporal-Context Action Policy (MAP-TCA), a novel hierarchical framework that learns and performs diverse bimanual long-horizon (LH) tasks by generating action plans from MAP. The MAP-TCA consists of a planner based on Bimanual Robot Manipulation Retrieval-Augmented Generation (Bi-RAG)-enhanced Large-Language Model (LLM) and a low-level Temporal Context Action Policy (TCA). With multimodal inputs including vision, language, and affordance for primitive action demonstration, Bi-RAG generates a Primitive Action (PA)-specific embedded space. Then, MAP generates LH plans, LH demonstrations, and reward functions within the PA-specific embedded space, thereby mitigating hallucinations and reducing training cost. The generated plan, demos, and rewards then guide TCA, which learns the LH tasks via behavior cloning (BC) and online fine-tuning. We demonstrate that the proposed MAP-TCA achieves an average success rate of 86.75%, comparable to a baseline model, TCA, which is trained extensively on direct human demonstrations and manually designed rewards. Our work presents a scalable and generalizable solution for complex bimanual LH manipulation, significantly reducing the dependency on human supervision.

Abstract:
The multi-robot unlabeled motion planning problem of concurrently assigning robots to goals and generating safe trajectories is central in many collaborative tasks. Recent Graph Neural Network methods offer scalable decentralized solutions but rely on simplified dynamics and simulation environments, overlooking key challenges of real-world deployment such as dynamic feasibility and communication constraints. To address these gaps, we propose a hierarchical framework that combines a Graph ATtention Planner (GATP) with a decentralized Nonlinear Model Predictive Controller (NMPC). GATP provides intermediate subgoals through multi-robot cooperation, and the NMPC enforces safety under nonlinear dynamics and actuation constraints. We evaluate our framework in both simulation and real-world quadrotor experiments. Thanks to attention mechanisms and minimal communication requirements, we demonstrate improved generalization to larger teams, robustness to communication delays up to 200 ms and practical feasibility with decentralized on-board inference.

Abstract:
To support the circular economy, robotic systems must not only assemble new products but also disassemble end-of-life (EOL) ones for reuse, recycling, or safe disposal. Existing approaches to disassembly sequence planning often assume deterministic and fully observable product models, yet real EOL products frequently deviate from their initial designs due to wear, corrosion, or undocumented repairs. We argue that disassembly should therefore be formulated as a Partially Observable Markov Decision Process (POMDP), which naturally captures uncertainty about the product's internal state. We present a mathematical formulation of disassembly as a POMDP, in which hidden variables represent uncertain structural or physical properties. Building on this formulation, we propose a task and motion planning framework that automatically derives specific POMDP models from CAD data, robot capabilities, and inspection results. To obtain tractable policies, we approximate this formulation with a reinforcement-learning approach that operates on stochastic action outcomes informed by inspection priors, while a Bayesian filter continuously maintains beliefs over latent EOL conditions during execution. Using three products on two robotic systems, we demonstrate that this probabilistic planning framework outperforms deterministic baselines in terms of average disassembly time and variance, generalizes across different robot setups, and successfully adapts to deviations from the CAD model, such as missing or stuck parts.
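
The belief maintenance mentioned in this abstract can be sketched as a discrete Bayes filter over one hidden condition. The state names, the `unscrew` outcome model, and all probabilities below are hypothetical values chosen for illustration, not figures from the paper.

```python
def bayes_update(belief, likelihood, observation):
    """One discrete Bayes-filter step.

    belief: dict state -> prior probability.
    likelihood: dict state -> dict observation -> P(observation | state).
    """
    unnormalized = {s: belief[s] * likelihood[s][observation] for s in belief}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical inspection prior: the screw is probably free.
prior = {"free": 0.8, "stuck": 0.2}
# Hypothetical outcome model for an 'unscrew' action.
outcome_model = {
    "free": {"success": 0.9, "failure": 0.1},
    "stuck": {"success": 0.2, "failure": 0.8},
}
posterior = bayes_update(prior, outcome_model, "failure")
```

After a single failed attempt the belief shifts from 0.2 to about 0.67 that the part is stuck, which is the kind of evidence a planner could use to switch to a different removal action.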

Abstract:
Object rearrangement planning in complex, cluttered environments is a common challenge in warehouses, households, and rescue sites. Prior studies largely address monotone instances, whereas real-world tasks are often non-monotone: objects block one another and must be temporarily relocated to intermediate positions before reaching their final goals. In such settings, effective multi-agent collaboration can substantially reduce the time required to complete tasks. This paper introduces Centralized, Asynchronous, Multi-agent Monte Carlo Tree Search (CAM-MCTS), a novel framework for general-purpose makespan-efficient object rearrangement planning in challenging environments. CAM-MCTS combines centralized task assignment, where agents remain aware of each other's intended actions to facilitate globally optimized planning, with an asynchronous task execution strategy that enables agents to take on new tasks at appropriate time steps, rather than waiting for others, guided by a one-step look-ahead cost estimate. This design minimizes idle time, prevents unnecessary synchronization delays, and enhances overall system efficiency. We evaluate CAM-MCTS across a diverse set of monotone and non-monotone tasks in cluttered environments, demonstrating consistent reductions in makespan compared to strong baselines. Finally, we validate our approach on a real-world multi-agent system under different configurations, further confirming its effectiveness and robustness. Videos can be found at https://www.youtube.com/watch?v=kNRg2kNnFxg.

Abstract:
Recent advances in robot learning have motivated integrated pipelines that combine hardware for data collection with imitation learning algorithms. Existing data collection methods like leader-follower, VR/AR, and exoskeletons rely on costly hardware and exhibit limited scalability, while imitation learning algorithms built on them remain highly sensitive to viewpoint shifts, further constraining generalizability. Handheld grippers provide a low-cost, robot-agnostic alternative, but prior systems bypass exocentric view alignment by relying solely on wrist-mounted cameras, resulting in narrowed observation and reduced policy robustness. We propose VIL, a framework pairing a customized handheld gripper with a zero-shot, exocentric viewpoint-robust imitation learning algorithm, bridging the handheld gripper with exocentric views. Our approach employs adapters for appearance alignment and a hybrid encoder design to extract view-consistent representations for an ACT-style policy, enabling robust execution across diverse perspectives. We further optimize the data collection pipeline and validate the system in both simulation and real-world tasks. Experiments show that VIL achieves stable performance under viewpoint shifts, challenging low-horizon scenarios, and dynamic perspectives, outperforming SOTA methods and demonstrating a scalable pipeline for manipulator-independent, viewpoint-robust policy learning. The project repository containing code and hardware is available at https://github.com/liboyan233/VIL.git.

Abstract:
We propose a novel model of a virtual energy storage system (ESS) that leverages the aggregate battery capacity of parked and idling electric vehicles (EVs). Such an energy service is offered to a community of prosumers as a temporary energy buffer and managed by a parking lot manager (PLM), which absorbs the risks arising from the unreliability of the EV-based ESS due to the arrival and departure of EVs. Hence, from the prosumers' perspective, such a virtual storage service behaves deterministically. Both the PLM and the prosumers act as self-interested agents that optimize their own objectives, subject to operational constraints, leading to a non-cooperative game. To deal with the uncertainty of prosumers' renewable net generation and EVs' arrivals/departures, we use a data-driven distributionally robust approach, showing that a tractable reformulation can be obtained, where the equilibrium solutions can be computed as a variational inequality. Numerical simulations based on real data illustrate the behavior of the proposed model.

Abstract:
In recent years, diffusion models have demonstrated remarkable potential across diverse domains, from vision generation to language modeling. Transferring their generative capabilities to modern end-to-end autonomous driving systems has also emerged as a promising direction. However, existing diffusion-based trajectory generative models often exhibit mode collapse, where different random noises converge to similar trajectories after the denoising process. Therefore, state-of-the-art models often rely on anchored trajectories from a pre-defined trajectory vocabulary or scene priors in the training set to mitigate collapse and enrich the diversity of generated trajectories, but such inductive biases are not available in real-world deployment, which can hinder generalization to unseen scenarios. In this work, we investigate the possibility of effectively tackling the mode collapse challenge without the assumption of a pre-defined trajectory vocabulary or pre-computed scene priors. Specifically, we propose TransDiffuser, an encoder-decoder based generative trajectory planning model, where the encoded scene information and motion states serve as the multi-modal conditional input of the denoising decoder. Different from existing approaches, we exploit a simple yet effective multi-modal representation decorrelation optimization mechanism during the denoising process to enrich the latent representation space, which better guides the downstream generation. Without any predefined trajectory anchors or pre-computed scene priors, TransDiffuser achieves a PDMS of 94.9 on the closed-loop planning-oriented benchmark NAVSIM, surpassing previous state-of-the-art methods. Qualitative evaluation further shows that TransDiffuser generates more diverse and plausible trajectories that explore more drivable area.

Abstract:
Real-time path planning in unknown non-convex environments is challenging, as obstacle updates can invalidate existing paths while narrow passages restrict feasible connectivity. This paper presents InBi-RRT, an incremental bidirectional tree-based framework that grows a reverse tree from the goal and maintains a reusable forward tree from the start. When the current path becomes invalid, a cost-guided expansion selectively extends the forward tree to establish collision-free connections with the reverse tree, followed by backtracking and lightweight path optimization for efficient repair. Simulation results in unknown and non-convex scenarios demonstrate that InBi-RRT achieves significantly faster replanning than baseline methods, being up to 5.5× faster than RT-RRT and 22× faster than RRT^X, with paths up to 19.8% shorter than RRT^X under the same sample count. Furthermore, real-world experiments in an indoor maze-like environment verify the practicality and robustness of the proposed planner in unknown non-convex scenarios.

Abstract:
Autonomous navigation in GPS-denied and visually degraded environments remains challenging for unmanned aerial vehicles (UAVs). To this end, we investigate the use of a monocular thermal camera as a standalone sensor on a UAV platform for real-time depth estimation and simultaneous localization and mapping (SLAM). To extract depth information from thermal images, we propose a novel pipeline employing a lightweight supervised network with recurrent blocks (RBs) integrated to capture temporal dependencies, enabling more robust predictions. The network combines lightweight convolutional backbones with a thermal refinement network (T-RefNet) to refine raw thermal inputs and enhance feature visibility. The refined thermal images and predicted depth maps are integrated into ORB-SLAM3, enabling thermal-only localization. Unlike previous methods, the network is trained on a custom non-radiometric dataset, obviating the need for high-cost radiometric thermal cameras. Experimental results on datasets and UAV flights demonstrate competitive depth accuracy and robust SLAM performance under low-light conditions. On the radiometric VIVID++ (indoor-dark) dataset, our method achieves an absolute relative error of approximately 0.06, compared to baselines exceeding 0.11. In our non-radiometric indoor set, baseline errors remain above 0.24, whereas our approach remains below 0.10. Thermal-only ORB-SLAM3 maintains a mean trajectory error under 0.4 m.
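
The abstract reports absolute relative error without defining it. The sketch below assumes the definition commonly used in depth-estimation benchmarks, the mean of |predicted - true| / true over pixels with valid ground truth. The function name and the flat-list input are illustrative choices, not the authors' evaluation code.

```python
def abs_rel_error(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Toy depths in metres; the third pixel has no ground truth and is skipped.
predicted = [1.1, 2.0, 3.0]
ground_truth = [1.0, 2.0, 0.0]
error = abs_rel_error(predicted, ground_truth)
```

Under this definition, a value of 0.06 means predictions deviate from the true depth by about 6% on average.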

Abstract:
Learning from human video demonstrations offers a scalable alternative to teleoperation or kinesthetic teaching, but poses challenges for robot manipulators due to embodiment differences and joint feasibility constraints. We address this problem by proposing the Joint Flow Trajectory Optimization (JFTO) framework for grasp pose generation and object trajectory imitation under the video-based Learning from Demonstration (LfD) paradigm. Rather than directly imitating human hand motions, our method treats demonstrations as object-centric guides, balancing three objectives: (i) selecting a feasible grasp pose, (ii) generating object trajectories consistent with demonstrated motions, and (iii) ensuring collision-free execution within robot kinematics. To capture the multimodal nature of demonstrations, we extend flow matching to SE(3) for probabilistic modeling of object trajectories, enabling density-aware imitation that avoids mode collapse. The resulting optimization integrates grasp similarity, trajectory likelihood, and collision penalties into a unified differentiable objective. We validate our approach in both simulation and real-world experiments across diverse real-world manipulation tasks.

Abstract:
Model-predictive control (MPC) is a state-of-the-art control method for constrained robotic systems, yet deployment on resource-limited hardware remains difficult. This challenge is magnified by expressive conic constraints, which offer greater modeling power but require significantly more computation than linear alternatives. To address this challenge, we extend recent work developing fast, structure-exploiting, cached solvers for embedded applications based on the Alternating Direction Method of Multipliers (ADMM) to provide support for second-order cones, as well as C++ code generation from Python, MATLAB, and Julia. Microcontroller benchmarks show that our solver provides up to a two-order-of-magnitude speedup, ranging from 10.6x to 142.7x, over state-of-the-art embedded solvers on QP and SOCP problems, and enables us to fit order-of-magnitude larger problems in memory. We validate our solver's deployed performance through simulation and hardware experiments, including trajectory tracking with conic constraints on a 27g Crazyflie quadrotor. Our open-source code is available at https://tinympc.org.
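
ADMM-style solvers commonly handle cone constraints through a projection step, and the Euclidean projection onto a second-order cone has a well-known closed form. The sketch below shows that projection in plain Python. Whether the authors' solver uses exactly this routine is an assumption, and `project_soc` is a hypothetical name.

```python
import math


def project_soc(v, t):
    """Project the point (v, t) onto the second-order cone {(x, s): ||x||_2 <= s}."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= t:
        # Already inside the cone.
        return list(v), t
    if norm <= -t:
        # Inside the polar cone: the nearest point is the origin.
        return [0.0] * len(v), 0.0
    # Otherwise project onto the cone boundary.
    scale = (norm + t) / (2.0 * norm)
    return [scale * x for x in v], scale * norm


x_proj, s_proj = project_soc([3.0, 4.0], 1.0)
```

The three cases need only a norm and a scaling, with no iteration or matrix factorization, which makes such a step comparatively cheap for a microcontroller.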

Abstract:
Accurate 3D reconstruction in visually-degraded underwater environments remains a formidable challenge. Single-modality approaches are insufficient: vision-based methods fail due to poor visibility and geometric constraints, while sonar is crippled by inherent elevation ambiguity and low resolution. Consequently, prior fusion techniques rely on heuristics and flawed geometric assumptions, leading to significant artifacts and an inability to model complex scenes. In this paper, we introduce SonarSweep, a novel end-to-end deep learning framework that overcomes these limitations by adapting the principled plane sweep algorithm for cross-modal fusion between sonar and visual data. Extensive experiments in both high-fidelity simulation and real-world environments demonstrate that SonarSweep consistently generates dense and accurate depth maps, significantly outperforming state-of-the-art methods under challenging conditions, particularly in high turbidity. To foster further research, we publicly release our code and a novel dataset featuring synchronized stereo-camera and sonar data, the first of its kind, at https://github.com/LIAS-CUHKSZ/SonarSweep.

Abstract:
In reinforcement learning (RL) for robotic manipulation, the Decision Transformer (DT) has emerged as an effective framework for addressing long-horizon tasks. However, DT's performance depends heavily on the coverage of collected experiences. Without an active exploration mechanism, standard DT relies on uniform replay, which leads to poor sample efficiency, limited exploration, and reduced overall effectiveness. At the same time, while excessive exploration can help avoid local optima, it often delays policy convergence and leads to degraded efficiency. To address these limitations, we propose E^2DT, a DT-guided k-Determinantal Point Process sampling framework that enables the model to actively shape its own experience selection. Our framework is experience-aware, allowing E^2DT to be both efficient, by prioritizing sampling quality (e.g., high-return, high-uncertainty, and underrepresented trajectories), and effective, by ensuring diversity across trajectory windows to preserve policy optimality. Specifically, DT's internal latent embeddings measure diversity across trajectory windows, while quality is quantified through a composite metric that integrates return-to-go (RTG) quantiles, predictive uncertainty, and stage coverage (inverse frequency). These two dimensions are integrated into a novel quality-diversity joint kernel that prioritizes the most informative experiences, thereby enabling learning that is both efficient and effective. We evaluate E^2DT on challenging robotic manipulation benchmarks in both simulation and real-robot settings. Results show that it consistently outperforms prior methods. These findings demonstrate that coupling policy learning with experience-aware sampling provides a principled path toward robust long-horizon robotic learning.
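
One standard way to combine quality and diversity in a determinantal point process is the kernel L[i][j] = q_i * S[i][j] * q_j, where q is a per-item quality score and S is a similarity matrix. The sketch below builds such a kernel from cosine similarity and selects k items greedily by determinant. This is a deterministic greedy stand-in, not true k-DPP sampling, and neither the kernel form nor the helper names are confirmed as the authors' design.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, result = len(m), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            result = -result
        result *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n):
                m[r][j] -= f * m[c][j]
    return result


def greedy_select(quality, embeddings, k):
    """Greedily pick k indices that maximize det of the quality-diversity kernel."""
    n = len(quality)
    L = [[quality[i] * cosine(embeddings[i], embeddings[j]) * quality[j]
          for j in range(n)] for i in range(n)]
    chosen = []
    for _ in range(k):
        best, best_det = None, -1.0
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            d = det([[L[a][b] for b in idx] for a in idx])
            if d > best_det:
                best, best_det = i, d
        chosen.append(best)
    return chosen


# Items 0 and 1 are near-duplicates; item 2 is lower quality but different.
picked = greedy_select([1.0, 0.9, 0.8], [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]], k=2)
```

In this toy case the selection keeps item 0 and then prefers the dissimilar item 2 over the higher-quality near-duplicate item 1. This is the behavior a quality-diversity kernel is meant to encourage.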

Abstract:
This paper addresses the problem of robot navigation in challenging dynamic environments by extending the Velocity Obstacle (VO) framework to the Nonlinear Acceleration Obstacle (NAO). The NAO represents the set of robot accelerations that would lead to collisions with an obstacle moving along an arbitrary trajectory. By formulating the problem in the acceleration domain, the method allows direct selection of accelerations, the natural control input of second-order systems, to generate safe avoidance maneuvers in complex dynamic environments. Simulation results show that NAO enables real-time collision avoidance while explicitly accounting for both robot kinematics (velocity) and dynamics (acceleration). The proposed framework thus provides a reactive and efficient basis for autonomous navigation in complex dynamic environments.
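
The NAO definition in this abstract, the set of accelerations that lead to a collision with an obstacle on an arbitrary trajectory, can be illustrated with a brute-force membership test. The sketch assumes a 2D point-mass robot holding a constant acceleration over a finite horizon, disc-shaped bodies, and time sampling. All of these are simplifications introduced here, not the paper's construction.

```python
import math


def acceleration_in_nao(p_robot, v_robot, accel, obstacle_traj, radius, horizon, dt=0.05):
    """Return True if holding `accel` constant leads to a collision within `horizon`.

    obstacle_traj: callable t -> (x, y) obstacle position (arbitrary trajectory).
    radius: sum of robot and obstacle radii (disc approximation, an assumption).
    """
    steps = int(round(horizon / dt))
    for i in range(steps + 1):
        t = i * dt
        # Robot position under constant acceleration (second-order model).
        x = p_robot[0] + v_robot[0] * t + 0.5 * accel[0] * t * t
        y = p_robot[1] + v_robot[1] * t + 0.5 * accel[1] * t * t
        ox, oy = obstacle_traj(t)
        if math.hypot(x - ox, y - oy) <= radius:
            return True
    return False


def static_obstacle(t):
    """Hypothetical obstacle that stays at (2, 0)."""
    return (2.0, 0.0)


# Robot heading straight at the obstacle at 1 m/s.
keeps_course = acceleration_in_nao((0.0, 0.0), (1.0, 0.0), (0.0, 0.0), static_obstacle, 0.5, 3.0)
swerves = acceleration_in_nao((0.0, 0.0), (1.0, 0.0), (0.0, 2.0), static_obstacle, 0.5, 3.0)
```

A reactive controller built on this idea would choose, among the accelerations that fail the test, the one closest to its preferred command.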

Abstract:
Learning fast and robust ball-kicking skills is a critical capability for humanoid soccer robots, yet it remains a challenging problem due to the need for rapid leg swings, postural stability on a single support foot, and robustness under noisy sensory input and external perturbations (e.g., opponents). This paper presents a reinforcement learning (RL)-based training pipeline that enables humanoid robots to execute robust continual ball-kicking with adaptability to different ball-goal configurations. The pipeline extends a typical teacher-student training framework--in which a teacher policy is trained with ground truth state information and the student learns to mimic it with noisy, imperfect sensing--by including four training stages: (1) long-distance ball chasing (teacher); (2) directional kicking (teacher); (3) teacher policy distillation (student); and (4) student adaptation and refinement (student). Key design elements--including tailored reward functions, realistic noise modeling, and online constrained RL for adaptation and refinement--are critical for closing the sim-to-real gap and sustaining performance under perceptual uncertainty. Extensive evaluations in both simulation and on a real robot demonstrate strong kicking accuracy and goal-scoring success across diverse ball-goal configurations. Ablation studies further highlight the necessity of the constrained RL, noise modeling, and the adaptation stage. This work presents a training pipeline for robust continual humanoid ball-kicking under imperfect perception, establishing a benchmark task for visuomotor skill learning in humanoid whole-body control.

Abstract:
Robotic manipulation in complex scenes demands precise perception of task-relevant details, yet fixed or suboptimal viewpoints often impair fine-grained perception and induce occlusions, constraining imitation-learned policies. We present AVR (Active Vision-driven Robotics), a bimanual teleoperation and learning framework that unifies head-tracked viewpoint control (HMD-to-2-DoF gimbal) with motorized optical zoom to keep targets centered at an appropriate scale during data collection and deployment. In simulation, an AVR plugin augments RoboTwin demonstrations by emulating active vision (ROI-conditioned viewpoint change, aspect-ratio-preserving crops with explicit zoom ratios, and super-resolution), yielding 5-17% gains in task success across diverse manipulations. On our real-world platform, AVR improves success on most tasks, with over 25% gains compared to the static-view baseline, and extended studies further demonstrate robustness under occlusion, clutter, and lighting disturbances, as well as generalization to unseen environments and objects. These results pave the way for future robotic precision manipulation methods in the pursuit of human-level dexterity and precision.

Abstract:
Collision avoidance is essential for robotic systems. This paper presents a method for designing directional projection control barrier functions (CBFs) based on differentiable optimization for second-order robotic systems. The approach reduces high-order CBFs to first-order ones and estimates collision risk by examining the intersection of projections along the relative velocity direction. Under the assumption that both the target and obstacles are convex polyhedra whose projections yield convex polygons, a tunable uniform scaling function, centered at the centroid, is introduced to pad the convex polygon. The strict convexity of this padded region is rigorously proven. Using the minimum scaling factor that leads to intersection between two projected convex polygons, a CBF is constructed and incorporated into a tracking controller to ensure collision avoidance. The effectiveness of the proposed method is validated through simulations with a 2D mobile robot and a 7-DOF Franka manipulator.

Abstract:
Hyperspectral imaging (HSI) captures detailed spectral signatures across hundreds of contiguous bands per pixel, being indispensable for remote sensing applications such as land-cover classification, change detection, and environmental monitoring. Due to the high dimensionality of HSI data and the slow rate of data transfer in satellite-based systems, compact and efficient models are required to support onboard processing and minimize the transmission of redundant or low-value data. To this end, we introduce a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight architectures for HSI analysis. CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, guided by a curriculum learning strategy that progressively increases data difficulty during self-supervision. This enables the encoder to jointly capture fine-grained spectral continuity, spatial structure, and global semantic features. Unlike prior dual-task SSL methods, CMTSSL simultaneously addresses spatial and spectral reasoning within a unified and computationally efficient design, being particularly suitable for training lightweight models for onboard satellite deployment. We validate our approach on four public benchmark datasets, demonstrating consistent gains in downstream segmentation tasks, using architectures that are over 16,000× lighter than some state-of-the-art models. These results highlight the potential of CMTSSL in generalizable representation learning with lightweight architectures for real-world HSI applications. Our code is publicly available at: https://github.com/hugocarlesso/CMTSSL.

Abstract:
Three-dimensional (3D) medical image analysis faces challenges such as massive data volume, difficulty in integrating cross-slice information, and limited model generalization. This paper proposes 3DME, a foundational model for 3D medical imaging. Its core innovations feature a dual-branch 3D encoder that integrates a Vision Transformer for modeling global long-range dependencies and a 3D graph convolutional network for capturing local voxel structures, enhanced by multi-level deformable attention for cross-planar correlation; a progressive volumetric masking strategy for self-supervised pretraining, which dynamically adjusts masking ratios and block sizes to force the model to learn cross-slice continuity and global semantics; and a unified foundation model framework supporting lightweight adaptation for downstream tasks. Experiments demonstrate that 3DME achieves state-of-the-art (SOTA) performance on 12 segmentation and classification tasks, exhibiting strong zero-shot transfer capabilities, thereby significantly enhancing model generalization and clinical deployment efficiency.

Abstract:
Robotic harvesting of cherry tomatoes remains challenging due to dense foliage, asynchronous ripening, and the strict market requirement for calyx-preserving cuts. The calyx frequently occludes the pedicel, making precise localization indispensable. In 640 × 480 images, pedicels span only 7-32 pixels, where even minor errors can lead to miscutting the calyx. To address this challenge, we apply YOLO-DSC to localize pedicels across dynamic frames as the arm-mounted camera moves during the best-view search. This strategy maximizes the visible pedicel length, exposing it perpendicularly to the camera and ensuring clear separation from the calyx, while null-data training suppresses false positives from distractors such as leaves, stems, and calyces. In 15 autonomous trials along a 28 m greenhouse row, YOLO-DSC achieved the lowest pedicel localization errors, outperforming the YOLO baseline model (significant under p < 0.05). This improvement directly translated into higher harvesting success, increasing from 47% with YOLO (including null-data training) to 73.3% with YOLO-DSC. These results demonstrate that integrating YOLO-DSC with best-view searching enhances recall and stability under dynamic viewpoints, enabling more reliable calyx-preserving harvesting in real greenhouse conditions.

Abstract:
The demand for personal mobility devices (PMDs) has increased, prompting studies on various control interfaces such as joysticks and handlebars. However, these interfaces remain external to the PMD, and no system has been developed in which the PMD itself functions as the interface. This study proposes an interface that measures and controls the internal air pressure of an inflatable PMD, enabling the system to recognize operator inputs, such as pressing, leaning, or pushing, directly from the PMD body. The system adjusts air pressure according to the operator's estimated weight, allowing operation with minimal force, while continuous speed control is realized through press, lean, and double-push inputs. Experimental results demonstrated that translational and angular speeds could be controlled through embodied body movements, and filter processing effectively mitigated the influence of air pressure fluctuations caused by uneven terrain, ensuring stable operation. Tests conducted on indoor and outdoor courses, including obstacles and uneven surfaces, showed operability comparable to a joystick, though narrower paths required more time to navigate. This study contributes a novel embodied air-pressure-based control paradigm that directly integrates the interface into the PMD itself.

Abstract:
Bio-inspired aquatic propulsion offers high thrust and maneuverability but is prone to destabilizing forces such as lift fluctuations, which are further amplified by six-degree-of-freedom (6-DoF) fluid coupling. We formulate quadrupedal swimming as a constrained optimization problem that maximizes forward thrust while minimizing destabilizing fluctuations. Our proposed framework, Accelerated Constrained Proximal Policy Optimization with a PID-regulated Lagrange multiplier (ACPPO-PID), enforces constraints with a PID‑regulated Lagrange multiplier, accelerates learning via conditional asymmetric clipping, and stabilizes updates through cycle‑wise geometric aggregation. Initialized with imitation learning and refined through on-hardware towing-tank experiments, ACPPO-PID produces control policies that transfer effectively to quadrupedal free-swimming trials. Results demonstrate improved thrust efficiency, reduced destabilizing forces, and faster convergence compared with state-of-the-art baselines, underscoring the importance of constraint-aware safe RL for robust and generalizable bio-inspired locomotion in complex fluid environments.
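
A PID-regulated Lagrange multiplier can be sketched generically: the constraint violation (measured cost minus its limit) drives proportional, integral, and derivative terms, and the resulting non-negative multiplier weights the cost penalty in the policy objective. The class below is a minimal illustration with hypothetical gains and names. It does not include the conditional asymmetric clipping or cycle-wise aggregation described in the abstract.

```python
class PIDLagrangeMultiplier:
    """Non-negative Lagrange multiplier driven by PID control on constraint violation."""

    def __init__(self, kp, ki, kd, cost_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = None

    def update(self, cost):
        error = cost - self.cost_limit
        # Integral term is clamped at zero so it cannot wind up negatively.
        self.integral = max(0.0, self.integral + error)
        # Derivative term only reacts to increasing cost.
        derivative = 0.0 if self.prev_cost is None else max(0.0, cost - self.prev_cost)
        self.prev_cost = cost
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)


# Hypothetical gains and limit; the cost could be a measured lift fluctuation.
controller = PIDLagrangeMultiplier(kp=1.0, ki=0.5, kd=0.0, cost_limit=1.0)
lam_violating = controller.update(3.0)
lam_satisfied = controller.update(0.0)
```

The policy update would then maximize something like reward minus `lam * cost`, so the penalty grows while the constraint is violated and relaxes once it is satisfied.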

Abstract:
An event camera is a vision sensor that captures pixel-level brightness changes and outputs this information as asynchronous events. These events are primarily generated from geometric structures such as edges, which are sensitive to variations in brightness. In this letter, we aim to leverage line structure information alongside point features to enhance the robustness and accuracy of localization in indoor or human-made environments. To obtain precise line measurements from events, we propose a novel line detection method that incorporates a coarse-to-fine motion compensation scheme, which generates highly sharp event frames. The extracted line features are paired with point features, eliminating the need for traditional line descriptors. Finally, the event features are effectively fused with frame-based point features within a multi-state constraint Kalman filter-based backend, fully exploiting the complementary advantages of both sensors. The performance of the proposed method is verified through an author-constructed experiment and two public datasets, demonstrating improved accuracy in line detection and pose estimation.

Abstract:
Autonomous robot navigation in complex environments requires robust perception as well as high-level scene understanding due to perceptual challenges, such as occlusions, and uncertainty introduced by robot movement. For example, a robot climbing a cluttered staircase can misinterpret clutter as a step, misrepresenting the state and compromising safety. This requires robust state estimation methods capable of inferring the underlying structure of the environment even from incomplete sensor data. In this paper, we introduce a novel method for robust state estimation of staircases. To address the challenge of perceiving occluded staircases extending beyond the robot's field-of-view, our approach combines an infinite-width staircase representation with a finite endpoint state to capture the overall staircase structure. This representation is integrated into a Bayesian inference framework to fuse noisy measurements, enabling accurate estimation of staircase location even with partial observations and occlusions. Additionally, we present a segmentation algorithm that works in conjunction with the staircase estimation pipeline to accurately identify clutter-free regions on a staircase. Our method is extensively evaluated on a real robot across diverse staircases, demonstrating significant improvements in estimation accuracy and segmentation performance compared to baseline approaches.

Abstract:
Sim-to-real transfer is a fundamental challenge in robot learning. Discrepancies between simulation and reality can significantly impair policy performance, especially when the policy receives high-dimensional inputs such as dense depth estimates from vision. We propose a novel depth transfer method based on domain adaptation to bridge the visual gap between simulated and real-world depth data. A Variational Autoencoder (VAE) is first trained to encode ground-truth depth images from simulation into a latent space, which serves as input to a reinforcement learning (RL) policy. During deployment, the encoder is refined to align stereo depth images with this latent space, enabling direct policy transfer without fine-tuning. We apply our method to the task of autonomous drone navigation through cluttered environments. Experiments in IsaacGym show that our method nearly doubles the obstacle avoidance success rate when switching from ground-truth to stereo depth input. Furthermore, we demonstrate successful transfer to the photo-realistic simulator AvoidBench using only IsaacGym-generated stereo data, achieving superior performance compared to state-of-the-art baselines. Real-world evaluations in both indoor and outdoor environments confirm the effectiveness of our approach, enabling robust and generalizable depth-based navigation across diverse domains.

Abstract:
Robots operating alongside people, particularly in sensitive scenarios such as aiding the elderly with daily tasks or collaborating with workers in manufacturing, must guarantee safety and cultivate user trust. Continuum soft manipulators promise safety through material compliance, but as designs evolve for greater precision, payload capacity, and speed, and increasingly incorporate rigid elements, their injury risk resurfaces. In this letter, we introduce a comprehensive High-Order Control Barrier Function (HOCBF) + High-Order Control Lyapunov Function (HOCLF) framework that enforces strict contact force limits across the entire soft-robot body during environmental interactions. Our approach combines a differentiable Piecewise Cosserat-Segment (PCS) dynamics model with a convex-polygon distance approximation metric, named Differentiable Conservative Separating Axis Theorem (DCSAT), based on the soft robot geometry to enable real-time, whole-body collision detection, resolution, and enforcement of the safety constraints. By embedding HOCBFs into our optimization routine, we guarantee safety, allowing, for instance, safe navigation in operational space under HOCLF-driven motion objectives. Extensive planar simulations demonstrate that our method maintains safety-bounded contacts while achieving precise shape and task-space regulation. This work thus lays a foundation for the deployment of soft robots in human-centric environments with provable safety and performance.

Abstract:
Reliably grasping unknown objects in logistics automation remains a major challenge. While most approaches rely on 3D CAD models or large-scale training, their applicability to novel items is limited. This paper proposes a plug-and-play geometric refinement module that can be appended to any existing grasp planner. The module operates in a training-free and mesh-free manner, estimating an object's approximate centroid from a single RGB-D image to enhance grasp stability. Its core mechanism involves using an initial grasp candidate as an automatic prompt for segmentation, followed by geometric primitive fitting to the isolated object's point cloud. By rescoring grasp candidates based on proximity to the estimated centroid, our module improves physical stability. Experimental results demonstrate that our module improves the success rate of baseline grasp planners by up to 25 percentage points, enhancing real-world pick-and-place performance without requiring any offline training or prior object models.
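The rescoring step described above can be illustrated with a short sketch. The abstract does not give the scoring formula, so the linear distance penalty, the `weight` parameter, and the function name below are assumptions chosen only to show the idea of favoring grasps near the estimated centroid.

```python
import math


def rescore_grasps(grasps, centroid, weight=1.0):
    """Re-rank grasp candidates by proximity to an estimated centroid.

    grasps:   list of (position_xyz, planner_score) tuples
    centroid: estimated object centroid as an (x, y, z) tuple
    weight:   hypothetical penalty per unit of distance (an assumption)
    """
    rescored = []
    for position, score in grasps:
        distance = math.dist(position, centroid)
        # Assumed rule: subtract a distance-proportional penalty.
        rescored.append((position, score - weight * distance))
    # Highest adjusted score first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


candidates = [((0.1, 0.0, 0.0), 0.90), ((0.0, 0.0, 0.0), 0.85)]
ranked = rescore_grasps(candidates, centroid=(0.0, 0.0, 0.0))
best_position = ranked[0][0]
```

Here the candidate with the lower planner score wins after rescoring because it sits on the centroid, while the other is penalized for being 0.1 units away.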

Abstract:
Uncertainties in dynamic road environments pose significant challenges for behavior and trajectory planning in autonomous driving. This paper introduces Hi-Drive, a hierarchical planning algorithm addressing uncertainties at both behavior and trajectory levels using a hierarchical Partially Observable Markov Decision Process (POMDP) formulation. Hi-Drive employs driver models to represent uncertain behavioral intentions of other vehicles and uses their parameters to infer hidden driving styles. By treating driver models as high-level decision-making actions, our approach effectively manages the exponential complexity inherent in POMDPs. To further enhance safety and robustness, Hi-Drive integrates a trajectory optimization based on importance sampling, refining trajectories using a comprehensive analysis of critical agents. Evaluations on real-world urban driving datasets demonstrate that Hi-Drive significantly outperforms state-of-the-art planning-based and learning-based methods across diverse urban driving situations.

Abstract:
This study presents a model predictive path integral (MPPI) method capable of conducting high-frequency real-time model predictive control (MPC) for robot manipulators. Real-time MPC-based manipulation holds significant potential for controlling an end-effector precisely and reactively while satisfying various constraints in dynamic environments. However, the optimization under a complex robot model and various constraints imposes a heavy computational burden, hindering the realization of high-frequency updates. To address this challenge, we propose a single-instance sampling-based MPPI algorithm and dynamic time horizon to significantly reduce the computational burden while enhancing control performance. The performance and efficacy of the proposed method are verified through experiments conducted on a 7-degree-of-freedom robotic arm, along with comparative simulations and analysis.
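For readers unfamiliar with MPPI, the core update of the standard algorithm can be sketched as below. This shows the textbook multi-sample weighting step, not the single-instance variant or the dynamic time horizon proposed in the abstract; the function name, the scalar control per time step, and the `temperature` parameter are assumptions for illustration.

```python
import math


def mppi_update(nominal, noises, costs, temperature=1.0):
    """One standard MPPI update of a scalar control sequence.

    nominal:     list of controls, one per time step
    noises:      K perturbation sequences, each the same length as nominal
    costs:       K rollout costs, one per perturbed sequence
    temperature: softmin temperature (hypothetical default)
    """
    # Subtracting the minimum cost keeps the exponentials numerically stable.
    best = min(costs)
    weights = [math.exp(-(cost - best) / temperature) for cost in costs]
    total = sum(weights)
    weights = [w / total for w in weights]
    # The new control is the nominal plus the cost-weighted mean perturbation.
    updated = [
        u + sum(w * eps[t] for w, eps in zip(weights, noises))
        for t, u in enumerate(nominal)
    ]
    return updated, weights


updated, weights = mppi_update(
    nominal=[0.0, 0.0],
    noises=[[1.0, 1.0], [-1.0, -1.0]],
    costs=[0.0, 1000.0],
)
```

With one rollout far cheaper than the other, nearly all weight goes to the cheap perturbation, so the updated sequence moves toward it.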

Abstract:
Autonomous exploration of unknown 3D environments requires motion planners that can efficiently identify informative regions to explore while continuously adapting to the evolving map of the environment. While existing sampling-based methods have demonstrated strong real-time performance, they often ignore the robot's kinodynamic model and constraints. Consequently, they generate only target positions, neglecting kinodynamic considerations in the next-best-view decision process. This results in frequent slowdowns and abrupt maneuvers, reducing coverage speed and exploration efficiency. In this work, we propose a kinodynamic motion planning framework designed for fast and efficient exploration of unknown environments. By incorporating the robot's kinodynamic model and constraints into a kinodynamic RRT, our approach bridges the gap between dynamically feasible motion and effective viewpoint selection, producing smoother and faster trajectories that improve exploration performance. Additionally, we present an Iterative Minimum Gain (IMG) approach to improve global coverage, and a novel informed yaw optimization method that accelerates optimal yaw selection, in the best case reaching more than twice the speed of state-of-the-art methods. We validate our framework through extensive simulation and real-world experiments, demonstrating improved exploration rates, higher average velocities, and better global coverage over existing methods.

Abstract:
In the field of safe navigation for mobile robots, control barrier functions (CBFs) have garnered significant attention due to their ability to transform complex safety constraints into real-time solvable optimization problems. In this letter, we propose a novel Lyapunov-based CBF framework. It offers the following key advantages: (1) Using a single Control Lyapunov Function (CLF), this method synthesizes spatially shifted CBFs to construct an expansive safe invariant set in obstacle-dense environments. (2) The framework is capable of incorporating existing approaches for constructing quadratic CLFs, making it applicable to a wide range of complex nonlinear systems and enhancing its generality and extensibility. (3) It enables real-time synthesis of CBFs, and ensures safety in large-scale 3D environments through efficient CBF-based quadratic programming (CBF-QP). (4) The method ensures safety while inheriting the stability properties of the CLF, allowing the asymptotic convergence of the system state to equilibrium, thus unifying safety and motion stability. To validate efficacy, we rigorously tested the framework in both simulations and hardware experiments.
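As background for the CBF-QP mentioned above, the sketch below solves the generic single-constraint case in closed form. It is not the Lyapunov-based synthesis proposed in the abstract: it assumes single-integrator dynamics (state derivative equals the control), one circular obstacle, the barrier h(x) = ||x - c||^2 - r^2, and a linear class-K function with gain `alpha`. All names are hypothetical.

```python
def cbf_safety_filter(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that grad_h . u + alpha * h >= 0.

    Assumes x_dot = u and h(x) = ||x - center||^2 - radius^2.
    """
    diff = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in diff) - radius ** 2
    a = [2.0 * d for d in diff]  # gradient of h with respect to x
    b = -alpha * h               # the constraint reads a . u >= b
    a_dot_u = sum(ai * ui for ai, ui in zip(a, u_nom))
    if a_dot_u >= b:
        return list(u_nom)       # nominal control is already safe
    a_sq = sum(ai * ai for ai in a)
    if a_sq == 0.0:
        raise ValueError("state coincides with the obstacle center")
    # Closed-form QP solution: project u_nom onto the constraint boundary.
    scale = (b - a_dot_u) / a_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


safe_u = cbf_safety_filter(
    x=(2.0, 0.0), u_nom=(-10.0, 0.0), center=(0.0, 0.0), radius=1.0
)
```

With several obstacles the problem has multiple constraints and generally needs a QP solver rather than this one-line projection.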

Abstract:
Motivated by the problem of pursuit-evasion, we present a motion planning framework that combines energy-based diffusion models with artificial potential fields for robust real-time trajectory generation in complex environments. Our approach processes obstacle information directly from point clouds, enabling efficient planning without requiring complete geometric representations. The framework employs classifier-free guidance training and integrates local potential fields during sampling to enhance obstacle avoidance. In dynamic scenarios, the system generates initial trajectories using the diffusion model and continuously refines them through potential field-based adaptation, demonstrating effective performance in pursuit-evasion scenarios with partial pursuer observability.

Abstract:
In this study, we address the problem of open-vocabulary mobile manipulation, where a robot is required to carry a wide range of objects to receptacles based on free-form natural language instructions. This task is challenging, as it involves understanding visual semantics and the affordance of manipulation actions. To tackle these challenges, we propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, allowing the robot to identify manipulation options that are likely to be executable in real-world environments. Our method outperformed existing approaches in retrieval performance for mobile manipulation instruction in large-scale indoor environments. Furthermore, in real-world experiments where the robot performed mobile manipulation in indoor environments based on free-form instructions, the proposed method achieved a task success rate of 85%, outperforming existing methods in both retrieval performance and overall task success.

Abstract:
Accurate estimation of tumor boundaries is critical for ensuring adequate surgical margins in robot-assisted minimally invasive surgery (RMIS). In this study, we present a method that estimates tumor boundaries in RMIS using sweeping palpation data acquired with a single force/torque (F/T) sensor. From the reconstructed surface, tissue displacement and normal force were derived to calculate stiffness, which was then used to construct a stiffness map. To reduce noise and enhance feature representations, we employed a sparse autoencoder (SAE). The SAE outputs were subsequently clustered with a Gaussian mixture model (GMM) and K-means to segment the tumor from normal tissue. Experiments with phantom models and an ex vivo model demonstrated that the SAE-based approach significantly improved the Dice similarity coefficient (DSC) and sensitivity while maintaining specificity, and reduced the Hausdorff distance (HD and HD95) and average symmetric surface distance (ASSD), compared with results from raw data. Importantly, when evaluated under clinically relevant surgical margin conditions, the estimated HD consistently remained below threshold across all models. These results indicate that the proposed method achieves both high accuracy and clinical feasibility without additional imaging devices or displacement sensors, highlighting its potential to support margin minimization and organ function preservation in RMIS.

Abstract:
Multiple Peg-in-Hole (MPiH) assembly is one of the fundamental tasks in robotic assembly. In MPiH tasks for large-size parts, it is challenging for a single manipulator to simultaneously align multiple distant pegs and holes, necessitating tightly coupled multi-manipulator systems. For such MPiH tasks using tightly coupled multiple manipulators, we propose a collaborative visual servo control framework that uses only the monocular in-hand cameras of each manipulator to reduce positioning errors. Initially, we train a state classification neural network and a positioning neural network. The former divides the states of the peg and hole in the image into three categories: obscured, separated, and overlapped, while the latter determines the position of the peg and hole in the image. Based on these network outputs, we propose a method to integrate the visual features of multiple manipulators using virtual forces, which can naturally combine with the cooperative controller of the multi-manipulator system. To generalize our approach to holes of different appearances, we varied the appearance of the holes during the dataset generation process. The results confirm that by considering the appearance of the holes, classification accuracy and positioning precision can be improved. Finally, the results show that our method achieves a success rate close to 100% in dual-manipulator dual peg-in-hole tasks with a clearance of 0.2 mm, while remaining robust to camera calibration errors.

Abstract:
Weld bead grinding is a critical post-processing step in metal fabrication, yet conventional robotic grinding based on teach-pendant programming lacks adaptability to variations in bead geometry and position. This paper presents a vision-guided robotic grinding system that combines deep learning-based weld bead segmentation, automated grinding path generation, and digital twin-based pre-verification. A U-Net model with a ResNet34 encoder and ImageNet pre-training segments weld bead regions from RGB images captured by an Intel RealSense D415 camera mounted on a Staubli RX160 manipulator, achieving a mean Intersection over Union (IoU) of 0.9311 and a Dice coefficient of 0.9641. The segmented bead contours are transformed into the robot coordinate frame through hand-eye calibration and forward kinematics, enabling automated generation of grinding waypoints along the bead centerline. The CHOMP algorithm plans collision-free trajectories within MoveIt, and all planned motions are validated in a digital twin environment built on NVIDIA Isaac Sim 5.0, integrated with ROS through a distributed multi-container architecture. Experimental results demonstrate that the proposed system effectively generates adaptive grinding paths for varying weld bead geometries and verifies them in simulation before physical deployment.

Abstract:
This paper proposes a transformer-based single-stream model with CRF-ontology decoding for stable worker intention recognition in human-robot collaboration (HRC). Although existing intention recognition methods achieve high accuracy, they often suffer from temporal prediction instability and logically inconsistent combinations among actions, tools, parts, and intentions. To address these issues, the proposed approach employs a transformer encoder to integrate worker actions and part-related information, thereby capturing the task context and jointly predicting actions, tools, parts, and intentions. For intention prediction, a conditional random field (CRF) is applied to enforce temporal consistency and improve prediction stability. In addition, an ontology-based postprocessing step removes infeasible combinations under a given task intention and reselects predictions that satisfy structural constraints. Experimental results show that the CRF reduces the intention change rate from 7.9% to 3.0%, improving temporal stability, while ontology-based decoding decreases the violation rate from 26.5% to 6.9% by eliminating inconsistent predictions. When combined, the proposed method achieves both a low change rate (3.0%) and a low violation rate (3.7%), demonstrating its effectiveness for reliable intention recognition in HRC.
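The stabilizing effect of CRF decoding on a label sequence can be illustrated with a small Viterbi sketch. It uses a single hand-set switching penalty in place of learned CRF transition scores, which is a simplifying assumption, and the `change_rate` helper is a plausible reading of the metric reported above rather than the authors' exact definition.

```python
def viterbi_smooth(frame_scores, switch_penalty):
    """Best label path under per-frame scores minus a penalty per label switch."""
    n_labels = len(frame_scores[0])
    best = list(frame_scores[0])
    back = []
    for scores in frame_scores[1:]:
        new_best, pointers = [], []
        for j in range(n_labels):
            # Score of arriving at label j from each previous label i.
            candidates = [
                best[i] - (switch_penalty if i != j else 0.0)
                for i in range(n_labels)
            ]
            i_star = max(range(n_labels), key=lambda i: candidates[i])
            new_best.append(candidates[i_star] + scores[j])
            pointers.append(i_star)
        best = new_best
        back.append(pointers)
    # Backtrack from the best final label.
    label = max(range(n_labels), key=lambda j: best[j])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    return path[::-1]


def change_rate(path):
    """Fraction of consecutive frame pairs whose labels differ."""
    changes = sum(1 for a, b in zip(path, path[1:]) if a != b)
    return changes / (len(path) - 1)


scores = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.6], [1.0, 0.0], [1.0, 0.0]]
raw_path = viterbi_smooth(scores, switch_penalty=0.0)     # frame-wise argmax
smooth_path = viterbi_smooth(scores, switch_penalty=1.0)  # flicker suppressed
```

With no penalty the decoder follows the one-frame flicker in the scores; with a penalty of 1.0 the isolated switch is no longer worth its cost and the path stays constant.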

Abstract:
Intention recognition is essential for wearable robotics and assistive systems. However, conventional approaches often suffer from cumbersome sensor setups or sensitivity to external disturbances. To address these limitations, this study proposes an LSTM-based intention recognition method using lower-limb Muscle-Volume (MV) sensors. An insole-type pressure sensor, an IMU sensor, and a cuff-type MV sensor were used to record a series of motions, including sitting, standing, walking, and running. Deep learning techniques were then applied for classification and transition detection. Accuracies of the predicted movement states based on data from the IMU, insole-type pressure, and cuff-type MV sensors were 93.04%, 97.65%, and 93.08%, respectively. The average transition detection latencies for the IMU, insole, and MV sensor model were 0.135 s, 0.377 s, and 0.455 s, respectively. Results show that the proposed MV sensor achieves performance comparable to insole pressure sensors, demonstrating its potential as a practical and robust alternative for intention recognition in wearable systems.

Abstract:
As manufacturing capabilities advance to greater autonomy, interest is increasingly directed toward versatile agents capable of performing complex tasks. Recently, learning-based approaches have shown more rapid progress compared to classical methods. While these advancements are enabled by the offline setting of Imitation Learning (IL), transfer to pure online exploration Reinforcement Learning (RL) remains less explored. This work experiments with a simple extension to the standard Markovian MLP policy by explicitly encoding a history of states using a tiny transformer model.

Abstract:
Robot manipulation tasks involve direct interactions with objects, which can be viewed as dynamic changes to the robot's kinematic chain. Morphology-aware learning frameworks, in which robot embodiment is explicitly modeled, do not account for these object-induced changes in their architectures. We address this gap by proposing ManiMorph, a multi-task, morphology-aware manipulation-learning framework in which object features are integrated into the robot's morphological graph. We demonstrate that this node-centric representation, combined with a Feature-wise Linear Modulation (FiLM) task component, enhances the performance of the morphology-aware frameworks for robotic manipulation and generalizes effectively to new object variations.

Abstract:
Autonomous racing pushes vehicles to their physical limits, requiring control policies that can rapidly adapt to localized changes in track conditions, such as varying surface friction. Current Reinforcement Learning (RL) approaches rely either on ground-truth system identification, which is impractical in the real world, or short-horizon reactive adaptations (e.g., Rapid Motor Adaptation (RMA)) that cannot remember spatial disturbances across multiple laps. In this extended abstract, we propose a novel RL architecture based on Mamba, a structured State Space Model (SSM), for autonomous racing. By fusing vehicle state with Fourier features of vehicle position on the racetrack, our Mamba-based policy builds a long-horizon episodic memory. This allows the policy not only to adapt to unknown friction online but also to map and memorize slippery zones for future laps. Evaluated in a simulated F1Tenth environment, our approach demonstrates continuous lap-to-lap improvement, approaching the performance of an oracle policy trained on exact ground-truth friction, whereas standard Multi-Layer Perceptron (MLP) and Recurrent Neural Network (RNN) baselines plateau at inferior performance levels.
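The Fourier position encoding mentioned above can be sketched as follows. The abstract does not specify how position is parameterized, so this example assumes a scalar track progress normalized to one lap and octave-spaced frequencies; both choices, and the function name, are illustrative assumptions.

```python
import math


def fourier_features(progress, num_frequencies=4):
    """Encode normalized track progress as sin/cos pairs.

    progress:        position along the track, where 1.0 equals one full lap
    num_frequencies: number of octave-spaced frequencies (an assumption)
    """
    features = []
    for k in range(num_frequencies):
        angle = 2.0 * math.pi * (2 ** k) * progress
        features.extend([math.sin(angle), math.cos(angle)])
    return features


start_of_lap = fourier_features(0.0)
end_of_lap = fourier_features(1.0)
```

Because every frequency is an integer multiple of one lap, the encoding is periodic: the same physical location yields approximately the same features on every lap, which is what lets a memory-based policy associate a slippery zone with a place on the track.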

Abstract:
Surgical automation is being increasingly studied, yet bridging visual scene understanding with autonomous action planning remains a fundamental challenge. While much research effort has been made on scene perception (e.g., tool recognition and scene segmentation), understanding and predicting actionable possibilities for surgical automation is still underexplored. In this paper, we introduce surgical affordance prediction, which identifies actionable regions for fundamental surgical actions from visual data. Specifically, a novel adaptive feature fusion framework is proposed that leverages the complementary strengths of a self-supervised vision transformer encoder for its superior semantic understanding and a large-scale generative model encoder for its spatially-aware capability. Furthermore, we introduce a hierarchical prompt learning mechanism to adapt to varying procedural contexts. Finally, a scene-guided attention decoder is proposed to focus on critical surgical areas while suppressing background distractions. To validate the effectiveness, we established a new dataset, derived from publicly available surgical datasets with affordance annotations for three basic surgical actions: aspiration, clipping, and retraction. Extensive experiments demonstrate that our approach achieves state-of-the-art performance. Moreover, we validate our framework's applicability for downstream automation on a realistic lung and prostate phantom, and results show that the predicted affordance maps successfully enable autonomous surgical actions.

Abstract:
Traditional human activity recognition uses either direct image analysis or data from wearable inertial measurement units (IMUs), but can be ineffective in challenging underwater environments. We introduce a novel hybrid approach that bridges this gap to monitor scuba diver safety. Our method leverages computer vision to generate high-fidelity motion data, effectively creating a pseudo-IMU from a stream of 3D human joint keypoints. This technique circumvents the critical problem of wireless signal attenuation in water, which plagues conventional diver-worn sensors communicating with an autonomous underwater vehicle (AUV). We apply this system to the vital task of identifying anomalous scuba diver behavior that signals the onset of a medical emergency such as cardiac arrest, a leading cause of scuba diving fatalities. By integrating our classifier onboard an AUV and conducting experiments with simulated distress scenarios, we demonstrate the utility and effectiveness of our method for advancing robotic monitoring and diver safety.
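One plausible way to derive IMU-like signals from tracked keypoints is numerical differentiation of joint positions. The sketch below computes linear acceleration with a central second difference; the abstract does not state how the pseudo-IMU is actually built, so this is an assumed approach with a hypothetical function name, and it ignores gravity, orientation, and noise filtering.

```python
def pseudo_accelerations(positions, dt):
    """Estimate linear acceleration of one joint from sampled 3D positions.

    positions: list of (x, y, z) tuples sampled at a fixed interval dt
    returns:   one acceleration tuple per interior sample
    """
    accelerations = []
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        # Central second difference: (p[t+1] - 2 p[t] + p[t-1]) / dt^2
        accelerations.append(
            tuple((n - 2.0 * c + p) / (dt * dt) for p, c, n in zip(prev, cur, nxt))
        )
    return accelerations


# A joint moving as x = t^2 along one axis has constant acceleration 2.
dt = 0.1
track = [((i * dt) ** 2, 0.0, 0.0) for i in range(4)]
accels = pseudo_accelerations(track, dt)
```

Second differences amplify keypoint jitter, so a practical pipeline would likely smooth the positions before differentiating.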

Abstract:
Accurate 3D volumetric mapping is critical for autonomous underwater vehicles operating in obstacle-rich environments. Vision-based perception provides high-resolution data but fails in turbid conditions, while sonar is robust to lighting and turbidity but suffers from low resolution and elevation ambiguity. This paper presents a volumetric mapping framework that fuses a stereo sonar pair with a monocular camera to enable safe navigation under varying visibility conditions. Overlapping sonar fields of view resolve elevation ambiguity, producing fully defined 3D point clouds at each time step. The framework identifies regions of interest in camera images, associates them with corresponding sonar returns, and combines sonar range with camera-derived elevation cues to generate additional 3D points. Each 3D point is assigned a confidence value reflecting its reliability. These confidence-weighted points are fused using a Gaussian Process Volumetric Mapping framework that prioritizes the most reliable measurements. Experimental comparisons with other opti-acoustic and sonar-based approaches, along with field tests in a marina environment, demonstrate the method's effectiveness in capturing complex geometries and preserving critical information for robot navigation in both clear and turbid conditions. Our code will be released as open-source to support community adoption.

Abstract:
This work addresses the problem of underwater propeller cleaning in environments containing obstacles using an Underwater Vehicle Manipulator System (UVMS). Prior propeller-cleaning approaches plan coverage tool paths without explicitly considering the connectivity of the associated Surface-Constrained Configuration Space (SCCS), leading to unnecessary lift-offs in obstacle-occluded settings. In contrast, we formulate the coverage problem in the disconnected SCCS as a Generalized Traveling Salesman Problem (GTSP) within a hierarchical framework, accounting for obstacles and attempting to minimize the number of tool lift-offs. We consider explicitly the tool lift-off paths in the GTSP cost formulation, utilizing the hierarchical framework to guide the search without exhaustively evaluating all possible paths. To achieve smoother tool paths with fewer turns, we introduce a cost that promotes alignment with desired coverage curves. Finally, we time-parameterize the coverage path into a whole-body UVMS trajectory by minimizing the duration of the cleaning task, while respecting the robot hardware limitations. The effectiveness of the proposed method is demonstrated in a realistic simulation scenario.

Abstract:
Surface-mobile platforms have explored the moon and the red planet for nearly half a century, providing a wealth of scientific data. However, surface mobility on planetary bodies remains a challenging task. In this paper, the formulation of reaction force by a grouser with a generalized geometry for a wheel of a planetary rover is presented, along with its verification through comparisons with the results by the conventional geometry. In a simulation study, the resistive force theory is applied to a general grouser geometry model. The study determines the impact of several parameters, particularly the grouser inclination, on draw-bar pull. The results obtained from the study suggest the formulation of a design for the grouser that is nearly optimal in its capacity to maximize the draw-bar pull per sinkage. We also apply the proposed geometry to the wheel on LEV-1, demonstrating that it works well in actual lunar operations.

Abstract:
Constructing 3D representations of object geometry is critical for many robotics tasks, particularly manipulation problems. These representations must be built from potentially noisy partial observations. In this work, we focus on the problem of reconstructing a multi-object scene from a single RGBD image using a fixed camera. Traditional scene representation methods generally cannot infer the geometry of unobserved regions of the objects in the image. Attempts have been made to leverage deep learning to train on a dataset of known objects and representations, and then generalize to new observations. However, this can be brittle to noisy real-world observations and objects not contained in the dataset, and does not provide well-calibrated reconstruction confidences. We propose BRRP, a reconstruction method that leverages preexisting mesh datasets to build an informative prior during robust probabilistic reconstruction. We introduce the concept of a retrieval-augmented prior, where we retrieve relevant components of our prior distribution from a database of objects during inference. The resulting prior enables estimation of the geometry of occluded portions of the in-scene objects. Our method produces a distribution over object shape that can be used for reconstruction and measuring uncertainty. We evaluate our method in both simulated scenes and in the real world. We demonstrate the robustness of our method against deep learning-only approaches while being more accurate than a method without an informative prior. Through real-world experiments, we particularly highlight the capability of BRRP to enable successful dexterous manipulation in clutter.

Abstract:
Recent advances in soft robotics, wearable devices, and deployable systems have sparked tremendous interest in origami structures due to their controllable volume changes and shape-morphing capabilities. Despite significant progress in the design and fabrication of origami using traditional materials such as paper, textiles, thermoplastics, and thick panels, challenges persist in creating soft elastomeric origami designs that allow for precise, programmable deformations. This work proposes an architected approach for designing and 3D printing Room Temperature Vulcanization (RTV) silicone-based origami structures actuated by negative pressure. Central to this approach is a flexible hinge design, which enables controlled bending angles ranging from 45° to 90° upon the application of vacuum actuation. This architected method simplifies the complex folding of origami structures by strategically arranging the flexible hinges. A Python-based tool was developed to generate G-code directly from user-defined design parameters, streamlining the design-to-fabrication pipeline for Direct Ink Writing (DIW) RTV silicone-based origami parts. Initial fabrication experiments were conducted using a three-step print-assemble-bond approach. As an alternative that eliminates the manual processing steps, a monolithic flexible hinge with a cavity was printed within a gel support. This paper introduces a hinge design library and discusses the design-to-fabrication workflow for two origami-inspired active structures.

Abstract:
Growing labor shortages are increasing the demand for domestic service robots (DSRs) to assist in various settings. In this study, we develop a DSR that transports everyday objects to specified pieces of furniture based on open-vocabulary instructions. Our approach focuses on retrieving images of target objects and receptacles from pre-collected images of indoor environments. For example, given an instruction "Please get the right red towel hanging on the metal towel rack and put it in the white washing machine on the left," the DSR is expected to carry the red towel to the washing machine based on the retrieved images. This is challenging because the correct images should be retrieved from thousands of collected images, which may include many images of similar towels and appliances. To address this, we propose RelaX-Former, which learns diverse and robust representations from among positive, unlabeled positive, and negative samples. We evaluated RelaX-Former on a dataset containing real-world indoor images and human-annotated instructions including complex referring expressions. The experimental results demonstrate that RelaX-Former outperformed existing baseline models across standard image retrieval metrics. Moreover, we performed physical experiments using a DSR to evaluate the performance of our approach in a zero-shot transfer setting. The experiments required the DSR to carry objects to specific receptacles based on open-vocabulary instructions, and it achieved an overall success rate of 75%.

Abstract:
Quadruped robots used for rescue and exploration are susceptible to various leg failures, where unpredictable joint locking or power loss can pose an immediate risk of falling. Traditional controllers lack fault-tolerant control capabilities in the case of multi-joint concurrent faults, and erroneous controller outputs may lead to robot damage. This paper proposes a model-free reinforcement learning framework based on central pattern generators (CPG) for fault-tolerant control (FT-CPG). The framework uses biomimetic gait generation and section-wise training to address various types of multi-joint concurrent faults. FT-CPG adopts a fault-tolerant CPG module to generate safe gaits, while utilizing neural network-based policies to infer failures and coordinate the rhythmic behaviors of the CPG, ensuring the ability to track velocity commands under fault conditions. Experiments show that FT-CPG is robust in unexpected situations, where a single leg experiences failures across any number of joints, with each joint randomly encountering locking or power loss faults. Furthermore, the proposed framework preserves the robot's omnidirectional mobility. Finally, zero-shot sim-to-real transfer was successfully implemented on the real-world Unitree Go1 robot, effectively addressing various multi-joint leg failures.
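
CPG modules are commonly built from nonlinear oscillators whose amplitude and frequency can be modulated by a higher-level policy. The abstract does not state which oscillator FT-CPG uses, so the following is only a minimal sketch of one typical choice, a Hopf oscillator integrated with forward Euler. The function names `hopf_step` and `simulate`, and all parameter values, are assumptions made for illustration.

```python
import math

def hopf_step(x, y, mu=1.0, omega=2.0 * math.pi, alpha=10.0, dt=0.001):
    """One forward-Euler step of a Hopf oscillator (hypothetical parameters).

    The state converges to a limit cycle of radius sqrt(mu) and rotates
    at angular frequency omega; alpha sets the convergence speed.
    """
    r2 = x * x + y * y
    dx = alpha * (mu - r2) * x - omega * y
    dy = alpha * (mu - r2) * y + omega * x
    return x + dt * dx, y + dt * dy

def simulate(steps=5000, x0=0.1, y0=0.0, **params):
    """Integrate the oscillator for a number of steps from (x0, y0)."""
    x, y = x0, y0
    for _ in range(steps):
        x, y = hopf_step(x, y, **params)
    return x, y

if __name__ == "__main__":
    x, y = simulate()
    print("final amplitude:", math.hypot(x, y))
```

In a scheme of this kind, a learned policy could lower `mu` or change `omega` for a leg with a faulty joint; that coupling is not shown here and is not claimed to match the paper.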

Abstract:
A control method is proposed for a tensegrity robot to generate legged-rolling locomotion (i.e., rolling movement produced by a legged system). The robot has a minimal structure composed only of two rods and four elastic cables. The difficulty of the control arises from the minimalistic structure that makes the system underactuated. Our control strategy is divided into two phases: 1) overcoming the gravitational potential energy and 2) adjusting the robot's posture to prepare for the landing. Numerical simulations demonstrated that the system was capable of traversing complex terrains with two types of gaits, i.e., quasi-static and dynamic. The proposed structure also enabled the robot to autonomously recover from arbitrary stationary states and initiate legged-rolling locomotion. Physical experiments validated the applicability of the tensegrity robot to various terrains such as uphill and stair climbing and showed its capability of overcoming discrete steps up to 20 % of the robot's frame length.

Abstract:
LiDAR odometry is a fundamental technology for autonomous navigation. However, existing LiDAR-based odometry methods typically demand extensive manual parameter tuning and remain prone to instability when deployed across varying LiDAR types and environments. This letter focuses on the essence of point clouds and introduces a fast, highly adaptable, and robust LiDAR odometry framework named Onion-LO. Onion-LO demonstrates strong compatibility with various LiDAR types and reliable operation across diverse scenarios. This is facilitated by an onion-like point cloud processing structure termed Onion Ball. The Onion Ball supports multi-threaded implementation, efficiently executing point cloud distribution analysis, segmentation, and downsampling. In addition, we design an adaptive optimization strategy for local map management and iterative optimization, which effectively enhances the system's robustness and accuracy. Extensive experiments on five datasets demonstrate that Onion-LO outperforms existing state-of-the-art methods regarding localization accuracy and robustness. Additional evaluations across 11 LiDAR sensors and 8 diverse scenarios further confirm its strong generalization capability. Our method is designed for practical deployment and supports real-time operation on onboard processors. We open-source the code on https://anonymous.4open.science/r/Onion-LO.

Abstract:
Accurate three-dimensional (3D) localization is critical for robust human-robot collaboration (HRC) in dynamic indoor environments. However, high-precision localization in complex scenarios still faces challenges such as multipath effects and field-of-view occlusion. To address these limitations, we propose Geo-LSTM, a geometry-constrained long short-term memory (LSTM) framework that integrates ultra-wideband (UWB) sensors, an inertial measurement unit (IMU), and barometric pressure (BMP) sensors. First, a Simplified Geometric Localization (SGL) algorithm is proposed, which uses dual BMP sensors and an IMU to obtain precise height information and utilizes the geometric relationships between the UWB tag and anchors to compute an initial location estimate, serving as a priori input for the Geo-LSTM network. The Geo-LSTM algorithm then incorporates multi-source geometric information to extract time-series features from the UWB ranging data and the tag's a priori location, further enhancing 3D localization accuracy. Experimental results from cluttered indoor environments, including real-world HRC tasks with occlusions, show that the Geo-LSTM algorithm achieves an average 3D localization root mean square error (RMSE) of 0.103 m, representing improvements of 38.60% and 31.20% over the weighted least squares (WLS) method and the range-based LSTM algorithm, respectively. These results demonstrate Geo-LSTM's potential for reliable multi-sensor 3D localization in HRC applications.

Abstract:
Skill discovery methods enable agents to tackle intricate tasks by acquiring diverse and useful skills from task-agnostic datasets in an unsupervised manner. To apply these methods to more general and everyday tasks, the skill set must be scalable. However, current approaches struggle with this scalability, often facing the challenge of catastrophic forgetting when learning new skills. To address this limitation, we propose a scalable skill discovery algorithm, a playbook, which can accommodate unseen tasks by acquiring new skills while maintaining previously learned ones. The scalable structure of the playbook, consisting of finite and independent plays and primitives, enables expansion by adding new elements to accommodate new tasks. The proposed method is evaluated on complex robotic manipulation benchmarks, and the results show that the playbook outperforms existing state-of-the-art methods.

Abstract:
The increasing labor shortage and aging population underline the need for assistive robots to support human care recipients. To enable safe and responsive assistance, robots require accurate human motion prediction in physical interaction scenarios. However, this remains a challenging task due to the variability of assistive settings and the complexity of coupled dynamics in physical interactions. In this work, we address these challenges through two key contributions: (1) HHI-Assist, a dataset comprising motion capture clips of human-human interactions in assistive tasks; and (2) a conditional Transformer-based denoising diffusion model for predicting the poses of interacting agents. Our model effectively captures the coupled dynamics between caregivers and care receivers, demonstrating improvements over baselines and strong generalization to unseen scenarios. By advancing interaction-aware motion prediction and introducing a new dataset, our work has the potential to significantly enhance robotic assistance policies. The dataset and code are available at https://sites.google.com/view/hhi-assist/home .

Abstract:
Aerial robots are transitioning from traditional surveillance and monitoring roles to more advanced tasks involving physical interaction. Despite this progress, physical Human-Aerial Robot Interaction remains largely underexplored due to the complexity and stability-related issues of such platforms. This paper introduces a novel control framework for letting an aerial platform transport an object with a human operator cooperatively. The control approach is built on a nonlinear model predictive control (NMPC), integrating the dynamic models of humans, aerial robots, and transported objects. To ensure safe and robust physical interaction, the NMPC is combined with a compliant controller. Additionally, our controller prioritizes forward motion over lateral movements to accommodate the human's natural direction of motion. We validate this framework through indoor flight experiments, demonstrating how a human operator and a fully actuated hexarotor can effectively collaborate to transport a bar. The results highlight the aerial robot's ability to assist the human during physical transportation tasks, enhancing efficiency and comfort.

Abstract:
Studying tissue samples obtained during autopsies is the gold standard for diagnosing the cause of death and for understanding disease pathophysiology. Recently, interest has grown in post mortem minimally invasive biopsies, which are less destructive than an open autopsy and reduce the risk of infection. While manual biopsies under ultrasound guidance are more widely performed, robotic post mortem biopsies have recently been proposed. This approach can further reduce the risk of infection for physicians. However, planning of the procedure and control of the robot need to be efficient and usable. We explore a virtual reality setup with a digital twin to realize fully remote planning and control of robotic post mortem biopsies. The setup is evaluated with forensic pathologists in a usability study for three interaction methods. Furthermore, we assess clinical feasibility by evaluating the system with three human cadavers. Overall, 132 needle insertions were performed with an off-axis needle placement error of 5.30±3.25 mm. Tissue samples were successfully biopsied and histopathologically verified. Users reported a very intuitive needle placement approach, indicating that the system is a promising, precise, and low-risk alternative to conventional approaches.

Abstract:
Reactive intelligence remains one of the cornerstones of versatile robotics operating in cluttered, dynamic, and human-centred environments. Among reactive approaches, potential fields (PF) continue to be widely adopted due to their simplicity and real-time applicability. However, existing PF methods typically oversimplify environmental representations by relying on isotropic, point- or sphere-based obstacle approximations. In human-centred settings, this simplification results in overly conservative paths, cumbersome tuning, and computational overhead, even breaking real-time requirements. In response, we propose the Geometric Potential Field (GeoPF), a reactive motion-planning framework that explicitly infuses geometric primitives (points, lines, planes, cubes, and cylinders), their structure, and their spatial relationships in modulating the real-time repulsive response. Extensive quantitative analyses consistently show GeoPF's higher success rates, reduced tuning complexity (a single parameter set across experiments), and substantially lower computational costs (up to 2 orders of magnitude) compared to traditional PF methods. Real-world experiments further validate GeoPF's reliability, robustness, and practical ease of deployment. GeoPF provides a fresh perspective on reactive planning problems, driving geometric-aware temporal motion generation and enabling flexible and low-latency motion planning suitable for modern robotic applications.
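
To illustrate why a geometric primitive can be less conservative than a sphere-based approximation, the sketch below computes a classic repulsive potential-field force using the true closest point on a line segment. This is a generic textbook-style formulation, not GeoPF's actual repulsive response; the function names, the `influence` radius, and the `gain` are assumptions chosen for illustration.

```python
import math

def closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is closest to p (3D tuples)."""
    ab = tuple(bi - ai for ai, bi in zip(a, b))
    ap = tuple(pi - ai for ai, pi in zip(a, p))
    denom = sum(c * c for c in ab)
    if denom == 0.0:
        return tuple(float(ai) for ai in a)  # degenerate segment: a single point
    # Project p onto the line and clamp the parameter to the segment.
    t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    return tuple(ai + t * c for ai, c in zip(a, ab))

def repulsive_force(p, a, b, influence=1.0, gain=1.0):
    """Repulsive force on p from segment a-b.

    Uses the gradient of U = 0.5 * gain * (1/d - 1/influence)^2,
    which is zero when p is farther than `influence` from the segment.
    """
    q = closest_point_on_segment(p, a, b)
    diff = tuple(pi - qi for pi, qi in zip(p, q))
    d = math.sqrt(sum(c * c for c in diff))
    if d >= influence or d == 0.0:
        return (0.0, 0.0, 0.0)
    magnitude = gain * (1.0 / d - 1.0 / influence) / (d * d)
    return tuple(magnitude * c / d for c in diff)

if __name__ == "__main__":
    # A point 0.5 m above the middle of a 1 m segment along the x-axis.
    print(repulsive_force((0.5, 0.5, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)))
```

Covering the same segment with one bounding sphere would push the robot away from free space near the segment's ends, whereas the closest-point distance only reacts to the obstacle's actual extent.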

Abstract:
Designing effective reward functions remains a major challenge in reinforcement learning (RL), often requiring considerable human expertise and iterative refinement. Recent advances leverage Large Language Models (LLMs) for automated reward design, but these approaches are limited by hallucinations, reliance on human feedback, and challenges with handling complex, multi-step tasks. In this work, we introduce Reward Evolution with Graph-of-Thoughts (RE-GoT), a novel bi-level framework that enhances LLMs with structured graph-based reasoning and integrates Visual Language Models (VLMs) for automated rollout evaluation. RE-GoT first decomposes tasks into text-attributed graphs, enabling comprehensive analysis and reward function generation, and then iteratively refines rewards using visual feedback from VLMs without human intervention. Extensive experiments on 10 RoboGen and 4 ManiSkill2 tasks demonstrate that RE-GoT consistently outperforms existing LLM-based baselines. On RoboGen, our method improves average task success rates by 32.25%, with notable gains on complex multi-step tasks. On ManiSkill2, RE-GoT achieves an average success rate of 93.73% across four diverse manipulation tasks, significantly surpassing prior LLM-based approaches and even exceeding expert-designed rewards. Our results indicate that combining LLMs and VLMs with graph-of-thoughts reasoning provides a scalable and effective solution for autonomous reward evolution in RL.

Abstract:
Engaging K-12 students in authentic scientific research remains a significant challenge, particularly at the intersection of environmental science and robotics. We introduce the Jar Jar ROV, a low-cost, open-source Remotely Operated Vehicle (ROV) platform designed for citizen science-based water quality monitoring by middle school students. This paper presents the design of the platform and the results of a large-scale deployment with over 100 students across a US state who built, programmed, and deployed the ROVs in local lakes. The educational framework yielded high student engagement in hands-on activities, with ROV construction earning a perfect average score from mentors. From a scientific standpoint, the program successfully established a grassroots monitoring network, generating nearly eleven thousand validated measurements of temperature, pH, dissolved oxygen, and turbidity. However, our evaluation identified a critical "engagement gap," with student interest declining sharply during more complex tasks such as electronics assembly and data uploading. This paper contributes both a validated, scalable model for integrating robotics into environmental education and a clear, data-driven roadmap for future improvements. These enhancements focus on lowering technical barriers and creating a more intuitive link between data collection and scientific discovery, addressing a key challenge in empowering the next generation of citizen scientists.

Abstract:
Transformable wheel-legged robots can adjust their configuration according to terrain conditions, enabling effective operation in harsh environments. While existing controllers based on preset commands have successfully demonstrated the feasibility of reconfigurable mechanisms, they still struggle to handle complex autonomous operations. To address this, we develop a comprehensive motion model for such robots, encompassing chassis kinematics, chassis-wheel kinematics, and stability models, along with a hierarchical path tracking method. The upper controller uses model predictive control with an error state-space model to optimize real-time tracking error under input constraints and generate desired commands. The lower controller utilizes feedforward control to convert desired inputs into actual ones, while accommodating physical constraints and geometric coupling associated with variable-radius wheels. Comparative analyses confirm the effectiveness of the proposed approach and demonstrate the robot's performance under different wheel modes.

Abstract:
Artificial Intelligence is widely recognised as a driver of adaptive autonomy in robotics. Yet, the extent to which AI techniques truly permeate the functional architecture of autonomous systems is still only partially characterised. Existing bibliometric analyses typically map research themes, keywords or algorithms but provide limited insight into how contributions distribute across the functional logic of autonomous systems. This raises fundamental questions: is AI really pervasive across the functions that enable robots to act adaptively in complex environments? Which areas are mature or under-explored? To answer these questions, the paper adopts a functional, control-loop-oriented perspective that avoids the bias of vertical domains or robot-specific applications. More than 2500 scientific works, published in the last 25 years, were mapped across 13 functional modules using a multi-label neural classification pipeline, and analysed via co-occurrence and structural techniques. This approach highlights not only areas where AI is already known to be central and is consistently confirmed as such, but also those where its impact would be expected to be significant yet remains surprisingly limited. By combining architectural reasoning with bibliometric evidence, the study provides a broader lens for assessing research gaps and for situating current advances within the long-term agenda of adaptive and human-centred autonomy.

Abstract:
Moving object segmentation (MOS) is essential for autonomous driving, enabling robust detection, tracking, and prediction of dynamic agents in complex traffic scenarios. Radar sensors offer notable advantages for long-range sensing, but their lower spatial resolution, measurement noise, and geometric distortions, particularly for distant targets, pose significant challenges for accurate MOS. These limitations are amplified when detecting small objects such as scooters. In this work, we present MotionNet-PGA, a Polar-guided Attention Framework designed specifically for scanning radar-based MOS. Our method builds on the multi-frame motion encoding backbone of MotionNet, and introduces a polar-guided attention module to suppress clutter, enhance motion feature representation, and improve segmentation of small and distant targets. For evaluation, we construct and annotate the ITRI Radar moving object segmentation Dataset. Experimental results demonstrate that our method surpasses the state-of-the-art baseline, MotionNet, by 2.48% in overall IoU and achieves a 4.08% improvement in small-object segmentation. These results highlight the effectiveness of polar-guided attention in addressing scanning radar-specific challenges.

Abstract:
Localizing a query frame of ground-view point clouds on a large-scale satellite image remains challenging. Existing LiDAR-to-image cross-view localization methods struggle in large-scale scenarios due to limited semantic alignment and the modality gap between point clouds and satellite images. This paper introduces a large-scale LiDAR-to-image geo-localization pipeline called GeoISF. GeoISF introduces an instance semantic forest constructed using WordNet, which enhances temporal semantic representation and discriminative power by integrating semantic trees from multiple frames. By leveraging environmental semantic representation as a shared medium, GeoISF effectively bridges the modality gap and improves semantic matching accuracy. Extensive experiments demonstrate the superior performance of GeoISF in large-scale cross-view localization, which achieves a 13.22-fold improvement compared to the parallel LiDAR-to-image method in the R@10 metric on the KITTI dataset. The proposed method addresses the existing gap in large-scale LiDAR-to-image cross-view localization, offering a robust solution to the computational and accuracy challenges inherent in such scenarios. We will release the code as an open-source resource available online for the broader research community.

Abstract:
Underwater 3D scene reconstruction is critical for the operation of underwater robotics, yet remains highly challenging due to the semi-transparent water medium, which introduces optical distortions, light scattering, and severe visibility degradation. Therefore, effective underwater image enhancement is a prerequisite for reliable reconstruction. However, existing approaches typically enhance individual views with pre-trained models before reconstruction, leading to poor generalization and inconsistent multi-view results. To address these limitations, we propose GS-UVCE, an end-to-end framework for Gaussian Splatting-driven Unsupervised Visual Consistency Enhancement. GS-UVCE incorporates a Medium-MLP to model water-medium effects and a Light-MLP to adaptively correct illumination, ensuring illumination consistency. Furthermore, depth regularization is introduced to preserve geometric consistency under varying scene conditions. Extensive experiments on multiple underwater datasets show that GS-UVCE consistently outperforms SOTA methods, achieving superior reconstruction fidelity and visual consistency enhancement.

Abstract:
In this paper, we use a machine learning approach to stabilize a Single Actuator Monocopter (SAM), showing its ability to operate autonomously outdoors utilizing an onboard Inertial Measurement Unit (IMU). We introduce a neural network-based proportional stabilizer that works in parallel to cascaded P/PID controllers. This network uses the IMU's data to predict the world frame angular velocity, which is then used to stabilize the SAM. Training data was collected to establish correspondences between the IMU readings and the world frame angular velocity from flights conducted within an indoor motion capture environment. We used data augmentation to improve the network's generalization and prediction performance by 9%. Once trained, the neural network was deployed on the SAM to estimate its angular velocity in real time. We then tested the SAM's autonomous capabilities in a large semi-outdoor space of approximately 16,000 m³ with wind disturbances of up to 1.5 m/s. We demonstrate position hold, waypoint, and continuous tracking tests, achieving median position errors of 0.5 m, 1.05 m, and 2.22 m, respectively, where no stabilization would result in failure of the defined tests.

Abstract:
Vision-Language-Action models (VLAs) achieve strong performance in general robotic manipulation tasks by scaling imitation learning. However, existing VLAs are limited to short-sighted next-action prediction and therefore struggle with long-horizon trajectory tasks due to incremental deviations. To address this problem, we propose a plug-in framework named VLA-Reasoner that effectively empowers off-the-shelf VLAs with the capability of foreseeing future states via test-time scaling. Specifically, VLA-Reasoner samples and rolls out possible action trajectories, where the involved actions serve as rationales for generating future states via a world model, which enables VLA-Reasoner to foresee and reason about potential outcomes and search for the optimal actions. We further leverage Monte Carlo Tree Search (MCTS) to improve search efficiency in large action spaces, where step-wise VLA predictions seed the root. Meanwhile, we introduce a confidence sampling mechanism based on Kernel Density Estimation (KDE) to enable efficient exploration in MCTS without redundant VLA queries. We evaluate intermediate states in MCTS via an offline value estimation strategy to score predicted futures and correct deviations with long-term feedback. We conducted extensive experiments in both simulators and the real world, demonstrating that our proposed VLA-Reasoner achieves significant improvements over the state-of-the-art VLAs. Our method highlights a potential pathway toward scalable test-time computation of robotic manipulation. The project website is available at: https://vla-reasoner.github.io/.

Abstract:
We investigate the task and motion planning problem for Signal Temporal Logic (STL) specifications in robotics. Existing STL methods rely on pre-defined maps or mobility representations, which are ineffective in unstructured real-world environments. We propose the Structured-MoE STL Planner (S-MSP), a differentiable framework that maps synchronized multi-view camera observations and an STL specification directly to a feasible trajectory. S-MSP integrates STL constraints within a unified pipeline, trained with a composite loss that combines trajectory reconstruction and STL robustness. A structure-aware Mixture-of-Experts (MoE) model enables horizon-aware specialization by projecting sub-tasks into temporally anchored embeddings. We evaluate S-MSP using a high-fidelity simulation of factory-logistics scenarios with temporally constrained tasks. Experiments show that S-MSP outperforms single-expert baselines in STL satisfaction and trajectory feasibility. A rule-based safety filter at inference improves physical executability without compromising logical correctness, showcasing the practicality of the approach.
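
For readers unfamiliar with the STL robustness term mentioned above, the following minimal sketch evaluates the standard quantitative semantics of "always" and "eventually" on a discrete 2D trajectory. It is not the S-MSP loss itself: the helper names, the circular goal and obstacle regions, and the example trajectory are all hypothetical, and a differentiable planner would typically replace `min`/`max` with smooth approximations.

```python
import math

def margin_inside(traj, center, radius):
    """Per-step margin of 'inside circle': positive when the point is inside."""
    return [radius - math.dist(p, center) for p in traj]

def margin_outside(traj, center, radius):
    """Per-step margin of 'outside circle': positive when the point is outside."""
    return [math.dist(p, center) - radius for p in traj]

def rho_always(margins, lo, hi):
    """Robustness of G_[lo,hi]: the worst margin over time steps lo..hi."""
    return min(margins[lo:hi + 1])

def rho_eventually(margins, lo, hi):
    """Robustness of F_[lo,hi]: the best margin over time steps lo..hi."""
    return max(margins[lo:hi + 1])

if __name__ == "__main__":
    traj = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    reach = rho_eventually(margin_inside(traj, (3.0, 0.0), 0.5), 0, 3)
    avoid = rho_always(margin_outside(traj, (1.0, 2.0), 1.0), 0, 3)
    # Conjunction takes the minimum; a positive value means the spec is satisfied.
    print("robustness:", min(reach, avoid))
```

A positive robustness value indicates how much the trajectory could be perturbed before violating the specification, which is what makes it usable as a training signal.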

Abstract:
We present STACK, a framework for discovering and learning composable manipulation skills from unsegmented demonstrations by leveraging spatial and temporal structure extracted from foundation models. STACK automatically extracts temporal structure by segmenting raw demonstrations into short-horizon skills using a video-language model, and spatial structure by identifying skill-relevant elements in 3D point cloud observations. For each discovered skill, we learn a diffusion-based trajectory sampler and a skill effect model, both of which operate in the reference frame of the relevant scene element. At test time, given a language goal, STACK segments the 3D scene, samples skill trajectories, and composes them by simulating geometric effects. This enables generalization to new scene configurations, geometric constraints, and longer task horizons beyond training across diverse real-world manipulation tasks. Project page: https://icra-stack.github.io

Abstract:
Modular robots offer a promising solution for building versatile and adaptable robotic systems. For instance, space exploration robots can be designed to reconfigure to meet diverse task demands across varying environments. However, training such systems by Reinforcement Learning (RL) remains challenging due to the diversity of morphologies and the lack of simulation environments that support simultaneous multi-morphology learning. We present Modular Mixture of Experts (M2oE), a novel reinforcement learning backbone network that imitates the modular structure of robots to enable efficient and module-wise parallelizable policy learning for modular robots. In M2oE, the shared pool of experts, combined with an attention-based gating mechanism that dynamically selects experts based on inter-module correlations, enables both specialization and generalization. This structure supports training across multiple morphologies within a single framework, avoiding gradient conflicts and enhancing experience sharing across modules and morphologies. To support training, we also extend the Isaac Lab simulator with multi-morphology extensions that enable concurrent training across diverse robot configurations. Experiments on a space-exploration-inspired modular robot, Moonbot, demonstrate that M2oE significantly improves learning efficiency and achieves superior performance compared to both MLP and Transformer baselines. More information and the project video are available on the project website: https://ryuuchou17.github.io/m2oe

Abstract:
This paper presents a control methodology for achieving orbital stabilization with simultaneous time synchronization of periodic trajectories in underactuated robotic systems. The proposed approach extends the classical transverse linearization framework to explicitly incorporate time-desynchronization dynamics. To stabilize the resulting extended transverse dynamics, we employ a combination of time-varying LQR and sliding-mode control. The theoretical results are validated experimentally through the implementation of both centralized and decentralized control strategies on a group of six Butterfly robots.

Abstract:
In recent years, precision agriculture has been introducing groundbreaking innovations in the field, with a strong focus on automation. However, research studies in robotics and autonomous navigation often rely on controlled simulations or isolated field trials. The absence of a realistic common benchmark represents a significant limitation for the diffusion of robust autonomous systems under real complex agricultural conditions. Vineyards pose significant challenges due to their dynamic nature, and they are increasingly drawing attention from both academic and industrial stakeholders interested in automation. In this context, we introduce the TEMPO-VINE dataset, a large-scale multi-temporal dataset specifically designed for evaluating sensor fusion, simultaneous localization and mapping (SLAM), and place recognition techniques within operational vineyard environments. TEMPO-VINE is the first multi-modal public dataset that brings together data from heterogeneous LiDARs of different price levels, AHRS, RTK-GPS, and cameras in real trellis and pergola vineyards, with multiple rows exceeding 100 m in length. In this work, we address a critical gap in the landscape of agricultural datasets by providing researchers with a comprehensive data collection and ground truth trajectories in different seasons, vegetation growth stages, terrain and weather conditions. The sequence paths with multiple runs and revisits will foster the development of sensor fusion, localization, mapping and place recognition solutions for agricultural fields. The dataset, the processing tools and the benchmarking results are available on the webpage.

Abstract:
The automatic manipulation of Deformable Linear Objects (DLOs) remains a challenge in robotics. Previous research on robotic DLO manipulation has primarily addressed quasi-static manipulation at low speeds, leaving the potential of dynamic DLO manipulation largely unexplored. This paper introduces DynDLO, a goal-conditioned, 6-axis robot-independent Reinforcement Learning sandbox for training agents on a variety of DLO dynamic manipulation tasks. In DynDLO, a DLO attached to the robot Tool Center Point (TCP) is simulated in the MuJoCo environment. By employing a B-Spline based trajectory generation function, the agent is capable of learning single and multiple step trajectories for the TCP, which succeed in various DLO dynamic manipulation problems. Specifically, we propose tailored design strategies for the reward function according to the classification of tasks into implicit or explicit DLO shape control tasks. Experiments on four representative tasks demonstrate that DynDLO is capable of generating dynamic manipulation policies that transfer successfully from simulation to the real world, achieving high success rates without requiring real-world training.

Abstract:
Autonomous mobile robots must know each other's positions to coordinate their actions and motion. Beyond collision avoidance, relative position estimation is essential for spatial coordination tasks such as collective motion, leader-follower dynamics, or formation control. To overcome the scalability and resilience issues of centralized orchestrators that transmit real-time positional information to every robot, we study mechanisms of onboard vision sensing. Conventional localization methods, such as SLAM, are typically too computationally demanding for real-time use on small, resource-constrained mobile robots. Vision-based neural networks offer a promising alternative but often require large, high-quality datasets that are expensive to collect. We present AutoPercep, a pipeline that automatically generates training data and trains a lightweight neural network to estimate neighbor positions. Robots capture camera images that are automatically labeled using ground-truth data from a motion-capture system. In our experiments, AutoPercep collected over 10,000 high-quality images within 10 minutes and trained a neural network in about 1 hour, which could be deployed on Raspberry Pi 4B-based robots for onboard neighbor detection. Moreover, we show that a network trained on five robots generalizes to seven-robot deployments. We finally evaluate the trained model in a sequential leader-follower case study. Our end-to-end pipeline demonstrates the feasibility and low cost of onboard, vision-based neighbor perception, supporting scalability to large robot swarms and opening opportunities for deployment beyond laboratory settings. The code for training and evaluation is available at https://github.com/preon7/autopercep

Abstract:
Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA's mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Project page: http://xjh19971.github.io/QAA.

Abstract:
Screw-based propulsion systems offer promising capabilities for amphibious mobility, yet face significant challenges in optimizing locomotion across water, granular materials, and transitional environments. This study presents a systematic investigation into the locomotion performance of various screw configurations in media such as dry sand, wet sand, saturated sand, and water. Using a principles-first analysis of screw performance, we find that certain parameters have a dominant impact on performance. Depending on the media, derived parameters inspired by heat sink design optimization help categorize performance within the dominant design parameters. Our results provide specific insights into screw shell design and adaptive locomotion strategies to enhance the performance of screw-based propulsion systems for versatile amphibious applications.

Abstract:
Localization is a key task in robot navigation, and many techniques exist for it. In many plausible scenarios, a robot might face unforeseen, dynamic obstacles, rendering any pre-determined map inaccurate for localization. In this work, we propose a robust lifelong localization framework in dynamic planar indoor environments, using the robot's odometry and sparse distance sampling. We demonstrate how distance samples can be used to provide a robust prior on the robot's location. This technique can solve the kidnapped robot problem in real time, up to symmetries. Based on insights from real-world recorded data, we also account for dynamic obstacles. We then fuse this prior, over time, with the odometry to converge to the robot's location. A central property of our method is that it provably converges to the robot's ground truth pose even in large indoor environments when the environment is static. We further show that this guarantee also holds in dynamic environments, as long as the nature of those changes has been correctly learned. We demonstrate the effectiveness of our approach in different real-world indoor environments. In particular, we achieve a localization comparable to SLAM with merely a few (sixteen) distance samples, as opposed to the full LiDAR range. Sufficing with only sparse distance sampling is advantageous in terms of sensor cost, privacy, storage space, and transmission bandwidth.

Abstract:
Level set methods underpin modern safety techniques such as control barrier functions (CBFs), while also serving as implicit surface representations for geometric shapes via distance fields. Inspired by these two paradigms, we propose a unified framework where the implicit surface itself acts as a CBF. We leverage Gaussian process (GP) implicit surface (GPIS) to represent the safety boundaries, using safety samples derived from sensor measurements to condition the GP. The GP posterior mean defines the implicit safety surface (safety belief), while the posterior variance provides a robust safety margin. Although GPs have favorable properties such as uncertainty estimation and analytical tractability, they scale cubically with data. To alleviate this issue, we develop a sparse solution called sparse Gaussian CBFs. To the best of our knowledge, GPIS has not been explicitly used to synthesize CBFs. We validate the approach on collision avoidance tasks in two settings: a simulated 7-DOF manipulator operating around the Stanford bunny, and a quadrotor navigating in 3D around a physical chair. In both cases, Gaussian CBFs (with and without sparsity) enable safe interaction and collision-free execution of trajectories that would otherwise intersect the objects.
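As a rough illustration of the idea above, the sketch below builds a barrier value h(x) = mu(x) - beta * sigma(x) from a GP posterior in plain Python. The RBF kernel, the label convention (+1 for safe samples, -1 for unsafe ones), the margin weight `beta`, and all function names are assumptions made for this example, not details taken from the paper; the sparse variant is not shown.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two points (assumed kernel choice)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def gp_barrier(x, X, y, beta=1.0, noise=1e-6):
    """Barrier value: posterior mean minus beta times posterior std."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [rbf(x, a) for a in X]
    alpha = solve(K, y)        # (K + noise*I)^-1 y
    v = solve(K, k_star)       # (K + noise*I)^-1 k_star
    mean = sum(ks * al for ks, al in zip(k_star, alpha))
    var = max(0.0, rbf(x, x) - sum(ks * vi for ks, vi in zip(k_star, v)))
    return mean - beta * math.sqrt(var)

# Hypothetical 1-D safety samples: unsafe at 0.0, safe at 2.0.
X = [(0.0,), (2.0,)]
y = [-1.0, 1.0]
```

Under these assumptions, a query far from all samples has a near-zero mean and large variance, so its barrier value is negative and the point is treated as unsafe until it is observed.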

Abstract:
Simulation-based training offers an efficient paradigm for robotic skill learning, providing scalable data generation while reducing reliance on costly hardware trials and manual data collection. However, existing methods that rely on handcrafted scenarios fail to fully cover the complexity of open-world variations and neglect the critical insights offered by inevitable failures in unseen environments. As a result, current policies struggle to achieve robust generalization, hindering deployment in open-world settings. This highlights the need for a continuous learning framework that enables robots to reflect on failures and iteratively refine policies in a targeted way. In this paper, we propose CorrectManip, a novel data-driven closed-loop framework that enables the policy to continuously improve performance in unseen environments by learning from failures. Whereas existing methods remain confined to single-loop adaptation, addressing policy errors in static environments or indiscriminately scaling data without targeting failure modes, CorrectManip closes the loop at both policy recovery and environment generation through two components: EvoGen, a self-evolving generator, and TTO, a test-time optimization module. EvoGen adaptively generates training data to strengthen policy performance, while TTO analyzes execution failures to provide fine-grained optimization signals. Together, TTO exposes policy weaknesses and EvoGen converts them into task-relevant training data, forming a closed feedback loop that drives continual policy improvement and stronger generalization. Extensive experiments across diverse tasks demonstrate that CorrectManip improves the average success rate in unseen environments by 45.22% over baseline methods. These results validate the complementary roles of TTO and EvoGen in enhancing generalization. Furthermore, we showcase sim-to-real transfer ability on Unitree H1 and Unitree G1. Demos are available here.

Abstract:
Compliant mechanisms, e.g., tensegrities, inherently exhibit nonlinear behavior, wherein the stiffness matrix, evaluated at a specific configuration, characterizes the instantaneous relationship between applied forces and resulting displacements. For traditional robot joints, the stiffness matrix is defined using Cartesian and Euler angle parameters. This representation is convenient when the joints display translation or single degree of rotation behavior. However, it faces parameterization issues in modeling higher degree of freedom joints due to singularities and lack of uniqueness. Lie groups and screw theory representations provide a minimal and intrinsic representation of the rigid body motion. This representation is well suited for tensegrity joints which combine tensile and compressive members and behave as six degree-of-freedom joints. A key challenge in this context is that computing the stiffness matrix necessitates differentiating the transformation matrix with respect to the screw, a task that is highly nontrivial. This work derives an analytical formulation of the stiffness matrix for six degree-of-freedom tensegrity joints using screw theory representation, including a closed-form expression for the derivative of the transformation matrix with respect to its exponential coordinates. The analytical results are validated against numerical differentiation while achieving approximately three times faster computation speeds. The paper further interprets the stiffness matrix through block-form, column-wise, and row-wise representations, providing additional physical insight into the translational, rotational, and coupled stiffness contributions. These contributions establish an efficient framework for the stiffness analysis and lay the foundation for future integration of screw theory methods into Euler-Lagrange dynamics for higher degree-of-freedom robot joints including tensegrity joints.

Abstract:
Monocular visual odometry (VO) often suffers from scale ambiguity and interference from moving objects in real-world scenarios. Jointly learning optical flow and depth estimation provides a promising solution for these issues by leveraging their geometric correlation and task complementarity. In this paper, we propose JFD-VO, a novel monocular VO framework that integrates jointly learned optical flow and depth networks. We design a two-stage training process with recursive noise diffusion and a specialized loss function, which enables the model to predict dense and scale-aware depth and optical flow using only readily available sparse LiDAR data and pose ground truth, thereby eliminating the need for expensive and difficult-to-obtain dense annotations. Furthermore, a dedicated masking module is incorporated during joint training to enhance robustness in dynamic environments. Within the VO pipeline, we introduce a Keypoint-weighted Matching Selection module that prioritizes stable features based on forward-backward flow consistency, rather than treating all pixels equally as in conventional optical flow methods. Extensive experiments on public datasets demonstrate the effectiveness of our joint training approach. JFD-VO achieves state-of-the-art accuracy, reducing absolute trajectory error by 14.99% and 27.37% over KPDepth-VO and DF-VO, respectively. Code and our self-collected dataset are available at: https://github.com/huqingyuan-9952/JFD-VO.
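The forward-backward flow consistency check mentioned above can be sketched as follows. This is a generic illustration, not the paper's Keypoint-weighted Matching Selection module: the nested-list flow layout, the nearest-pixel lookup, and the function name are all assumptions made for the example.

```python
import math

def fb_consistency_error(flow_fwd, flow_bwd, x, y):
    """Forward-backward consistency error at pixel (x, y).

    flow_fwd[y][x] and flow_bwd[y][x] hold (u, v) displacements.
    A pixel is consistent when following the forward flow and then the
    backward flow returns close to the starting point.
    """
    h, w = len(flow_fwd), len(flow_fwd[0])
    u, v = flow_fwd[y][x]
    # Nearest-pixel lookup of the forward-warped location (assumed simplification).
    xt, yt = round(x + u), round(y + v)
    if not (0 <= xt < w and 0 <= yt < h):
        return math.inf  # the match left the image, so treat it as unreliable
    ub, vb = flow_bwd[yt][xt]
    return math.hypot(u + ub, v + vb)

# Hypothetical 1x3 flow fields: everything moves one pixel to the right.
fwd = [[(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]]
bwd = [[(-1.0, 0.0), (-1.0, 0.0), (-1.0, 0.0)]]
```

A small error suggests a stable match, so one plausible use is to weight or keep only keypoints whose error falls below a chosen threshold.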

Abstract:
Autonomous navigation in partially observable environments requires agents to reason beyond immediate sensor input, exploit occlusion, and ensure safety while progressing toward a goal. These challenges arise in many robotics domains, from urban driving and warehouse automation to defense and surveillance. Classical path planning approaches and memory-less reinforcement learning often fail under limited fields-of-view (FoVs) and occlusions, committing to unsafe or inefficient maneuvers. We propose a hierarchical navigation framework that integrates a Deep Transformer Q-Network (DTQN) as a high-level subgoal selector with a modular low-level controller for waypoint execution. The DTQN consumes short histories of task-aware features, encoding odometry, goal direction, obstacle proximity, and visibility cues, and outputs Q-values to rank candidate subgoals. Visibility-aware candidate generation introduces masking and exposure penalties, rewarding the use of cover and anticipatory safety. A low-level potential field controller then tracks the selected subgoal, ensuring smooth short-horizon obstacle avoidance. We validate our approach in 2D simulation and extend it directly to a 3D Unity-ROS environment by projecting point-cloud perception into the same feature schema, enabling transfer without architectural changes. Results show consistent improvements over classical planners and RL baselines in success rate, safety margins, and time-to-goal, with ablations confirming the value of temporal memory and visibility-aware candidate design. These findings highlight a generalizable framework for safe navigation under uncertainty, with broad relevance across robotic platforms.
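For readers unfamiliar with potential field control, the low-level tracking step described above can be approximated by the classical attractive-plus-repulsive formulation below. The gains `k_att` and `k_rep`, the influence radius `d0`, and the function name are illustrative assumptions; the paper's actual controller may differ.

```python
import math

def potential_field_step(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=1.0):
    """Return the 2D force (fx, fy) pulling toward goal and away from obstacles."""
    # Attractive term: proportional to the vector from the robot to the subgoal.
    fx = k_att * (goal[0] - pos[0])
    fy = k_att * (goal[1] - pos[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        # Repulsive term: active only inside the influence radius d0.
        if 0 < d < d0:
            mag = k_rep * (1.0 / d - 1.0 / d0) / d ** 2
            fx += mag * dx / d
            fy += mag * dy / d
    return fx, fy

# Hypothetical usage: the subgoal is 2 m ahead and one obstacle sits 0.5 m in front.
free = potential_field_step((0.0, 0.0), (2.0, 0.0), [])
blocked = potential_field_step((0.0, 0.0), (2.0, 0.0), [(0.5, 0.0)])
```

In the blocked case the repulsive term outweighs the attraction, which is the behavior that yields short-horizon avoidance; it is also why such controllers are usually paired with a higher-level subgoal selector to escape local minima.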

Abstract:
Service robots must retrieve objects in dynamic, open-world settings where requests may reference attributes (the red mug), spatial context (the mug on the table), or past states (the mug that was here yesterday). Existing approaches capture only parts of this problem: scene graphs capture spatial relations but ignore temporal grounding, temporal reasoning methods model dynamics but do not support embodied interaction, and dynamic scene graphs handle both but remain closed-world with fixed vocabularies. We present STAR (SpatioTemporal Active Retrieval), a framework that unifies memory queries and embodied actions within a single decision loop. STAR leverages a non-parametric long-term memory and a working memory to support efficient recall, and uses a vision-language model to select either temporal or spatial actions at each step. We introduce STARBench, a benchmark of spatiotemporal object search tasks across simulated and real environments. Experiments on STARBench and on a Tiago robot show that STAR consistently outperforms scene-graph and memory-only baselines, demonstrating the benefits of treating search in time and search in space as a unified problem.

Abstract:
With the proliferation of autonomous vehicles (AVs) and their increasing interaction and communication with riders, grounding or locating the visual objects of interest (OoIs), such as pedestrians of concern and other traffic participants, based on the human riders' natural language and communication (e.g., vocal commands) is essential for increasing the efficiency, effectiveness, and reliability/safety of AVs in following the riders' reasonable commands and preferences. There are several technical challenges to achieving visual grounding for such human-to-vehicle commanding (HVC) scenes, including (1) how to fuse heterogeneous sensor modalities, i.e., visual object information, textual contexts, and situation awareness (say, obtained from light detection and ranging); (2) how to discern opaque commands in human natural language; and (3) how to reason about the relative positions of the OoIs within the visual modality. To meet these challenges, we propose VIGOR, a VIsual Grounding approach based on heterogeneous mOdality learning and hierarchical Reasoning for HVC scenes. First, we design a heterogeneous modality learning approach to incorporate the visual, textual, and situational modalities, and learn their cross-modality representations to identify important information for visual grounding. Then, VIGOR performs hierarchical reasoning at the object and context levels, and differentiates the OoIs in complex traffic environments that relate to the natural language commands. Finally, we conduct extensive experimental studies on a total of 12,037 HVC scenes, demonstrating that VIGOR achieves higher accuracy than state-of-the-art approaches (by 14.81% on average) in terms of Intersection over Union (IoU) in grounding the OoIs in complex (including low-visibility) HVC scenes.
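Since the reported accuracy is measured by Intersection over Union, a minimal sketch of that metric for two axis-aligned boxes may help. The `(x_min, y_min, x_max, y_max)` box format and the function name are assumptions for this example; the abstract does not specify the exact evaluation protocol.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical predicted and ground-truth boxes overlapping in a 1x1 square.
score = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Here the intersection area is 1 and the union is 4 + 4 - 1 = 7, so `score` is 1/7.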

Abstract:
Peripheral nerve injuries represent a significant clinical challenge in reconstructive surgery, traumatology, and neurosurgery, often leading to permanent sensorimotor deficits and diminished quality of life. Thus, achieving precise epineurial suturing without nerve fascicle damage and tension remains a long-term aspiration for nerve repair. Yet, current techniques, mostly using direct suturing by surgeons, exhibit unavoidable tension and limited functional outcomes. To address these limitations, this work proposes a dual-arm nanorobotic system-based approach for highly automated, precise, repeatable nerve suturing. An optimized path planning algorithm is designed that leverages epineurial thickness estimation to control needle insertion depth and suturing trajectory. Owing to the natural advantages of nanorobotics and microscopy, the developed system can suture nerves with micron-scale diameters within confined spaces. Ex-vivo experiments on three types of rabbit sciatic nerves demonstrated the effectiveness of the approach, with motion accuracies of 48 microns and 39 microns for the two arms. In-vivo experiments with anatomic and functional analyses further validated the functional recovery, showing the potential for clinical translation.

Abstract:
In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe autonomous driving in unexpected scenarios, one may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) in autonomous driving facilitates human-vehicle interactions, and may improve driving performance in unexpected scenarios. However, this approach requires substantial computational resources, which are inherently limited in autonomous vehicles, due to its reliance on an LLM and many visual tokens from sensor inputs. Many MLLM studies have explored reducing the number of visual tokens, but many of these approaches tend to exhibit some end-task performance degradation compared to using all tokens. For efficient E2E driving while maintaining driving performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for Multi-modal LLMs (SToRM). The proposed SToRM framework consists of three key elements. First, we propose a lightweight importance predictor with short-term sliding windows that predicts the importance scores of visual tokens. Second, we propose a supervised learning approach for the importance predictor that uses an auxiliary path to obtain pseudo-supervision signals from an all-token pass through the LLM. Third, guided by predicted importance scores, we propose an anchor-context merging module that partitions tokens into anchors and context tokens, then merges the latter into their most relevant anchors to reduce redundancy while minimizing information loss. Experiments with the LangAuto benchmark dataset show that the proposed SToRM outperforms state-of-the-art E2E driving MLLMs under an equal reduced-token budget and maintains all-token performance while substantially reducing computational cost, by up to 30×.

Abstract:
Vision-based 3D semantic occupancy prediction is vital for autonomous driving, enabling unified modeling of static infrastructure and dynamic agents. Global occupancy maps serve as long-term memory priors, providing valuable historical context that enhances local perception. This is particularly important in challenging scenarios such as occlusion or poor illumination, where current and nearby observations may be unreliable or incomplete. Priors aggregated from previous traversals under better conditions help fill gaps and enhance the robustness of local 3D occupancy prediction. In this paper, we propose Long-term Memory Prior Occupancy (LMPOcc), a plug-and-play framework that incorporates global occupancy priors to boost local prediction and simultaneously updates global maps with new observations. To realize the information gain from global priors, we design an efficient and lightweight Current-Prior Fusion module that adaptively integrates prior and current features. Meanwhile, we introduce a model-agnostic prior format to enable continual updating of global occupancy and ensure compatibility across diverse prediction baselines. LMPOcc achieves state-of-the-art local occupancy prediction performance validated on the Occ3D-nuScenes benchmark, especially on static semantic categories. Furthermore, we verify LMPOcc's capability to build large-scale global occupancy maps through multi-vehicle crowdsourcing, and utilize occupancy-derived dense depth to support the construction of 3D open-vocabulary maps. Our method opens up a new paradigm for continuous global information updating and storage, paving the way towards more comprehensive and scalable scene understanding in large outdoor environments.

Abstract:
Classical graph of convex sets (GCS) formulations rely on pairwise edge costs, which are insufficient to capture higher-order geometric interactions relevant to trajectory refinement. This paper proposes a hyper graph of convex sets (HGCS), which extends GCS by introducing hyperedges over multiple vertices. Using a 3-uniform construction, a second-order smoothness cost is incorporated to favor path sequences that are more suitable for dynamically feasible trajectory generation. To preserve tractability, the HGCS is converted into an equivalent classical GCS, so the resulting shortest-path problem can still be solved with existing GCS methods. The discrete path is then refined by trajectory optimization within the corresponding safe corridor. Numerical simulations and quadrotor experiments show that the proposed method provides better initialization for downstream optimization, achieves shorter trajectory duration than hierarchical GCS baselines, and is faster than joint spatio-temporal optimization.

Abstract:
We address the challenge of enabling bipedal robots to traverse rough terrain by developing probabilistically safe planning and control strategies that ensure dynamic feasibility and centroidal robustness under terrain uncertainty. Specifically, we propose a high-level Model Predictive Control (MPC) navigation framework for a bipedal robot with a specified confidence level of safety that (i) enables safe traversal toward a desired goal location across a terrain map with uncertain elevations, and (ii) formally incorporates uncertainty bounds into the centroidal dynamics of locomotion control. To model the rough terrain, we employ Gaussian Process (GP) regression to estimate elevation maps and leverage Conformal Prediction (CP) to construct calibrated confidence intervals that capture the true terrain elevation. Building on this, we formulate contraction-based reachable tubes that explicitly account for terrain uncertainty, ensuring state convergence and tube invariance. In addition, we introduce a contraction-based flywheel torque control law for the reduced-order Linear Inverted Pendulum Model (LIPM), which stabilizes the angular momentum about the center-of-mass (CoM). This formulation provides both probabilistic safety and goal reachability guarantees. For a given confidence level, we establish the forward invariance of the proposed torque control law by demonstrating exponential stabilization of the actual CoM phase-space trajectory and the desired trajectory prescribed by the high-level planner. Finally, we evaluate the effectiveness of our planning framework through physics-based simulations of the Digit bipedal robot in MuJoCo.
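To make the Conformal Prediction step more concrete, the sketch below computes a split-conformal interval half-width from held-out elevation residuals. It is a standard textbook construction offered as an assumption about how such calibrated intervals can be built, not the paper's exact procedure; the function name and data are hypothetical, and the GP predictor is stood in for by precomputed predictions.

```python
import math

def conformal_halfwidth(calib_true, calib_pred, alpha):
    """Half-width q such that [pred - q, pred + q] covers the truth with
    probability at least 1 - alpha, assuming exchangeable data."""
    scores = sorted(abs(t - p) for t, p in zip(calib_true, calib_pred))
    n = len(scores)
    # Finite-sample corrected rank of the calibration residuals.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this confidence level
    return scores[k - 1]

# Hypothetical calibration set: true elevations are 0, predictions are off by 1..10.
calib_true = [0.0] * 10
calib_pred = [float(i) for i in range(1, 11)]
q = conformal_halfwidth(calib_true, calib_pred, alpha=0.2)
```

With ten calibration residuals and `alpha=0.2`, the rank is ceil(11 * 0.8) = 9, so `q` is the ninth smallest residual, 9.0. A planner could then treat each predicted elevation plus or minus `q` as the uncertainty bound passed to the reachable-tube computation.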

Abstract:
Spoken language instructions are ubiquitous in agent collaboration. However, in real-world human-robot collaboration, following human spoken instructions can be challenging due to various speaker and environmental factors, such as background noise or mispronunciation. When faced with noisy auditory inputs, humans can leverage the collaborative context in the embodied environment to interpret noisy spoken instructions and take pragmatic assistive actions. In this paper, we present a cognitively inspired neurosymbolic model, Spoken Instruction Following through Theory of Mind (SIFToM), which leverages a Vision-Language Model with model-based mental inference to enable robots to pragmatically follow human instructions under diverse speech conditions. We test SIFToM in both simulated environments (VirtualHome) and real-world human-robot collaborative settings. Results show that SIFToM can significantly improve the performance of a lightweight base VLM (Gemini 2.5 Flash), outperforming state-of-the-art VLMs (Gemini 2.5 Pro) and approaching human-level accuracy on challenging spoken instruction following tasks.

Abstract:
Safe and computationally efficient local planning for mobile robots in dense, unstructured human crowds remains a fundamental challenge. Moreover, ensuring that robot trajectories resemble human motion will increase the acceptance of the robot in human environments. In this paper, we present Crowd-FM, a learning-based approach to address both safety and human-likeness challenges. Our approach has two novel components. First, we train a Conditional Flow-Matching (CFM) policy over a dataset of optimally controlled trajectories to learn a set of collision-free primitives that a robot can choose from in any given scenario. The chosen optimal control solver can generate multi-modal collision-free trajectories, allowing the CFM policy to learn a diverse set of maneuvers. Second, we learn a score function over a dataset of human demonstration trajectories that provides a human-likeness score for the flow primitives. At inference time, computing the optimal trajectory requires selecting the one with the highest score. Our approach improves the state-of-the-art by showing that our CFM policy alone can produce collision-free navigation with a higher success rate than existing learning-based baselines. Furthermore, when augmented with inference-time refinement, our approach can outperform even expensive optimisation-based planning approaches. Finally, we validate that our scoring network can select trajectories closer to the expert data than a manually designed cost function.

Abstract:
Plane segmentation algorithms are widely used in robotics, serving key roles in scenarios such as indoor localization, scene understanding, and robotic manipulation. These applications typically require real-time, precise, and robust plane segmentation processing, which presents a significant challenge. Existing methods based on pixel-wise or fixed-size patch-wise operations are redundant, as planar regions in real-world scenes are of diverse sizes. In this paper, we introduce a highly efficient method for plane segmentation, namely Adaptive Patch-wise Region Growing (APRG). APRG begins with data sampling to construct a data pyramid. To avoid redundant plane fitting in large planar regions, we introduce an adaptive patch-wise plane fitting algorithm with the pyramid accessed in a top-down manner. The largest possible planar patches are obtained in this process. Subsequently, we introduce a region growing algorithm specially designed for our patch representation. Overall, APRG achieves more than 600 FPS at a 640x480 resolution on a mid-range CPU without using parallel acceleration techniques, which outperforms the state-of-the-art method by a factor of 1.46. In addition to its run-time speedup, APRG significantly improves the segmentation quality, especially on real-world data.

Abstract:
LiDAR sensors are used to provide three-dimensional information about the environment in many robotics applications. The information, accumulated in 3D point clouds, is first acquired by the sensor and then processed further, which leads to high end-to-end latencies and large memory footprints. Streaming approaches tackle this problem by processing partial point cloud data during scanning of the environment. In contrast to existing work that is limited to power-hungry, rotating mechanical scanners, in this paper, we present a streaming method for more efficient scanline-based LiDAR sensors. We process the sequence of scanlines in the form of SpikeClouds with a Spiking Neural Network (SNN) backbone and perform 3D object detection from the accumulated information using a Convolutional Neural Network (CNN) head. Our method achieves close to state-of-the-art detection performance on the KITTI and JRDB22 datasets while reducing the end-to-end latency by 10% and the average memory footprint by 95% on standard GPU hardware. Additionally, when ported onto neuromorphic hardware, our backbone requires 25× less energy compared to reference backbones. SpikeClouds achieves fast and efficient environmental perception for robotic applications by streaming LiDAR to enable spike-based processing.

Abstract:
This paper addresses the challenge of Lidar-Inertial Odometry (LIO) in dynamic environments, where conventional methods often fail due to their static-world assumptions. Traditional LIO algorithms perform poorly when dynamic objects dominate the scenes, particularly in geometrically sparse environments. Current approaches to dynamic LIO face a fundamental challenge: accurate localization requires a reliable identification of static features, yet distinguishing dynamic objects necessitates precise pose estimation. Our solution breaks this circular dependency by integrating dynamic awareness directly into the point cloud registration process. We introduce a novel dynamic-aware iterative closest point algorithm that leverages spatio-temporal normal analysis, complemented by an efficient spatial consistency verification method to enhance static map construction. Experimental evaluations demonstrate significant performance improvements over state-of-the-art LIO systems in challenging dynamic environments with limited geometric structure. The code and dataset are available at https://github.com/thisparticle/btsa.

Abstract:
This work was developed under the need for an acoustic localization system to monitor marine protected areas (MPAs) with the help of autonomous underwater vehicles (AUVs). Although the use of acoustic signals for underwater localization has been previously studied, most of the solutions rely on filter-based optimization, which is prone to linearization problems in long-term applications. Instead, we implemented a Modular Acoustic Graph Simultaneous Localization and Mapping (SLAM) algorithm that, using a factor graph framework, tracks acoustic beacons with either ranges or bearings. In addition, we developed several novel methods, like a delayed-position update for ultra-short baseline (USBL) position factor integration process, an initialization algorithm for acoustic landmarks, and the creation of a new 3D bearing factor that combines two angles. After developing the algorithm, field experiments were carried out in different areas on the coast of Catalonia. Besides the localization, some monitoring tasks were also tested, such as visual mapping of localized landmarks or optical transmission of data with seafloor stations, which helped validate the accuracy of the acoustic localization system. The results of such experiments are presented and discussed.

Abstract:
Model-based manipulation of deformable objects has traditionally dealt with objects while neglecting their dynamics, thus mostly focusing on very lightweight objects at steady state. At the same time, soft robotic research has made considerable strides toward general modeling and control, despite soft robots and deformable objects being very similar from a mechanical standpoint. In this work, we leverage these recent results to develop a control-oriented, fully dynamic framework of slender deformable objects grasped at one end by a robotic manipulator. We introduce a dynamic model of this system using functional strain parameterizations and describe the manipulation challenge as a regulation control problem. This enables us to define a fully model-based control architecture, for which we can prove analytically closed-loop stability and provide sufficient conditions for steady state convergence to the desired state. The nature of this work is intended to be markedly experimental. We provide an extensive experimental validation of the proposed ideas, tasking a robot arm with controlling the distal end of six different cables, in a given planar position and orientation in space.

Abstract:
Precisely grasping an object is a challenging task due to pose uncertainties. Conventional methods have used cameras and fixtures to reduce object uncertainty. They are effective but require intensive preparation, such as designing jigs based on the object geometry and calibrating cameras with high-precision tools fabricated using lasers. In this study, we propose a method to reduce the uncertainty of the position and orientation of a grasped object without using a fixture or a camera. Our method is based on the concept that the flat finger pads of a parallel gripper can reduce uncertainty along its opening/closing direction through flat surface contact. Three orthogonal grasps by parallel grippers with flat finger pads collectively constrain an object's position and orientation to a unique state. Guided by the concepts, we develop a regrasp planning and admittance control approach that sequentially finds and leverages three orthogonal grasps of two robotic arms to actively reduce uncertainties in the object pose. We evaluated the proposed method on different initial object uncertainties and verified that it had good repeatability. The deviation levels of the experimental trials were on the same order of magnitude as those of an optical tracking system, demonstrating strong relative inference performance.

Abstract:
Robotic grippers aim to replicate the remarkable functionalities of the human hand by providing advanced perception, adaptability, stability, and dexterity for complex tasks. Achieving these capabilities demands a sophisticated design hierarchy and robust perception mechanisms that ensure accurate manipulation. This paper introduces Active Soft Polyhedral Networks (ActiveSPN), a gripper design that leverages an active, non-biomimetic surface for precise in-hand manipulation. A vision system integrated directly into the fingers further facilitates accurate pose estimation of the in-finger object. The proposed system includes: (i) a soft polyhedral network featuring a transparent active belt to deliver complete three-dimensional adaptation and dexterous in-finger motion, and (ii) a generative learning-based pipeline for in-finger pose estimation. Experimental results demonstrate the ability of ActiveSPN to execute multi-degree-of-freedom in-finger manipulations, including two-axis rotation and one-axis translation. Moreover, the integrated vision-based pose estimation provides robust, real-time predictions, supporting consistent closed-loop control. Across diverse objects, the system achieves mean translational errors of 2.59 mm and rotational errors of 7 degrees, highlighting a promising paradigm for compact, efficient, and dexterous robotic manipulation. Codes are available at https://github.com/ancorasir/ActiveSPN.

Abstract:
We present a method that enhances the safety and responsiveness of robotic manipulators through constrained Variable Admittance Control (VAC) combined with proximity perception. Recent studies have demonstrated that manipulators equipped with proximity sensors can avoid close obstacles in real time. However, unavoidable collisions remain a significant challenge in human-robot interaction (HRI). As a safety fallback, conventional reactive motion algorithms aim to avoid obstacles but often suffer from inefficient avoidance and do not consider collisions. Our approach integrates proximity-based pre-contact detection and VAC with QP-based motion constraints to proactively adjust the robot's impedance parameters while maintaining stable and controlled motion. By dynamically modulating stiffness and damping based on sensor feedback, the system improves both obstacle avoidance efficiency and smooth contact handling. Additionally, a passivity-preserving energy tank mechanism prevents instability caused by parameter variations, ensuring robust and adaptive behavior. Furthermore, experiments involving HRI demonstrate that the proposed method ensures both safe avoidance and smooth contact handling. These findings suggest that the proposed approach is well-suited for safety-critical applications in collaborative and industrial robotics.
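
As a rough illustration of the variable admittance idea, the sketch below integrates a one-dimensional admittance law M*a + D*v + K*x = f_ext whose damping grows as a proximity reading shrinks. The interpolation rule, all parameter values, and the function names are assumptions made for this example; they are not the authors' controller, and the QP constraints and energy tank are omitted.

```python
def damping_from_distance(d, d_min=0.05, d_max=0.30, D_low=5.0, D_high=40.0):
    """Hypothetical rule: raise damping linearly as an obstacle gets closer."""
    # Clamp the measured distance into [d_min, d_max], then interpolate.
    d = max(d_min, min(d_max, d))
    closeness = (d_max - d) / (d_max - d_min)  # 0 = far away, 1 = very close
    return D_low + closeness * (D_high - D_low)


def admittance_step(x, v, f_ext, dist, M=2.0, K=0.0, dt=0.001):
    """One semi-implicit Euler step of M*a + D*v + K*x = f_ext."""
    D = damping_from_distance(dist)      # damping varies with proximity
    a = (f_ext - D * v - K * x) / M      # acceleration from the admittance law
    v_next = v + a * dt                  # update velocity first
    x_next = x + v_next * dt             # then position, using the new velocity
    return x_next, v_next


def simulate(f_ext, dist, steps=5000):
    """Apply a constant force at a constant obstacle distance (toy scenario)."""
    x, v = 0.0, 0.0
    for _ in range(steps):
        x, v = admittance_step(x, v, f_ext, dist)
    return x, v
```

With K = 0 the velocity settles near f_ext / D, so the same push yields slower motion when the hypothetical proximity reading indicates a nearby obstacle.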

Abstract:
This study introduces a novel kinematic synergy-based exoskeleton designed for gait rehabilitation studies in rats. The exoskeleton assists all three hindlimb joints of the rat (hip, knee and ankle) while ensuring proper interjoint coordination and the natural quadrupedal posture. This assistance is realized through a 2-DOF bar mechanism that emulates the biomechanics of rats. Engineered to be compact, lightweight, backdrivable, and sufficiently powerful, the proposed system minimizes physical stress on the animal while allowing a wide range of assistive forces to be applied. These features are achieved through a combination of a cable power transmission system and direct-drive motors positioned outside the exoskeletal structure. The desktop experiments demonstrated that the exoskeleton could precisely replicate the rat's kinematic gait patterns and remain backdrivable whether powered or unpowered. The feasibility of gait assistance was further confirmed in an anesthetized rat, where synergistic gait patterns were observed between the joints. Hence, the system holds the potential to enable controlled comparative neurorehabilitation studies in rats. These studies can help unveil neural recovery mechanisms and design optimal exoskeleton control strategies for rehabilitation in humans.

Abstract:
Rovers have been a mainstay of planetary exploration missions, significantly expanding our knowledge in planetary science. However, past rover missions have involved significant human supervision to oversee rover operations, a state-of-practice that scales poorly for the next generation of missions. In this work, we present the development of Constrained Roving Exploration via Safe Collision-free and Environment-aware Trajectory optimization (CRESCENT), a motion planning algorithm developed for the upcoming multiagent Cooperative Autonomous Distributed Robotic Exploration (CADRE) Lunar rover mission. CRESCENT was designed to safely drive a miniature rover platform in a highly cluttered unmapped Lunar environment, executing complex motion directives from CADRE's team-level autonomy while meeting far stricter dynamical and temporal constraints than existing onboard planetary rover planning algorithms are capable of satisfying. Our hierarchical approach formulates an efficient numerical trajectory optimization-based motion planning algorithm that makes use of nonlinear optimization to solve the planning problem in real time. We demonstrate the efficiency of our proposed approach through extensive simulations and hardware testing in a representative Lunar environment. Following CADRE's upcoming deployment on the Lunar surface, CRESCENT will be the first nonlinear optimization-based trajectory optimization approach used on another celestial body.

Abstract:
Handling fragile objects remains a major challenge for robotic manipulation. Tactile sensing and soft robotics can improve delicate object handling, but typically involve high integration complexity or slow response times. We address these issues through FORTE, an easy-to-fabricate tactile sensing system. FORTE uses 3D-printed fin-ray grippers with internal air channels to provide low-latency force and slip feedback. This feedback allows us to apply just enough force to grasp objects without damaging them. We can accurately estimate grasping forces from 0 to 8 N with an average error of 0.2 N, and detect slip events within 100 ms of occurring. FORTE can grasp a wide range of slippery, fragile, and deformable objects, including raspberries and potato chips, with 92% success and achieves 93% accuracy in detecting slip events. These results highlight FORTE's potential as a robust solution for delicate robotic manipulation. Project page: https://merge-lab.github.io/FORTE/

Abstract:
Soft robot perception integrates information from distributed, multi-modal sensors, broadening the application of soft robots to active interaction. Our work introduces recurrent learning models for tactile-based object recognition, demonstrating comparable performance in virtual and real-world scenarios. The work focuses on soft grippers, which facilitate adaptation to objects of varying shapes and sizes thanks to passive finger compliance. Our model successfully identifies over sixteen heterogeneous objects. The findings underscore the advantage of multi-modal sensing over a single modality. We highlight how spatial distribution and sensory signal dynamics influence overall estimation accuracy, and identify the minimal grasp set needed for reliable recognition.

Abstract:
Addressing the global caregiver shortage through socially assistive robots necessitates a deep understanding of their psychological and physiological impacts on older adults. This study addresses whether social robots can serve as effective interaction partners compared to humans, and if "positive prompts" can similarly enhance these interactions. We conducted a comparative study with 35 participants (aged 70+) to evaluate responses during both human-human and human-robot encounters, including an assessment of "positive prompts" for cognitive reappraisal. Our multi-modal analysis, integrating facial expression data, heart rate variability, and subjective questionnaires, revealed no significant differences in overall stress levels between human and robot interactions. Facial expression analysis confirmed that the robot was accepted as a valid interaction partner, while physiological data showed slightly lower heart rates during robot interactions, suggesting a more relaxed state compared to human-led sessions. These findings indicate that social robots can engage older adults without inducing psychological strain and are capable of alleviating caregiver burden by performing structured tasks, such as health-sensing surveys. Future work should address the identified "appearance-content mismatch" in robot design to facilitate even more natural and effective interactions.

Abstract:
In modern industrial production, multiple robots often collaborate to complete complex manufacturing tasks. Large language models (LLMs), with their strong reasoning capabilities, have shown potential in coordinating robots for simple household and manipulation tasks. However, in industrial scenarios, stricter sequential constraints and more complex dependencies within tasks present new challenges for LLMs. To address this, we propose IMR-LLM, a novel LLM-driven Industrial Multi-Robot task planning and program generation framework. Specifically, we utilize LLMs to assist in constructing disjunctive graphs and employ deterministic solving methods to obtain a feasible and efficient high-level task plan. Based on this, we use a process tree to guide LLMs to generate executable low-level programs. Additionally, we create IMR-Bench, a challenging benchmark that encompasses multi-robot industrial tasks across three levels of complexity. Experimental results indicate that our method significantly surpasses existing methods across all evaluation metrics.
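
To make the disjunctive-graph step more concrete, the sketch below evaluates one candidate plan: precedence arcs plus an already chosen ordering for operations that share a robot, scored by the resulting makespan (longest path). It shows only generic job-shop machinery under assumed task names and durations; how IMR-LLM builds the graph with an LLM and which deterministic solver it uses are not specified in the abstract.

```python
from graphlib import TopologicalSorter


def earliest_starts(durations, arcs):
    """Earliest start time of each operation in a DAG of operations.

    durations: {operation: processing time}
    arcs: (before, after) pairs, covering precedence arcs and disjunctive
          (shared-robot) arcs whose direction has already been chosen.
    """
    preds = {op: set() for op in durations}
    for before, after in arcs:
        preds[after].add(before)
    start = {}
    # static_order() yields every operation after all of its predecessors;
    # it raises graphlib.CycleError if the chosen orientation is infeasible.
    for op in TopologicalSorter(preds).static_order():
        start[op] = max((start[p] + durations[p] for p in preds[op]), default=0)
    return start


def makespan(durations, arcs):
    """Completion time of the last operation under the given arc set."""
    start = earliest_starts(durations, arcs)
    return max(start[op] + durations[op] for op in durations)


# Hypothetical two-part job: both welds need the same welding robot.
durations = {"pick_A": 1, "weld_A": 3, "pick_B": 4, "weld_B": 2}
precedence = [("pick_A", "weld_A"), ("pick_B", "weld_B")]
a_first = makespan(durations, precedence + [("weld_A", "weld_B")])
b_first = makespan(durations, precedence + [("weld_B", "weld_A")])
```

In this toy instance, welding part A first finishes in 6 time units and welding part B first takes 9, so a solver searching over orientations would keep the first ordering.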

Abstract:
Tactile sensors are increasingly integrated into dexterous robotic manipulators to enhance contact perception. However, learning manipulation policies that rely on tactile sensing remains challenging, primarily due to the trade-off between fidelity and computational cost of soft-body simulations. To address this, we present ETac, a tactile simulation framework that models elastomeric soft-body interactions with both high fidelity and efficiency. ETac employs a lightweight data-driven deformation propagation model to capture soft-body contact dynamics, achieving high simulation quality and boosting efficiency that enables large-scale policy training. When serving as the simulation backend, ETac produces surface deformation estimates comparable to FEM and demonstrates applicability for modeling real tactile sensors. Then, we showcase its capability in training a blind grasping policy that leverages large-area tactile feedback to manipulate diverse objects. Running on a single RTX 4090 GPU, ETac supports reinforcement learning across 4,096 parallel environments, achieving a total throughput of 869 FPS. The resulting policy reaches an average success rate of 84.45% across four object types, underscoring ETac's potential to make tactile-based skill learning both efficient and scalable.

Abstract:
Magnetically actuated micro/nanorobot swarms have exhibited considerable promise for targeted biomedical delivery and localized therapies, attributed to their advantages of remote manipulation and robust penetration through biological tissues. However, achieving the simultaneous enhancement of both collective structural stability and efficient propulsion under a single-mode magnetic field remains a critical challenge. This paper presents a rotational-gradient superimposed magnetic actuation strategy that enables precise superposition of a uniform rotating field and a directional gradient magnetic field, using a tri-axial electromagnetic coil system. This approach significantly enhances the motility of micro/nanorobots while preserving their collective stability. Experimental results reveal that a co-directional gradient magnetic field can increase cluster velocity by 1.5-2 times without compromising cluster stability, while a counter-directional gradient magnetic field enables effective deceleration or anchoring of the swarm. Also, this paper elucidates the impact of the gradient magnetic field on swarm stability. Moreover, this paper demonstrates the formation of the chain-like structure of micro/nanorobots, which possess axial movement capability under the superimposed gradient magnetic field. This work provides a systematic theoretical and experimental foundation for multi-field synergistic actuation of micro/nanorobot swarms, and paves new paths for their application in biomedical micromanipulation.

Abstract:
Multi-UAV pursuit-evasion, where pursuers aim to capture evaders, poses a key challenge for UAV swarm intelligence. Multi-agent reinforcement learning (MARL) has demonstrated potential in modeling cooperative behaviors, but most RL-based approaches remain constrained to simplified simulations with limited dynamics or fixed scenarios. Previous attempts to deploy RL policy to real-world pursuit-evasion are largely restricted to two-dimensional scenarios, such as ground vehicles or UAVs at fixed altitudes. In this paper, we propose a novel MARL-based algorithm that learns online planning for multi-UAV pursuit-evasion in unknown environments (OPEN). OPEN introduces an evader prediction-enhanced network to tackle partial observability in cooperative policy learning. Additionally, OPEN proposes an adaptive environment generator within MARL training, enabling higher exploration efficiency and better policy generalization across diverse scenarios. Simulations show our method significantly outperforms all baselines in challenging scenarios, generalizing to unseen scenarios with a 100% capture rate. Finally, after integrating calibrated dynamics models of UAVs into training, we derive a feasible policy via a two-stage reward refinement and deploy the policy on real quadrotors in a zero-shot manner. To our knowledge, this is the first work to derive and deploy an RL-based policy using collective thrust and body rates control commands for multi-UAV pursuit-evasion in unknown environments. The open-source code and videos are available at https://sites.google.com/view/pursuit-evasion-rl.

Abstract:
This paper introduces a lightweight, three-degrees-of-freedom exoskeleton for wrist rehabilitation powered by Twisted String Actuators (TSAs), specifically designed to support flexion/extension, radial/ulnar deviation, and pronation/supination movements. Leveraging the high power-to-weight ratio of the TSA actuation system, the exoskeleton ensures effective, comfortable, and personalized rehabilitation exercises. The device comprises five TSAs arranged in a tendon-driven configuration, enabling precise control and adaptability to various user anatomies. Experimental evaluation was conducted on a prototype, demonstrating the device's ability to accurately replicate wrist movements guided by a physiotherapist, achieving low tracking errors (RMSE 1°). The exoskeleton effectively achieves the desired wrist range of motion (115° for flexion/extension, 70° for radial/ulnar deviation, and 150° for pronation/supination) with torque capabilities suitable for rehabilitation purposes (0.35 Nm for flexion/extension and radial/ulnar deviation, and 0.06 Nm for pronation/supination). These preliminary results validate the exoskeleton as a promising solution, offering improved comfort, flexibility, and effectiveness compared to traditional rehabilitation devices.

Abstract:
This letter proposes a control framework to enhance the robustness of a locomotion policy against uncertainties by integrating it with a deep disturbance observer (DOB) network and a deep state estimator network. The deep DOB approximates the inverse model of a quadrupedal robot. The locomotion policy is trained to produce optimal actions, while the deep DOB estimates the overall uncertainties of the robot and the deep state estimator estimates the body's linear velocities. All networks are trained under nominal conditions in Isaac Gym. Subsequently, all the trained networks are transferred to Gazebo and to a real robot running ROS2 to validate their robustness under uncertain conditions without additional tuning. Furthermore, validation results show that the proposed control framework performs best in velocity tracking compared to the baseline method, with the lowest estimation errors. This emphasizes the effectiveness of the proposed control framework in improving the robustness of the locomotion policy. Videos of the Isaac Gym and Gazebo simulations and of the real robot experiment are available at the project page: bit.ly/3CF3OTQ.

Abstract:
Effective movement primitives should be capable of encoding and generating a rich repertoire of trajectories -- typically collected from human demonstrations -- conditioned on task-defining parameters such as vision or language inputs. While recent methods based on the motion manifold hypothesis, which assumes that a set of trajectories lies on a lower-dimensional nonlinear subspace, address challenges such as limited dataset size and the high dimensionality of trajectory data, they often struggle to capture complex task-motion dependencies, i.e., when motion distributions shift drastically with task variations. To address this, we introduce Motion Manifold Flow Primitives (MMFP), a framework that decouples the training of the motion manifold from task-conditioned distributions. Specifically, we employ flow matching models, state-of-the-art conditional deep generative models, to learn task-conditioned distributions in the latent coordinate space of the learned motion manifold. Experiments are conducted on language-guided trajectory generation tasks, where many-to-many text-motion correspondences introduce complex task-motion dependencies, highlighting MMFP's superiority over existing methods.
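
For orientation, one common form of the conditional flow matching objective is written out below, using a linear interpolation path. The abstract does not state which variant MMFP uses, so this is a sketch under assumed notation: z_0 is a noise sample, z_1 is the latent coordinate of a demonstrated trajectory on the learned manifold, c is the task condition, and v_\theta is the learned velocity field.

```latex
z_t = (1 - t)\, z_0 + t\, z_1, \qquad t \sim \mathcal{U}[0, 1],
\qquad
\mathcal{L}(\theta) = \mathbb{E}_{t,\, z_0,\, (z_1, c)}
  \left\| v_\theta(z_t, t, c) - (z_1 - z_0) \right\|^2 .
```

Under these assumptions, a new latent sample is drawn by integrating dz/dt = v_\theta(z, t, c) from t = 0 to t = 1 starting at a noise sample, and the manifold decoder would then map the result back to a trajectory.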

Abstract:
Grasping unknown objects from a single view has remained a challenging topic in robotics due to the uncertainty of partial observation. Recent advances in large-scale models have led to benchmark solutions such as GraspNet-1Billion. However, such learning-based approaches still face a critical limitation in performance robustness because of their sensitivity to sensing noise and environmental changes. To address this bottleneck in achieving highly generalized grasping, we abandon the traditional learning framework and introduce a new perspective: similarity matching, where similar known objects are utilized to guide the grasping of unknown target objects. We propose a method that robustly achieves unknown-object grasping from a single viewpoint through three key steps: 1) Leverage the visual features of the observed object to perform similarity matching with an existing database containing various object models, identifying potential candidates with high similarity; 2) Use the candidate models with pre-existing grasping knowledge to plan imitative grasps for the unknown target object; 3) Optimize the grasp quality through a local fine-tuning process. To address the uncertainty caused by partial and noisy observation, we propose a multi-level similarity matching framework that integrates semantic, geometric, and dimensional features for comprehensive evaluation. In particular, we introduce a novel point cloud geometric descriptor, the C-FPFH descriptor, which facilitates accurate similarity assessment between partial point clouds of observed objects and complete point clouds of database models. In addition, we incorporate the use of large language models, introduce the semi-oriented bounding box, and develop a novel point cloud registration approach based on plane detection to enhance matching accuracy under single-view conditions.

Abstract:
Swarm intelligence for uncrewed aerial vehicles (UAVs) significantly improves the success rate of executing intricate tasks using distributed platforms and aggregated effects. However, the high experimental costs and safety risks constrain its development. This paper introduces SIGMA (Swarm Intelligence Generic simulator for Multi-UAVs), a high-fidelity distributed UAV swarm simulator for swarm intelligence algorithms. As an agent-based modeling simulator (ABMS), SIGMA has three key innovations: First, an automatic model tuning method improves aircraft dynamics fidelity. Second, a bidirectional discrete-event simulation (BiDES) architecture resolves the time alignment challenges in distributed systems. Third, a multiagent learning toolbox ensures algorithm compatibility via an episodic training structure and a memory replay mechanism. In the verification part, the fidelity and scalability of the simulator are verified by quantitative simulations and experiments, and several successful applications demonstrate the practicality of the proposed simulator.

Abstract:
Recent studies demonstrate the potential of blockchain to enable robots in a swarm to achieve secure consensus about the environment, particularly when robots are homogeneous and perform identical tasks. Typically, robots receive rewards for their contributions to consensus achievement, but no studies have yet targeted heterogeneous swarms, in which the robots have distinct physical capabilities suited to different tasks. We present a novel framework that leverages domain knowledge to decompose the swarm mission into a hierarchy of tasks within smart contracts. This allows the robots to reach a consensus about both the environment and the action plan, allocating tasks among robots with diverse capabilities to improve their performance while maintaining security against faults and malicious behaviors. We refer to this concept as equitable and secure task allocation. Validated in Simultaneous Localization and Mapping missions, our approach not only achieves equitable task allocation among robots with varying capabilities, improving mapping accuracy and efficiency, but also shows resilience against malicious attacks.

Abstract:
Trajectory planning is a core task in autonomous driving. However, in diverse extreme scenarios characterized by unstructured obstacles, there is a lack of solutions that provide efficient computation, safety, and scene generalization capabilities. To address this issue, we propose a two-stage spatio-temporal joint trajectory planning method based on environmental assessment. In the first stage, we introduce the EA-Hybrid A* algorithm, which generates high-quality initial trajectories by evaluating environmental complexity, thereby significantly improving computational efficiency. The second stage formulates the trajectory planning problem as an optimal control problem, utilizing environmental assessment for joint spatio-temporal optimization, ensuring kinematic feasibility and obstacle avoidance. Experiments demonstrate that our method achieves higher success rates and planning speeds in extreme scenarios compared to state-of-the-art planning methods. Moreover, we have deployed and validated this approach in the CARLA simulator and real vehicles, proving its effectiveness and robustness in handling extreme environments.

Abstract:
Efficient exploration of unknown environments is crucial for autonomous robots, especially in confined and large-scale scenarios with limited communication. To address this challenge, we propose a collaborative exploration framework for a marsupial ground-aerial robot team that leverages the complementary capabilities of both platforms. The framework employs a graph-based path planning algorithm to guide exploration and deploy the aerial robot in areas where its expected gain significantly exceeds that of the ground robot, such as large open spaces or regions inaccessible to the ground platform, thereby maximizing coverage and efficiency. To facilitate large-scale spatial information sharing, we introduce a bandwidth-efficient, task-driven map compression strategy. This method enables each robot to reconstruct resolution-specific volumetric maps while preserving exploration-critical details, even at high compression rates. By selectively compressing and sharing key data, communication overhead is minimized, ensuring effective map integration for collaborative path planning. Simulation and real-world experiments validate the proposed approach, demonstrating its effectiveness in improving exploration efficiency while significantly reducing data transmission.

Abstract:
Tactile and kinesthetic perceptions are crucial for human dexterous manipulation, enabling reliable grasping of objects via proprioceptive sensorimotor integration. For robotic hands, even though acquiring such tactile and kinesthetic feedback is feasible, establishing a direct mapping from this sensory feedback to motor actions remains challenging. In this article, we propose a novel glove-mediated tactile-kinematic perception-prediction framework for grasp skill transfer from human intuitive and natural operation to robotic execution based on imitation learning, and its effectiveness is validated through generalized grasping tasks, including those involving deformable objects. First, we integrate a data glove to capture tactile and kinesthetic data at the joint level. The glove is adaptable for both human and robotic hands, allowing data collection from natural human hand demonstrations across different scenarios. It ensures consistency in the raw data format, enabling evaluation of grasping for both human and robotic hands. Second, we establish a unified representation of multimodal inputs based on graph structures with polar coordinates. We explicitly integrate the morphological differences into the designed representation, enhancing the compatibility across different demonstrators and robotic hands. Furthermore, we introduce the tactile-kinesthetic spatio-temporal graph networks, which leverage multidimensional subgraph convolutions and attention-based long short-term memory (LSTM) layers to extract spatio-temporal features from graph inputs to predict node-based states for each hand joint. These predictions are then mapped to final commands through a force-position hybrid mapping. Comparative experiments and ablation studies demonstrate that our approach surpasses other methods in grasp success rate, finger coordination, contact force management, and both grasp and computational efficiency, achieving results most akin to human grasping.

Abstract:
The integration of pre-trained visual representations (PVRs) has significantly advanced visuomotor policy learning. However, effectively leveraging these models remains a challenge. We identify temporal entanglement as a critical, inherent issue when using these time-invariant models in sequential decision-making tasks. This entanglement arises because PVRs, optimised for static image understanding, struggle to represent the temporal dependencies crucial for visuomotor control. In this work, we quantify the impact of temporal entanglement, demonstrating a strong correlation between a policy's success rate and the ability of its latent space to capture task-progression cues. Based on these insights, we propose a simple, yet effective disentanglement baseline designed to mitigate temporal entanglement. Our empirical results show that traditional methods aimed at enriching features with temporal components are insufficient on their own, highlighting the necessity of explicitly addressing temporal disentanglement for robust visuomotor policy learning.

Abstract:
The rapidly growing number of product categories in large-scale e-commerce makes accurate object identification for automated packing in warehouses substantially more difficult. As the catalog grows, intra-class variability and a long tail of rare or visually similar items increase, and, when combined with diverse packaging, cluttered containers, frequent occlusion, and large viewpoint changes, these factors amplify discrepancies between query and reference images, causing sharp performance drops for methods that rely solely on 2D appearance features. Thus, we propose RoboEye, a two-stage identification framework that dynamically augments 2D semantic features with domain-adapted 3D reasoning and lightweight adapters to bridge training-deployment gaps. In the first stage, we train a large vision model to extract 2D features for generating candidate rankings. A lightweight 3D-feature-awareness module then estimates 3D feature quality and predicts whether 3D re-ranking is necessary, preventing performance degradation and avoiding unnecessary computation. When invoked, the second stage uses our robot 3D retrieval transformer, comprising a 3D feature extractor that produces geometry-aware dense features and a keypoint-based matcher that computes keypoint-correspondence confidences between query and reference images instead of conventional cosine-similarity scoring. Experiments show that RoboEye improves Recall@1 by 7.1% over the prior state of the art (RoboLLM). Moreover, RoboEye operates using only RGB images, avoiding reliance on explicit 3D inputs and reducing deployment costs. The code used in this paper is publicly available at: https://github.com/longkukuhi/RoboEye.
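
For readers unfamiliar with the metric, the sketch below shows how Recall@1 is typically computed for a retrieval system that ranks reference items by cosine similarity, the conventional scoring that the second stage is said to replace. The feature vectors, labels, and function names are invented for illustration and have no connection to the RoboEye code base.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_references(query, references):
    """Return reference labels sorted from most to least similar to the query."""
    return sorted(references,
                  key=lambda label: cosine(query, references[label]),
                  reverse=True)


def recall_at_1(queries, references):
    """Fraction of queries whose top-ranked reference has the true label.

    queries: list of (feature_vector, true_label) pairs
    references: {label: feature_vector}
    """
    hits = sum(rank_references(feat, references)[0] == label
               for feat, label in queries)
    return hits / len(queries)


# Toy data (made up): two catalog items and three query images.
references = {"mug": [1.0, 0.0], "box": [0.0, 1.0]}
queries = [([0.9, 0.1], "mug"), ([0.2, 0.8], "box"), ([0.6, 0.4], "box")]
score = recall_at_1(queries, references)  # two of the three queries are hits
```

The third toy query is deliberately ambiguous: its appearance feature sits closer to the wrong item, which is the kind of case a re-ranking stage would aim to correct.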

Abstract:
This paper presents InsSo3D, an accurate and efficient method for large-scale 3D Simultaneous Localisation and Mapping (SLAM) using a 3D Sonar and an Inertial Navigation System (INS). Unlike traditional sonar, which produces 2D images containing range and azimuth information but lacks elevation information, 3D Sonar produces a 3D point cloud, which therefore does not suffer from elevation ambiguity. We introduce a robust and modern SLAM framework adapted to the 3D Sonar data using INS as prior, detecting loop closure and performing pose graph optimisation. We evaluated InsSo3D performance inside a test tank with access to ground truth data and in an outdoor flooded quarry. Comparisons to reference trajectories and maps obtained from an underwater motion tracking system and visual Structure From Motion (SFM) demonstrate that InsSo3D efficiently corrects odometry drift. The average trajectory error is below 21 cm during a 50-minute-long mission, producing a map of 10 m by 20 m with a 9 cm average reconstruction error, enabling safe inspection of natural or artificial underwater structures even in murky water conditions.

Abstract:
Estimating 7-DoF grasping poses (6-DoF with gripper width) in cluttered scenes is a critical challenge for robotic manipulation. In such environments, object edges often contain many promising grasp candidates, but relying solely on an incomplete single-view point cloud to infer them is difficult. While neural networks excel at learning edge features from RGB images, simply combining these with point clouds often fails to generalize to novel scenes. To address these challenges, we propose EdgeGrasp, which enhances edge perception by allowing each modality to contribute to the most suitable edge information source for improving grasping performance. The internal edge features are extracted through voxel-based sparse 3D convolution on the aggregated point cloud from the edge interior, ensuring a rich geometric representation while mitigating incompleteness at the edge. For external edges and junctions, a vision foundation model is employed to extract local zero-shot semantic features, capturing fine-grained details and improving cross-object generalization. Finally, edge spatial attention fuses these features into edge-enhanced features by encoding edge distance for estimating 7-DoF grasping poses. Experimental results demonstrate our method's effectiveness, achieving state-of-the-art performance on the GraspNet-1Billion benchmark. Real-world robotic experiments further validate its practical applicability.

Abstract:
In this paper, we demonstrate that mobile manipulation policies utilizing a 3D latent map achieve stronger spatial and temporal reasoning than policies relying solely on images. We introduce Seeing the Bigger Picture (SBP), an end-to-end policy learning approach that operates directly on a 3D map of latent features. In SBP, the map extends perception beyond the robot's current field of view and aggregates observations over long horizons. Our mapping approach incrementally fuses multiview observations into a grid of scene-specific latent features. A pre-trained, scene-agnostic decoder reconstructs target embeddings from these features and enables online optimization of the map features during task execution. A policy, trainable with behavior cloning or reinforcement learning, treats the latent map as a state variable and uses global context from the map obtained via a 3D feature aggregator. We evaluate SBP on scene-level mobile manipulation and sequential tabletop manipulation tasks. Our experiments demonstrate that SBP (i) reasons globally over the scene, (ii) leverages the map as long-horizon memory, and (iii) outperforms image-based policies in both in-distribution and novel scenes, e.g., improving the success rate by 15% for the sequential manipulation task.

Abstract:
Cooperative localization is essential for swarm applications like collaborative exploration and search-and-rescue missions. However, maintaining real-time capability, robustness, and computational efficiency on resource-constrained platforms presents significant challenges. To address these challenges, we propose D-GVIO, a buffer-driven and fully decentralized GNSS-Visual-Inertial Odometry (GVIO) framework that leverages a novel buffering strategy to support efficient and robust distributed state estimation. The proposed framework is characterized by four core mechanisms. Firstly, through covariance segmentation, covariance intersection, and a buffering strategy, we modularize the propagation and update steps in distributed state estimation, significantly reducing computational and communication burdens. Secondly, the left-invariant extended Kalman filter (L-IEKF) is adopted for information fusion, which exhibits superior state estimation performance over the traditional extended Kalman filter (EKF) since its state transition matrix is independent of the system state. Thirdly, a buffer-based re-propagation strategy is employed to handle delayed measurements efficiently and accurately by leveraging the L-IEKF, eliminating the need for costly re-computation. Finally, an adaptive buffer-driven outlier detection method is proposed to dynamically cull GNSS outliers, enhancing robustness in GNSS-challenged environments.

Abstract:
The present study proposes a method to identify fat and lung shadows and reduce fat shadows in obese patients to obtain clear ultrasound (US) images. The shadow identification method focuses on differences in shadow luminance and in the direction of image loss, which arise from the differing degrees of reflection and absorption of US waves in fat and lungs, thereby registering the degree of image blurring caused by each shadow. The fat shadow reduction method is based on preliminary experiments that derive the appropriate probe pressure for the patient's BMI, enabling the acquisition of US images with fewer fat shadows while reducing the pressure felt by the patient. Verification tests were conducted using the proposed method. With the shadow identification method, fat shadows and lung shadows were detected with F-measures of 92.7% and 96.8%, respectively. The introduction of the shadow reduction method increased the number and range of mitral valve detections. These results underline the usefulness of the proposed method.
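
For readers unfamiliar with the F-measure reported above, the following is a minimal sketch of how it is computed from detection counts. The function name and the example counts are hypothetical and are not taken from the study; the abstract does not state which beta was used, so the common F1 case (beta = 1) is assumed.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only (not data from the study).
score = f_measure(tp=90, fp=10, fn=10)
print(f"F1 = {score:.3f}")
```

Because the F-measure is a harmonic mean, it stays low unless both precision and recall are high, which makes it a stricter summary than plain accuracy when shadows are rare in the image.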

Abstract:
This paper presents a graph-based multi-agent reinforcement learning framework for scalable UAV formation control and target tracking. The framework introduces a conflict-aware graph representation that aggregates neighborhood information through attention-based message passing, enabling each UAV to reason about both local interactions and global formation geometry. To generate agile and stable maneuvers, a hierarchical policy is designed that first selects motion primitives from a structured library and then refines them with continuous trajectory adjustments, ensuring smooth and dynamically feasible flight in cluttered environments. Extensive simulations and real-world experiments validate the proposed approach, demonstrating accurate target tracking, stable formation maintenance, and robust adaptation across varying swarm sizes and obstacle densities. In particular, policies trained on smaller swarms generalize effectively to larger ones without retraining, highlighting the scalability and practicality. The demonstration video is available on the project website: https://swift520.github.io/Formation-Tracking/.

Abstract:
Mobile manipulation systems have advanced significantly in recent years. However, substantial gaps remain that prevent state-of-the-art platforms from achieving widespread real-world deployment, particularly in reliably grasping items in unstructured environments. To help bridge this gap, we develop SHOPPER, a mobile manipulation robot platform designed to push the boundaries of reliable and generalizable grasp strategies. We develop these grasp strategies and deploy them in a real-world grocery store--an exceptionally challenging setting chosen for its vast diversity of manipulable items, fixtures, and layouts. In this work, we present our detailed approach to designing general grasp strategies towards picking any item in a real grocery store. Additionally, we provide an in-depth analysis of our latest real-world field test, discussing key findings related to fundamental failure modes over hundreds of distinct pick attempts. Through our detailed analysis, we aim to offer valuable practical insights and identify key grasping challenges, which can guide the robotics community towards pressing open problems in the field. Lastly, we provide a dataset of 1200+ grasp attempts in unseen grocery stores.

Abstract:
Robots operating in dynamic and shared environments benefit from anticipating contact before it occurs. We present GenTact-Prox, a fully 3D-printed artificial skin that integrates tactile and proximity sensing for contact detection and anticipation. The artificial skin platform is modular in design, procedurally generated to fit any robot morphology, and can cover the whole body of a robot. The skin achieved detection ranges of up to 18 cm during evaluation. To characterize how robots perceive nearby space through this skin, we introduce a data-driven framework for mapping the Perisensory Space---the body-centric volume of space around the robot where sensors provide actionable information for contact anticipation. We demonstrate this approach on a Franka Research 3 robot equipped with five GenTact-Prox units, enabling online object-aware operation and contact prediction.

Abstract:
Python bindings are a critical bridge between high-performance C++ libraries and the flexibility of Python, enabling rapid prototyping, reproducible experiments, and integration with simulation and learning frameworks. Yet, generating bindings for large codebases is a tedious process that creates a heavy burden for a small group of maintainers. In this work, we investigate the use of Large Language Models (LLMs) to assist in generating Nanobind wrappers, with human experts kept in the loop. Our workflow mirrors the structure of the C++ codebase, scaffolds empty wrapper files, and employs LLMs to fill in binding definitions. Experts then review and refine the generated code to ensure correctness, compatibility, and performance. Through a case study on a large C++ motion planning library, we document common failure modes, including mismanaging shared pointers, overloads, and trampolines, and show how in-context examples and careful prompt design improve reliability. Experiments demonstrate that the resulting bindings achieve runtime performance comparable to legacy solutions. Beyond this case study, our results provide general lessons for applying LLMs to binding generation in large-scale C++ projects.

Abstract:
Aerial manipulators, which combine robotic arms with multi-rotor drones, face strict constraints on arm weight and mechanical complexity. In this work, we study a lightweight 2-degree-of-freedom (DoF) arm mounted on a quadrotor via a differential mechanism, capable of full six-DoF end-effector pose control. While the minimal design enables simplicity and reduced payload, it also introduces challenges such as underactuation and sensitivity to external disturbances. To address these, we employ reinforcement learning, training a Proximal Policy Optimization (PPO) agent in simulation to generate feedforward commands for quadrotor acceleration and body rates, along with joint angle targets. These commands are tracked by an incremental nonlinear dynamic inversion (INDI) attitude controller and a PID joint controller, respectively. Flight experiments demonstrate centimeter-level position accuracy and degree-level orientation precision, with robust performance under external force disturbances, including manipulation of heavy loads and pushing tasks. The results highlight the potential of learning-based control strategies for enabling contact-rich aerial manipulation using simple, lightweight platforms. Videos of the experiment and the method are summarized in https://youtu.be/bWLTPqKcCOA.

Abstract:
Zero-shot object navigation in unknown environments presents significant challenges, mainly due to two key limitations: insufficient semantic guidance leads to inefficient exploration, while limited spatial memory resulting from environmental structure causes entrapment in local regions. To address these issues, we propose SSR-ZSON, a spatial-semantic relative zero-shot object navigation method based on the TARE hierarchical exploration framework, integrating a viewpoint generation strategy that balances spatial coverage and semantic density with an LLM-based global guidance mechanism. The performance improvement of the proposed method is due to two key innovations. First, the viewpoint generation strategy prioritizes areas of high semantic density within traversable sub-regions to maximize spatial coverage and minimize invalid exploration. Second, the LLM-based global guidance mechanism assesses semantic associations to direct navigation toward high-value spaces, preventing local entrapment and ensuring efficient exploration. Deployed on hybrid Habitat-Gazebo simulations and physical platforms, SSR-ZSON achieves real-time operation and superior performance. On the Matterport3D and Habitat-Matterport3D datasets, it improves the Success Rate (SR) by 18.5% and 11.2%, and the Success weighted by Path Length (SPL) by 0.181 and 0.140, respectively, over state-of-the-art methods.
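
The two metrics quoted above can be sketched as follows, assuming the standard definition of SPL used in embodied navigation benchmarks (per-episode success weighted by the ratio of shortest-path length to the longer of the actual and shortest path lengths, averaged over episodes). The function names and episode values are hypothetical and are not results from the paper.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the target object was reached."""
    return sum(1 for s in successes if s) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if ok:
            # A successful episode scores 1 only if the path was optimal.
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Hypothetical episodes for illustration only.
successes = [True, True, False, True]
shortest = [5.0, 4.0, 6.0, 10.0]
taken = [5.0, 8.0, 9.0, 12.5]

print(f"SR  = {success_rate(successes):.3f}")
print(f"SPL = {spl(successes, shortest, taken):.3f}")
```

SPL is always at most SR, since each successful episode contributes a weight no larger than 1; a large gap between the two indicates that the agent succeeds but takes long detours.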

Abstract:
Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe.

Abstract:
Diverse and realistic data are essential for developing reliable autonomous driving (AD) systems, yet collecting and annotating large-scale real-world driving datasets is costly and time-consuming. Recent advances in synthetic scene generation and editing have enabled the creation of diverse driving scenarios. However, fully synthetic scenes often lack real-world grounding, while existing editing approaches are either limited to pure video manipulation or involve cumbersome manual operations. To solve this, we present LangEditor, a natural language-driven 4D editing framework for dynamic driving scenes. LangEditor automatically grounds free-form language instructions to target vehicles and their editable attributes, generating physically plausible trajectories consistent with scene semantics. To ensure spatiotemporal coherence and visual fidelity, we propose a joint refinement strategy that integrates a Dynamic Illumination-Aware Shadow Module for lighting consistency across space-time, and an Appearance Refinement module for synthesizing high-quality textures and material properties. Extensive experiments on realistic driving datasets demonstrate that LangEditor enables intuitive, fine-grained, and photorealistic 4D scene manipulation, outperforming existing baselines in both editing quality and controllability. Our approach bridges the gap between realistic scene editing and user-friendly controllability, offering a powerful tool for data augmentation and simulation in AD research.

Abstract:
Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision-language models offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question-answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions without needing specialized architectures tailored for each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling. Dataset and models are available at https://github.com/botmahn/PedestrianQA

Abstract:
To effectively operate in human-centered environments, robots must possess the capability to rapidly adapt to novel and changing situations. Techniques such as Learning from Demonstration enable fast learning without the need for explicit coding. However, in certain cases they exhibit limitations in generalizing beyond the set of demonstrations, which constrains their ability to rapidly adapt to unforeseen scenarios. In this work, we present a movement primitive learning algorithm based on Gaussian Processes, combined with a zero-shot adaptation to new via-points without requiring retraining, through Pathwise Conditioning. The algorithm not only learns the movement policy but is also capable of adapting it rapidly while preserving prior knowledge. The method has been evaluated through comparisons against other state-of-the-art approaches, experiments in simulated environments, as well as on a real robotic platform, generating new solutions for learned tasks by modifying via-points in both position and orientation.

Abstract:
This paper presents a fully integrated autonomous docking system validated through closed-loop sea trials on the milliAmpere1 research ferry operating in a live maritime harbour with moving vessels. Real harbour environments require continuous situational awareness and adaptive decision-making under dynamic traffic conditions. The proposed architecture combines cartographic land masking, LiDAR-based clustering, probabilistic multi-target tracking (JIPDA), dynamic footprint estimation, adaptive docking pose selection, and real-time path replanning within a finite state machine framework. Rather than introducing new algorithms, the contribution lies in system-level integration and operational validation of a complete perception-to-control pipeline under realistic maritime constraints. The system is demonstrated in multiple closed-loop experiments including collision avoidance and adaptive docking with moving obstacles. Results highlight both performance characteristics and practical deployment considerations, including runtime behaviour, sensor limitations, and integration trade-offs. The work provides empirical evidence that robust autonomous docking in dynamic harbour environments can be achieved through carefully engineered integration of established methods.

Abstract:
This paper introduces a path planning algorithm for executing robotic manipulation tasks with maximum dexterity in the workspace. This is achieved by using the workspace density of the end-effector as the objective function in a sampling-based planner. In doing so, the path planning algorithm prioritizes joint configurations that correspond to the highest density of local end-effector positions. This results in a singularity-avoidant path planning algorithm that favors redundancy, making it a favorable approach for manipulation scenarios in which dexterity is paramount. However, due to the exponential relationship between the number of possible end-effector positions and the number of joints, computing the workspace density via traditional methods is computationally intractable for most modern industrial robots. In this paper, a newly developed approach is taken wherein the workspace density is approximated by a Gaussian mixture model that solves for the optimal workspace density function subject to higher-order statistical moment constraints. The statistical moments of the workspace density function are computed recursively with a minimum number of sample points by using a non-product quadrature rule known as the Conjugate Unscented Transform (CUT). This results in a computationally efficient framework that allows the user to trade accuracy against computation time by varying the number of mixture components and the number of statistical moments used in the workspace density approximation. As a demonstration, the algorithm is implemented on the Precision Assembled Space Structure (PASS) platform at NASA, illustrating its effectiveness in dexterous robotic assembly tasks.

Abstract:
Flying fish have often served as an inspiration for engineering designs due to their remarkable ability for cross-domain locomotion between water and air. Previous observations and simulations suggest that taxiing behavior before takeoff is indispensable, yet the mechanics of this transition remain unclear. In this work, we present the design and dynamic modeling of a robotic flying fish to investigate swimming and taxiing locomotion, with a particular focus on tail pitching. We develop a bio-inspired prototype with an active tail-pitching structure and a high-power-density tail-beating propulsion system. We further formulate a cross-domain dynamic model that couples hydrodynamic and aerodynamic forces during taxiing. Simulations show that pitching the tail downward increases peak height and enables the robot to leave the water at a more aerodynamically favorable angle of attack. Experiments with the robotic prototype validate these trends and show that downward tail pitching increases upward force and body elevation with only minor loss of forward speed. These results provide insight into the role of tail pitching in flying fish taxiing and takeoff preparation.

Abstract:
Long-horizon robotic manipulation remains a critical challenge in robotics. Hierarchical reinforcement learning offers a promising solution, but often suffers from an imbalance dilemma: simplifying skill learning increases the complexity of planning, thereby expanding the solution space and computational burden of planning. To tackle this challenge, we propose a Hierarchical Reinforcement Learning framework with Dynamic Kolmogorov-Arnold Network (DyKAN) based Actor Critic, named HIKER. Firstly, HIKER innovates with a dual-chain design that decomposes the complex task into two intersecting sub-chains, reducing the optimization conflict across subtasks and alleviating the burden on the planning model. Secondly, we develop DyKAN, a scalable neural network for both actor and critic in the skill model of HIKER. DyKAN adaptively adjusts grids and basis functions while preserving learned knowledge, enabling efficient learning of complex manipulation skills. Furthermore, to optimize DyKAN's performance, we design a per-layer update module that uses Dynamic Tanh (DyT) and low-rank decomposition to ensure stable, low-cost updates during training. Finally, experiments on long-horizon robotic manipulation tasks demonstrate that HIKER significantly improves efficiency and robustness, yielding higher-quality skill models and achieving a 10.9% increase in task success rate under the high noise condition. Further insights are available on the website: https://sites.google.com/view/hikerdykan.

Abstract:
Multi-agent robotic manipulation remains challenging due to the combined demands of coordination, grasp stability, and collision avoidance in shared workspaces. To address these challenges, we propose the Adaptive Dynamic Modality Diffusion Policy (ADM-DP), a framework that integrates vision, tactile, and graph-based (multi-agent pose) modalities for coordinated control. ADM-DP introduces four key innovations. First, an enhanced visual encoder merges RGB and point-cloud features via Feature-wise Linear Modulation (FiLM) to enrich perception. Second, a tactile-guided grasping strategy uses Force-Sensitive Resistor (FSR) feedback to detect insufficient contact and trigger corrective grasp refinement, improving grasp stability. Third, a graph-based collision encoder leverages shared tool center point (TCP) positions of multiple agents as structured kinematic context to maintain spatial awareness and reduce inter-agent interference. Fourth, an Adaptive Modality Attention Mechanism (AMAM) dynamically re-weights modalities according to task context, enabling flexible fusion. For scalability and modularity, a decoupled training paradigm is employed in which agents learn independent policies while sharing spatial information. This maintains low interdependence between agents while retaining collective awareness. Across seven multi-agent tasks, ADM-DP achieves 12-25% performance gains over state-of-the-art baselines. Ablation studies show the greatest improvements in tasks requiring multiple sensory modalities, validating our adaptive fusion strategy and demonstrating its robustness for diverse manipulation scenarios.
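
The FiLM operation mentioned above can be illustrated with a minimal, dependency-free sketch: a conditioning vector predicts a per-channel scale (gamma) and shift (beta) that are applied to a feature vector. Which modality conditions which is an assumption here (point-cloud features modulating RGB features), as are all names and weights; the abstract does not specify the encoder's actual layout.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Plain affine layer: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(
    features: Vector,
    condition: Vector,
    w_gamma: Matrix,
    b_gamma: Vector,
    w_beta: Matrix,
    b_beta: Vector,
) -> Vector:
    """Feature-wise Linear Modulation: gamma(condition) * features + beta(condition)."""
    gamma = linear(w_gamma, b_gamma, condition)
    beta = linear(w_beta, b_beta, condition)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Hypothetical two-channel example (assumed roles: RGB features modulated
# by point-cloud features; the weights are arbitrary).
rgb_features = [1.0, 2.0]
cloud_features = [0.5, -1.0]
fused = film(
    rgb_features,
    cloud_features,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]],
    b_gamma=[1.0, 1.0],
    w_beta=[[0.0, 0.0], [1.0, 1.0]],
    b_beta=[0.0, 0.0],
)
print(fused)
```

In this toy case the second channel receives a scale of zero, so the conditioning input suppresses it entirely and only the learned shift remains, which shows how FiLM lets one modality gate another channel by channel.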

Abstract:
The optimization of flexure-based planar parallel robots with traditional methods demands significant computational resources due to the coupling of kinematic and structural effects. This work presents a decoupled optimization approach in which the geometry of the manipulator is optimized for workspace using a kinematic model. A modified Moore boundary-following algorithm is introduced to compute the workspace efficiently. With this approach, the optimal kinematic design parameters, namely the leg ratio and the initial orientation of the end effector, are determined and presented for a number of cases defined by combinations of joint angle constraints.

Abstract:
We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training/fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed two supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multi-modal language model Qwen2.5-VL-7B to generate a natural language description of each part's appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multi-modal datasets for CAD generation. SldprtNet features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.

Abstract:
Robotic manipulation, especially of fragile and irregularly shaped objects, remains a significant challenge due to the need for both adaptability and precise tactile feedback. In this work, we introduce INTACT-GRIP, a robotic gripper that combines soft manipulation and high-resolution tactile sensing for inflation-based soft grasping. INTACT-GRIP integrates inflatable balloons with vision-based tactile feedback, enabling fingertip stiffness modulation for stable and damage-free manipulation of fragile and irregularly shaped objects. To evaluate its performance, we conducted a series of qualitative and quantitative experiments. In these experiments, inflation pressure was manually controlled by a human operator, who adjusted and stopped the pressure based on real-time visual feedback of the captured texture features. The results demonstrate the system's ability to safely conform to fragile and irregularly shaped objects with varying stiffness, enabling pressure-controlled grasping and high-resolution tactile imaging during contact. Furthermore, a case study with a robotic arm highlighted the system's potential as a versatile solution for precise and soft manipulation of delicate objects, supported by pressure-adjustable fingertips and real-time visual-tactile feedback.

Abstract:
The simultaneous arrival of multiple mobile robots at their respective target points is crucial for cooperative tasks such as encirclement, interception, and disaster relief. Although the problem of simultaneous arrival is inherently complex, it becomes even more challenging in multi-robot systems with curvature-constrained trajectories and constant-speed requirements that may differ among robots, along with the need for distributed, real-time, and low-communication control. These constraints are typical for a multi-robot system, such as one consisting of fixed-wing UAVs or car-like mobile robots. To address this challenge, this paper proposes a distributed switching control method based on the maximum consensus protocol. Inspired by the optimization principles and geometric properties of Dubins paths, we introduce a virtual time variable and design a hybrid control law that combines optimal control with saturated proportional control. Under the proposed control law, each robot is driven to approach the maximum virtual time among its neighbors, thereby achieving simultaneous arrival under mild conditions. Furthermore, we prove that in certain cases the proposed method achieves a theoretically optimal arrival time, and its effectiveness and robustness are validated through extensive simulations and real-world experiments.
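
As a rough illustration of the maximum consensus idea referenced above, the sketch below runs a discrete-time max-consensus update on a small communication graph: each robot repeatedly replaces its value with the largest value among itself and its neighbors. This is not the paper's hybrid control law, which drives robots toward the maximum virtual time under curvature and speed constraints; the graph, the values, and the function names are hypothetical.

```python
from typing import Dict, List


def max_consensus_step(values: List[float], neighbors: Dict[int, List[int]]) -> List[float]:
    """One synchronous round: each node adopts the max over itself and its neighbors."""
    return [
        max([values[i]] + [values[j] for j in neighbors[i]])
        for i in range(len(values))
    ]


def run_max_consensus(
    values: List[float], neighbors: Dict[int, List[int]], rounds: int
) -> List[float]:
    """Apply the max-consensus update a fixed number of rounds."""
    current = list(values)
    for _ in range(rounds):
        current = max_consensus_step(current, neighbors)
    return current


# Hypothetical line graph 0 - 1 - 2 - 3 with assumed initial virtual times.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
virtual_times = [3.0, 7.5, 2.0, 5.0]

# On a connected graph, agreement is reached within (graph diameter) rounds.
agreed = run_max_consensus(virtual_times, neighbors, rounds=3)
print(agreed)
```

Each robot only exchanges a single scalar with its neighbors per round, which is consistent with the low-communication, distributed setting described in the abstract.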

Abstract:
Planning collision-free trajectories in complex environments remains a core challenge in robotics. Existing corridor-based planners, which rely on decomposing the free space into collision-free subsets, scale poorly with environmental complexity and require explicit allocation of time windows to trajectory segments. We introduce a new trajectory parameterization that represents trajectories in a nonconvex collision-free corridor as lying in a convex Cartesian product of balls. This parameterization allows us to decouple problem size from the geometric complexity of the solution and naturally avoids explicit time allocation by allowing trajectories to evolve continuously inside ellipsoidal corridors. Building on this representation, we formulate the Orthogonal Trust Region Problem (Orth-TRP), a specialized convex program with separable block constraints, and develop a solver that exploits this parallel structure and the unique structure of each parallel subproblem for efficient optimization. Experiments on a quadrotor trajectory planning benchmark show that our approach produces smoother trajectories and lower runtimes than state-of-the-art corridor-based planners, especially in highly complicated environments.
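
To make the "separable block constraints" point concrete, the sketch below shows why a Cartesian product of balls is convenient: projecting onto the product reduces to projecting each block onto its own ball independently, so the blocks can be processed in parallel. This is only an illustration of that separability, not the Orth-TRP solver itself; the function names and numbers are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def project_onto_ball(point: Vector, center: Vector, radius: float) -> Vector:
    """Euclidean projection of a point onto a closed ball."""
    offset = [p - c for p, c in zip(point, center)]
    dist = math.sqrt(sum(o * o for o in offset))
    if dist <= radius:
        return list(point)  # already feasible
    scale = radius / dist
    return [c + scale * o for c, o in zip(center, offset)]


def project_onto_product_of_balls(
    points: List[Vector], centers: List[Vector], radii: List[float]
) -> List[Vector]:
    """Projection onto a Cartesian product of balls, done block by block."""
    return [
        project_onto_ball(p, c, r) for p, c, r in zip(points, centers, radii)
    ]


# Hypothetical two-block example with unit balls centered at the origin.
blocks = [[3.0, 4.0], [0.5, 0.0]]
centers = [[0.0, 0.0], [0.0, 0.0]]
radii = [1.0, 1.0]

projected = project_onto_product_of_balls(blocks, centers, radii)
print(projected)
```

Because no constraint couples one block to another, the cost of enforcing feasibility grows linearly with the number of blocks, independent of how convoluted the underlying corridor geometry is.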

Abstract:
Unmanned Aerial Vehicles (UAVs) have quickly become common in various airspaces, representing a wide range of applications from recreational flying to commercial photography and package delivery. With the increasing prevalence of UAVs, it becomes critical that both manned and unmanned aircraft can detect UAVs and other flying objects from long range to effectively track movement and ensure safe operation in shared spaces. While several datasets have been introduced for drone detection, the need for expanded high-quality data persists, especially in the area of high-resolution long-range drone data. To address this, we introduce a high-resolution dataset of 102,532 long-range RGB images of drones, sampled at 5 FPS from 128 distinct video clips taken mid-flight during 17 different data collection days spread over 8 months to ensure a wide variety of lighting scenarios, flight locations, and background elements. The dataset includes comprehensive drone range information, as well as 29,630 IR images, all paired with RGB counterparts from the base dataset. As one of the first drone detection datasets to leverage 4K image resolution and paired 640x512 IR images, our work represents a significant advancement to enable the detection of drones at long range. For access to the complete dataset, please visit https://research.coe.drexel.edu/ece/imaple/lrddv3

Abstract:
Semi-autonomy in telemanipulation frameworks has the potential to reduce user cognitive load while preserving human perceptual oversight and decision-making capabilities. However, existing semi-autonomous telemanipulation systems are heavily dependent on calibration and hardware configurations, making rapid deployment difficult. Moreover, existing VR-based telemanipulation systems lack intuitive interaction mechanisms, requiring users to manage complex control interfaces. To address these limitations, we introduce an intuitive and immersive semi-autonomous robotic telemanipulation system that leverages a mixed reality (MR) headset with minimal hardware requirements. Requiring only CPU processing and coarse calibration procedures, the system combines human perception with autonomous control strategies through natural hand tracking and finger gestures to achieve precise, reliable task execution. To validate this approach, we conducted thorough evaluations involving complex peg-in-hole tasks and compared performance with and without the proposed control strategy. The results highlight that our system demonstrates robust performance, and the proposed control strategy further enhances its stability and effectiveness.

Abstract:
Remote photoplethysmography (rPPG) is a camera-based technique for measuring physiological signals, particularly cardiac activity. From the remotely measured signals, heart rate can be estimated, which is crucial for health monitoring. In this study, we investigate a driver health monitoring system based on remote heart rate estimation. However, driving environments represent uncontrolled settings where videos are subject to varying illumination conditions and frequent head movements. We introduce MS-rPPG, a multi-spectral framework that combines RGB with near-infrared (NIR) face video to make rPPG estimation more robust under challenging driving conditions. To combine the complementary features from the two spectral videos, we propose a cross-spectral linear modulation (CSLM) strategy based on frequency-domain analysis. Moreover, we introduce MS-Mamba, a novel state space model designed to effectively model long-range temporal dependencies while jointly capturing cross-channel interactions between multi-spectral features. We collected a real-world dataset called MS-Drive, which was recorded from 50 participants while driving. The proposed method was evaluated on the MR-NIRP Car and MS-Drive datasets. The experimental results indicate that MS-rPPG shows better robustness and heart rate estimation accuracy than previous methods, highlighting its promise for driver health monitoring. The codes are available at this https URL.

Abstract:
In in-vivo environments, magnetically actuated soft robots offer advantages such as wireless operation and precise control, showing promising potential for painless detection and therapeutic procedures. We developed a trileg magnetically driven soft robot (TMR) whose multi-legged design enables more flexible gaits and diverse motion patterns. For reconfigurable soft robots made of silicone, navigation can be decomposed into sequential motions such as squatting, rotating, lifting a leg, and walking. The robot's motion and behavior depend on its bending shapes. To bridge motion type description and specific low-level voltage control, we introduce TMR-VLA, an end-to-end multi-modal system for a trileg magnetic soft robot capable of performing hybrid motion types, which is a promising step toward navigation by adapting the robot's shape to language-constrained motion types. TMR-VLA builds on the embodied endoluminal localization ability of EndoVLA and takes sequential frames and natural language commands as input. Low-level voltage output is generated based on the current observation state and the specific motion type description. The results show that TMR-VLA can predict how the voltage applied to the TMR will change the dynamics of a silicone-made soft robot. TMR-VLA reached a 78% average success rate.

Abstract:
Generalist robot policies must operate safely and reliably in everyday human environments such as homes, offices, and warehouses, where people and objects move unpredictably. We present Dynamic Neural Potential Field (NPField-GPT), a learning-enhanced model predictive control (MPC) framework that couples classical optimization with a Transformer-based predictor of footprint-aware repulsive potentials. Given an occupancy sub-map, robot footprint, and optional dynamic-obstacle cues, our NPField-GPT model forecasts a horizon of differentiable potentials that are injected into a sequential quadratic MPC program via L4CasADi, yielding real-time, constraint-aware trajectory optimization. We additionally study two baselines: NPField-StaticMLP, where a dynamic scene is treated as a sequence of static maps; and NPField-DynamicMLP, which predicts the future potential sequence in parallel with an MLP. In dynamic indoor scenarios from BenchMR and on a Husky UGV in office corridors, NPField-GPT produces more efficient and safer trajectories under motion changes, while StaticMLP/DynamicMLP offer lower latency. We also compare with the CIAO and MPPI baselines. Across methods, the Transformer+MPC synergy preserves the transparency and stability of model-based planning while learning only the part that benefits from data: spatiotemporal collision risk. Code and trained models are available at https://github.com/CognitiveAISystems/Dynamic-Neural-Potential-Field

Abstract:
Recent monocular foundation models excel at zero-shot depth estimation, yet their outputs are inherently relative rather than metric, limiting direct use in robotics and autonomous driving. We leverage the fact that relative depth preserves global layout and boundaries: by calibrating it with sparse range measurements, we transform it into a pseudo metric depth prior. Building on this prior, we design a refinement network that follows the prior where reliable and deviates where necessary, enabling accurate metric predictions from very few labeled samples. The resulting system is particularly effective when curated validation data are unavailable, sustaining stable scale and sharp edges across few-shot regimes. These findings suggest that coupling foundation priors with sparse anchors is a practical route to robust, deployment-ready depth completion under real-world label scarcity.
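
The calibration step described above can be illustrated with a minimal sketch. It assumes a single global scale and shift fitted by least squares on the pixels that carry a sparse range measurement; the abstract does not state the calibration model, so this choice and the function name `calibrate_relative_depth` are hypothetical.

```python
import numpy as np

def calibrate_relative_depth(rel_depth, sparse_metric, mask):
    """Fit metric ~ s * rel + t on pixels with a sparse range measurement.

    Assumption: one global scale s and shift t (not taken from the abstract).
    Returns the pseudo metric depth map together with s and t.
    """
    x = rel_depth[mask].astype(np.float64)
    y = sparse_metric[mask].astype(np.float64)
    # Design matrix [rel, 1] for the affine least-squares fit.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * rel_depth + t, s, t

# Synthetic example: the "true" metric depth is an affine map of relative depth.
rng = np.random.default_rng(0)
rel = rng.uniform(0.0, 1.0, size=(4, 6))
true_metric = 12.0 * rel + 1.5

# Only three pixels carry a range measurement (the sparse anchors).
mask = np.zeros(rel.shape, dtype=bool)
mask[0, 1] = mask[2, 4] = mask[3, 0] = True
sparse = np.where(mask, true_metric, 0.0)

pseudo_metric, scale, shift = calibrate_relative_depth(rel, sparse, mask)
```

In this synthetic case the fit recovers the scale and shift exactly; with real data the resulting map would only be a prior for a refinement stage such as the one the abstract describes.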

Abstract:
In this work, we introduce an adaptive hierarchical framework for efficient 3D object detection from point cloud data, designed to dynamically balance computational efficiency and detection performance. Our approach employs a shared feature extractor and multiple detector backbones of varying widths, enabling selective activation of models based on the complexity of the input scene. A novel feature gating mechanism dynamically determines the most relevant features for reduced-width backbones, while a surrogate loss prediction module ranks models in real-time, ensuring optimal backbone selection with minimal overhead. This adaptive strategy reduces computational costs by 41.4% while incurring only a 2.44% reduction in detection accuracy across a range of real-world driving scenes (urban, highway, residential, campus, person) from the KITTI dataset. By addressing runtime adaptability, a critical gap in existing 3D detection frameworks, our method provides a significant improvement for deploying high-performance detection models in resource-constrained environments.

Abstract:
The global trend in robotics has shifted towards deploying humanoid robots and mobile manipulators in industrial settings to automate repetitive and structured tasks traditionally performed by human workers. However, most tools and equipment are designed for human hands, and current grippers or end-effectors are highly specialized, limiting their ability to fully replace human handling of simple tools and tasks. This study proposes a novel frictional and prismatic pin-array gripper developed for universal gripping and tool manipulation. The pin-array structure of the gripper mimics the behavior of soft grippers while incorporating rigid components, enabling adaptability to various shapes and sizes. Each pin features semi-automatic actuation through a compression spring, supporting the underactuated mechanism. Most existing studies on grippers focus on simple pick-and-place tasks, whereas the proposed gripper extends functionality to practical tool usage. Enabled by the pin-array structure, it provides increased contact surface and support points, ensuring stable gripping and enhanced manipulation performance. In the evaluation, the pin-array gripper achieved a payload capacity of 2400 g, significantly outperforming the conventional RG2-FT gripper and the frictional flat gripper, which reached maximum capacities of 800 g and 400 g, respectively. It also exhibited higher grasping forces, measuring 1.17 times greater than the RG2-FT gripper and up to 23 times greater than the frictional flat gripper. For tool manipulation, the pin-array gripper exhibited significantly lower manipulation errors, with 21.67 and 6.59 times fewer errors than the RG2-FT and flat grippers, respectively, when handling the hammer, and 7.69 and 4.45 times fewer for the metal file. Additionally, qualitative demonstrations in universal gripping, omnidirectional gripping, and tool usage further validated the gripper's performance in mobile manipulator tasks.

Abstract:
Navigating in crowded environments requires the robot to be equipped with high-level reasoning and planning techniques. Existing works focus on developing complex and heavyweight planners while ignoring the role of human intelligence. Since humans are highly capable agents who are also widely available in a crowd navigation setting, we propose an alternative scheme where the robot utilises people as planners to benefit from their effective planning decisions and social behaviours. Through a set of rule-based evaluations, we identify suitable human leaders who exhibit the potential to guide the robot towards its goal. Using a simple base planner, the robot follows the selected leader through short-horizon subgoals that are designed to be straightforward to achieve. We demonstrate through both simulated and real-world experiments that our novel framework generates safe and efficient robot plans compared to existing planners, even without predictive or data-driven modules. Our method also brings human-like robot behaviours without explicitly defining traffic rules and social norms. Code will be available at https://github.com/centiLinda/PeopleAsPlanner.git.

Abstract:
This paper presents ARTEMIS, an end-to-end autonomous driving framework that combines autoregressive trajectory planning with Mixture-of-Experts (MoE). Traditional modular methods suffer from error propagation, while existing end-to-end models typically employ static one-shot inference paradigms that inadequately capture the dynamic changes of the environment. ARTEMIS takes a different approach: by generating trajectory waypoints sequentially, it preserves critical temporal dependencies while dynamically routing scene-specific queries to specialized expert networks. It effectively mitigates the trajectory quality degradation encountered when guidance information is ambiguous, and overcomes the inherent representational limitations of single network architectures when processing diverse driving scenarios. Additionally, we use a lightweight batch reallocation strategy that significantly improves the training speed of the Mixture-of-Experts model. In experiments on the NAVSIM dataset, ARTEMIS achieves 87.0 PDMS and 83.1 EPDMS with a ResNet-34 backbone, demonstrating state-of-the-art performance on multiple metrics.

Abstract:
Optimizing high-degree-of-freedom robotic manipulators requires searching complex, high-dimensional configuration spaces, a task that is computationally challenging for classical methods. This paper introduces a quantum-native framework that integrates Quantum Machine Learning (QML) with Grover's algorithm to solve kinematic optimization problems efficiently. A parameterized quantum circuit is trained to approximate the forward kinematics model, which then constructs an oracle to identify optimal configurations. Grover's algorithm leverages this oracle to provide a quadratic reduction in search complexity. Demonstrated on 1-DoF, 2-DoF, and dual-arm manipulator tasks, the method achieves significant speedups (up to 93x over classical optimizers like Nelder-Mead) as problem dimensionality increases. This work establishes a foundational, quantum-native framework for robot kinematic optimization, effectively bridging quantum computing and robotics problems.
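
To make the "quadratic reduction in search complexity" concrete, the sketch below computes the textbook Grover iteration count and success probability for an unstructured search over N candidates with M marked items. It is a classical calculation only: it does not model the paper's QML-built oracle, and treating the configuration space as a discretized set of N candidates is an assumption made for illustration. The function names are hypothetical.

```python
import math

def grover_iterations(n_items: int, n_marked: int = 1) -> int:
    """Number of Grover iterations that maximizes the success probability.

    With theta = asin(sqrt(M/N)), the amplitude of the marked states after
    k iterations is sin((2k + 1) * theta), so the best k is the integer
    nearest to pi / (4 * theta) - 1/2.
    """
    theta = math.asin(math.sqrt(n_marked / n_items))
    return max(0, round(math.pi / (4 * theta) - 0.5))

def success_probability(n_items: int, n_marked: int, k: int) -> float:
    """Probability of measuring a marked item after k Grover iterations."""
    theta = math.asin(math.sqrt(n_marked / n_items))
    return math.sin((2 * k + 1) * theta) ** 2

# Hypothetical example: 1024 discretized configurations, one optimum.
n = 1024
k = grover_iterations(n)
p = success_probability(n, 1, k)
expected_classical_queries = (n + 1) / 2  # average for exhaustive search
```

For 1024 candidates this gives about 25 oracle calls instead of roughly 512 expected classical evaluations, which is the square-root scaling the abstract refers to.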

Abstract:
Inspired by human workers who perform complicated sewing tasks by repeating relatively simple operations, this paper proposes a fixture-free automated sewing system using a dual-arm manipulator and an ordinary sewing machine to sew two aligned fabrics along the edges, a common task in garment production. The proposed sewing system has a five-layer architecture: perception, dual-arm sewing Petri net, fundamental operations, control primitives, and hardware layers. This architecture decomposes various complex sewing tasks into sequences of fundamental operations. To meet the real-time requirement of automated sewing, a High-speed Fabric Edge Detection System (Hi-FEDS) is further proposed for the perception layer, which formulates the fabric edge detection problem for sewing as a classification problem of predefined distributed anchors. The anchor distribution is modeled by the Gaussian Uniform Mixture Model (GUMM). The proposed method achieves high-speed fabric edge detection at an average of 120 FPS, with an average error of about one pixel. An experimental robotic sewing platform is developed and the sewing results show that the proposed system achieves high-quality sewing across fabrics of various shapes and materials.

Abstract:
Self-supervised monocular depth estimation has achieved notable success under daytime conditions. However, its performance deteriorates markedly at night due to low visibility and varying illumination, e.g., insufficient light causes textureless areas, and moving objects bring blurry regions. To this end, we propose a self-supervised framework named DASP that leverages spatiotemporal priors for nighttime depth estimation. Specifically, DASP consists of an adversarial branch for extracting spatiotemporal priors and a self-supervised branch for learning. In the adversarial branch, we first design an adversarial network where the discriminator is composed of four devised spatiotemporal priors learning blocks (SPLB) to exploit the daytime priors. In particular, the SPLB contains a spatial-based temporal learning module (STLM) that uses orthogonal differencing to extract motion-related variations along the time axis and an axial spatial learning module (ASLM) that adopts local asymmetric convolutions with global axial attention to capture the multiscale structural information. By combining STLM and ASLM, our model can acquire sufficient spatiotemporal features to restore textureless areas and estimate the blurry regions caused by dynamic objects. In the self-supervised branch, we propose a 3D consistency projection loss to bilaterally project the target frame and source frame into a shared 3D space, and calculate the 3D discrepancy between the two projected frames as a loss to optimize the 3D structural consistency and daytime priors. Extensive experiments on the Oxford RobotCar and nuScenes datasets demonstrate that our approach achieves state-of-the-art performance for nighttime depth estimation. Ablation studies further validate the effectiveness of each component.

Abstract:
The dynamics coupling between motion and force subspaces in robotic control poses significant challenges to ensuring force control robustness, particularly under large external disturbances. While actively shaping the system inertia can eliminate this coupling, it introduces additional disturbances due to modeling uncertainties and force sensing errors. Inspired by how humans naturally adjust their elbow postures to facilitate motion force operations, we propose a quadratic programming-based nullspace optimization method that minimizes dynamics coupling for redundant torque-controlled robots. Integrated into an impedance motion force control framework, our approach minimizes an objective function defined by the Frobenius norm of the projection matrix representing inertia coupling in Cartesian space, yielding human-like postures that passively decouple task dynamics. Experimental results demonstrate that the proposed nullspace optimization significantly improves force control stability and tracking performance under conditions of high friction and external disturbances, outperforming conventional motion force control combined with traditional nullspace tracking approaches.

Abstract:
Coaxial bi-copter unmanned aerial vehicles (UAVs) have garnered attention due to their potential for improved rotor system efficiency and compact form factor. However, balancing efficiency, maneuverability, and compactness in coaxial bi-copter systems remains a key design challenge, limiting their practical deployment. This letter introduces COMET, a coaxial bi-copter UAV platform featuring a dual swashplate mechanism. The coaxial bi-copter system's efficiency and compactness are optimized through bench tests, and the whole prototype's efficiency and robustness under varying payload conditions are verified through flight endurance experiments. The maneuverability performance of the system is evaluated in comprehensive trajectory tracking tests. The results indicate that the dual swashplate configuration enhances tracking performance and improves flight efficiency compared to the single swashplate alternative. Successful autonomous flight trials across various scenarios verify COMET's potential for real-world applications.

Abstract:
Suspended multirotor platforms are fascinating systems that can be employed in construction applications to provide safe transportation of heavy loads. Such a system comprising a cable-suspended platform with attached load features seven degrees of freedom (DoF) motion for the whole system. In this paper, we propose a composite whole-body control framework for the stabilization of the suspended multirotor platform system, leveraging singular perturbation theory to exploit the inherent three time-scale dynamics of the system. The control strategy computes the underactuated 3-DoF wrench space generated by the platform's actuation units for the stabilization of the complete system. Building upon this, we develop a superposition-based shared control approach and then compare the two controllers. Moreover, to address specific cases where the time-scale separation between two dynamics of the triple-spherical pendulum becomes negligible, we design an operational space controller. The control approaches are validated using both extensive numerical simulations and experiments in different scenarios. We also carried out numerical robustness and stability analysis of the whole system. Note that our system relies only on onboard sensors for state estimation, which makes it effective for real-life outdoor applications.

Abstract:
We present a bounded-memory receding horizon approach to robot control for complex specifications in dynamic environments. We use Signal Temporal Logic, a logic that quantifies how robustly trajectories satisfy the specification, to specify robot behavior. To handle unbounded specifications, we consider a short planning horizon, only searching for nonviolating trajectories. We identify the subset of Signal Temporal Logic for which this approach needs only a bounded memory of the past, and leverage syntactic separation to summarize the robust satisfaction of the trajectory as it evolves. We implement our approach using receding horizon control in dynamic environments. We demonstrate the effectiveness and scalability of our approach compared to the state-of-the-art approach in several case studies.
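
As a small illustration of how Signal Temporal Logic "quantifies how robustly trajectories satisfy the specification", the sketch below evaluates the standard quantitative semantics on a discrete-time signal: a predicate `x > c` has robustness `x - c`, "always" takes the minimum over its interval, and "eventually" takes the maximum. This shows only the textbook semantics, not the bounded-memory approach of the paper; the helper names and the example signal are hypothetical.

```python
from typing import List, Sequence

def rob_gt(signal: Sequence[float], threshold: float) -> List[float]:
    """Pointwise robustness of the predicate signal[t] > threshold."""
    return [x - threshold for x in signal]

def rob_always(rho: Sequence[float], a: int, b: int, t: int = 0) -> float:
    """Robustness of G_[a,b] phi at time t: worst case over the interval."""
    return min(rho[t + a : t + b + 1])

def rob_eventually(rho: Sequence[float], a: int, b: int, t: int = 0) -> float:
    """Robustness of F_[a,b] phi at time t: best case over the interval."""
    return max(rho[t + a : t + b + 1])

# Hypothetical trajectory: distance to the nearest obstacle at each step.
distance = [2.0, 1.5, 0.8, 1.2, 2.5]

# "Always stay more than 0.5 away" over steps 0..4.
always_safe = rob_always(rob_gt(distance, 0.5), 0, 4)

# "Eventually be more than 2.0 away" over steps 0..4.
eventually_clear = rob_eventually(rob_gt(distance, 2.0), 0, 4)

# "Always stay more than 1.0 away" is violated, so robustness is negative.
always_far = rob_always(rob_gt(distance, 1.0), 0, 4)
```

A positive value means the specification holds with that much margin, and a negative value means it is violated by that much.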

Abstract:
In pinhole assembly processes, precise alignment or compliance mechanisms are typically required. This paper proposes a method for connecting objects by utilizing caging to constrain their motion, enabling the insertion of a pin into a hole to adjust the allowable relative pose for assembly. This approach eliminates the need for force control, even with low-degree-of-freedom manipulators, and reduces deflection caused by misalignment during connection. Although previous research has extensively studied appropriate finger configurations for caging, the behavior of caged objects under external forces remains insufficiently investigated. Furthermore, when connecting caged objects by contact, pose estimation often requires complex collision computations that account for intricate object geometries, which are computationally expensive and may fail to converge when small gaps are present. To address this issue, we propose a geometric method that approximates pose changes of caged three-dimensional objects under external forces as rotations about contact points. As a representative case, we focus on objects composed of cuboid elements. The estimated results for simple objects, including caged objects with small clearances, were consistent with geometrically derived theoretical solutions and were obtained within 0.6 seconds, indicating a practical computation time.

Abstract:
This paper proposes a control framework for a two-section, concentrically tendon-driven continuum robot. The tracking control is formulated based on a Jacobian approach using zeroing dynamics. The Jacobian is estimated online from input-output data, using applied tendon force as inputs and measured tip positions as outputs, without requiring an explicit model. Tendon force is employed as the control input to better account for the deformation under external loads. The proposed framework is experimentally validated on a real-scale continuum robot, with tip-position feedback from a stereo-vision system. Experimental results demonstrate that the robot tip follows the desired trajectory with an RMSE of about 1.18 mm, while maintaining good performance under unknown external loads.

Abstract:
Despite recent advancements, existing prosthetic limbs are unable to replicate the dexterity and intuitive control of the human hand. Current control systems for prosthetic hands are often limited to grasping, and commercial prosthetic hands lack the precision needed for dexterous manipulation or applications that require fine finger motions. Thus, there is a critical need for accessible and replicable prosthetic designs that enable individuals to interact with electronic devices and perform precise finger pressing, such as keyboard typing or piano playing, while preserving current prosthetic capabilities. This paper presents a low-cost, lightweight, 3D-printed robotic prosthetic hand, specifically engineered for enhanced dexterity with electronic devices such as a computer keyboard or piano, as well as general object manipulation. The robotic hand features a mechanism to adjust finger abduction/adduction spacing, a 2-D wrist with the inclusion of controlled ulnar/radial deviation optimized for typing, and control of independent finger pressing. We conducted a study to demonstrate how participants can use the robotic hand to perform keyboard typing and piano playing in real time, with different levels of finger and wrist motion. This supports the notion that our proposed design can allow for the execution of key typing motions more effectively than before, aiming to enhance the functionality of prosthetic hands.

Abstract:
Liquid sampling from the gastrointestinal (GI) tract offers significant diagnostic advantages. This study presents a novel magnetically actuated Robotic Capsule Chain (RCC) for large volume liquid sample collection within the GI tract. The RCC incorporates a wirelessly powered, on-demand motorized sampling mechanism that eliminates the need for onboard batteries or microcontrollers. The system demonstrated reliable operation at distances up to 60 mm from the transmitter coil. Validation experiments confirmed effective sealing of the sampling chamber and successful collection of up to 375 µL of fluids with viscosities comparable to those in the GI tract. Navigation and sampling were further demonstrated in a synthetic bowel model. These findings highlight the potential of robotic capsule chains to enable wireless, minimally invasive diagnostic procedures in the GI tract.

Abstract:
For Minimally Invasive Surgical (MIS) robots, accurate haptic interaction force feedback is essential for ensuring the safety of interacting with soft tissue. However, the majority of existing MIS robotic systems cannot facilitate direct measurement of the interaction force with hardware sensors due to space limitations. This letter introduces an effective vision-based scheme that utilizes a One-Shot structured light projection with a designed pattern on soft tissue coupled with haptic information processing through a trained image-to-force neural network. The images captured from the endoscopic stereo camera are analyzed to reconstruct high-resolution 3D point clouds for soft tissue deformation. The proposed methodology involves a modified PointNet-based force estimation method, which has demonstrated proficiency in accurately representing the intricate mechanical properties of soft tissue. To validate the efficacy of the proposed methodology, numerical force interaction experiments were conducted on three silicone materials with varying stiffness levels. The experimental results substantiate the efficacy of the proposed methodology.

Abstract:
Online scene perception and topology reasoning are critical for autonomous vehicles to understand their driving environment, particularly for mapless driving systems that endeavor to reduce reliance on costly High-Definition (HD) maps. However, recent advances in online scene understanding still face limitations, especially in long-range or occluded scenarios, due to the inherent constraints of onboard sensors. To address this challenge, we propose a Standard-Definition (SD) map Enhanced scene Perception and Topology reasoning (SEPT) framework, which explores how to effectively incorporate the SD map as prior knowledge into existing perception and reasoning pipelines. Specifically, we introduce a novel hybrid feature fusion strategy that combines SD maps with Bird's-Eye-View (BEV) features, considering both rasterized and vectorized representations, while mitigating potential misalignment between SD maps and BEV feature spaces. Additionally, we leverage the SD map characteristics to design an auxiliary intersection-aware keypoint detection task, which further enhances the overall scene understanding performance. Experimental results on the large-scale OpenLane-V2 dataset demonstrate that by effectively integrating SD map priors, our framework significantly improves both scene perception and topology reasoning, outperforming existing methods by a substantial margin.

Abstract:
When interacting with each other, humans adjust their behavior based on perceived trust. To achieve similar adaptability, robots must accurately estimate human trust at sufficiently granular timescales while collaborating with humans. Beta reputation is a popular way to formalize a mathematical estimation of human trust. However, it relies on binary performance, which updates trust estimations only after each task concludes. Additionally, manually crafting a reward function is the usual method of building a performance indicator, which is labor-intensive and time-consuming. These limitations prevent efficient capture of continuous trust changes at more granular timescales throughout the collaboration task. Therefore, this letter presents a new framework for the estimation of human trust using beta reputation at fine-grained timescales. To achieve granularity in beta reputation, we utilize continuous reward values to update trust estimates at each timestep of a task. We construct a continuous reward function using maximum entropy optimization to eliminate the need for the laborious specification of a performance indicator. The proposed framework improves trust estimations by increasing accuracy, eliminating the need to manually craft a reward function, and advancing toward the development of more intelligent robots.
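
The idea of updating a beta reputation at every timestep with a continuous reward can be sketched as follows. This is a minimal illustration under stated assumptions: rewards are taken to be normalized to [0, 1], and each reward is split into a fractional "success" added to alpha and a fractional "failure" added to beta. The paper's exact update rule and its maximum-entropy reward function are not reproduced here, and the class name `BetaTrust` is hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BetaTrust:
    """Trust estimate modeled as the mean of a Beta(alpha, beta) distribution."""
    alpha: float = 1.0  # prior pseudo-count of positive evidence
    beta: float = 1.0   # prior pseudo-count of negative evidence

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def update(self, reward: float) -> float:
        """Update with a continuous reward, assumed to lie in [0, 1].

        A binary beta reputation would add 1 to alpha or to beta once per
        task; here each timestep contributes a fractional amount to both.
        """
        r = min(1.0, max(0.0, reward))
        self.alpha += r
        self.beta += 1.0 - r
        return self.mean

# Hypothetical per-timestep rewards within a single task.
trust = BetaTrust()
history: List[float] = [trust.update(r) for r in [0.9, 0.8, 0.2]]
```

Starting from the uniform prior, the estimate moves to about 0.63, then 0.675, then 0.58, so trust changes within the task rather than only after it concludes.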

Abstract:
Aerial manipulators are advancing beyond traditional inspection roles to enable complex interactions with flexible structures. Applications such as structural health monitoring, and especially agricultural tasks like fruit harvesting or environmental monitoring, require inducing controlled vibrations into flexible elements. However, current solutions for controlled shaking of trees with aerial manipulators are limited to push and pull forces applied through translational movements, without exploiting the full capabilities of aerial platforms. This paper introduces a controlled shaking strategy that enables interaction with trees using both linear movements generated by forces (translation strategy) and rotational movements generated by torques (rotation strategy), thus exploiting the different interaction capabilities of the platform. These two strategies open a previously unexplored question: which strategy is more effective given a specific interaction point? To address this, the two interaction strategies are integrated with the Rayleigh-Ritz model of the tree, obtaining the closed-loop dynamics of the system during the vibration. These closed-loop dynamics are then analyzed for the two shaking strategies, deriving which one is better for achieving higher oscillation amplitudes or frequencies. This analysis shows that, for a given interaction point of the tree trunk, this decision depends only on the platform's physical characteristics, such as mass and inertia. Finally, the theoretical analysis is experimentally validated with a hand-made bamboo tree and a fully-actuated platform through indoor flights.

Abstract:
Dexterous in-hand manipulation remains a long-standing challenge in robotics, primarily due to the complex contact dynamics and partial observability. While humans synergize vision and touch for such tasks, robotic approaches often prioritize one modality, therefore limiting adaptability. This paper introduces Flow Before Imitation (FBI), a visuotactile imitation learning framework that dynamically fuses tactile interactions with visual observations through motion dynamics. Unlike prior static fusion methods, FBI establishes a causal link between tactile signals and object motion via a dynamics-aware latent model. FBI employs a transformer-based interaction module to fuse flow-derived tactile features with visual inputs, training a one-step diffusion policy for real-time execution. Extensive experiments demonstrate that the proposed method outperforms the baseline methods in both simulation and the real world on two customized in-hand manipulation tasks and three standard dexterous manipulation tasks. Code, models, and more results are available on the website https://sites.google.com/view/dex-fbi.

Abstract:
This paper presents a methodology to predict metric depth from monocular RGB images and an inertial measurement unit (IMU). To enable collision avoidance during autonomous flight, prior works either leverage heavy sensors (e.g., LiDARs or stereo cameras) or data-intensive and domain-specific fine-tuning of monocular metric depth estimation methods. In contrast, we propose several lightweight zero-shot rescaling strategies to obtain metric depth from relative depth estimates via the sparse 3D feature map created using a visual-inertial navigation system. These strategies are compared for their accuracy in diverse simulation environments. The best-performing approach, which leverages monotonic spline fitting, is deployed in the real world on a compute-constrained quadrotor. We obtain on-board metric depth estimates at 15 Hz and demonstrate successful collision avoidance after integrating the proposed method with a motion primitives-based planner.

Abstract:
Reliable monitoring of excavator operations in real-world environments requires accurate excavation counting to ensure productivity, efficient computation for real-time inference, and cost-effective on-board sensing, a combination that most prior systems fail to achieve. We present EXOM (EXcavator Operation Monitoring), a lightweight and deployable framework that relies solely on a factory-installed cabin camera and built-in hydraulic sensors. EXOM integrates two embedded-friendly modules: a Video data Processing Module (VPM), where an ECSE algorithm leverages bucket detection to estimate excavation sections and counts from state transitions, and a Sensor data Processing Module (SPM), where an Adaptive Window (AW) process sparsifies time-series signals and drives a segmentation model through a learnable sparse tensor. To capture deployability, we introduce EXOM-I, a unified index that combines section-level F1 and normalized excavation counting accuracy. Experiments with real-world data demonstrate that EXOM consistently outperforms previous approaches, achieving state-of-the-art performance with real-time latency on resource-limited embedded excavator hardware.

Abstract:
Assessing human muscle fatigue is critical for optimizing performance in physical human-robot interaction (pHRI) tasks and mitigating safety risks for the human operator. This paper presents a data-driven framework for estimating muscle fatigue in dynamic pHRI tasks using surface electromyography (sEMG) sensors attached to the human arm. Subject-specific machine learning (ML) regression models were developed to estimate fatigue during cyclic (i.e., repetitive) pHRI tasks. Machine learning models were trained to estimate the fraction of cycles to fatigue (FCF) using EMG features. Their performance was compared with a CNN model that processes spectrogram representations of EMG signals. Unlike most earlier data-driven approaches that primarily formulated fatigue estimation as a classification problem, our method models the continuous progression of fatigue through regression, enabling tracking of gradual physiological changes rather than discrete states, which is critical for timely intervention and adaptive control in dynamic pHRI tasks. Experiments were conducted with ten participants who interacted with a collaborative robot operated under an admittance controller, performing lateral (left-right) cyclic movements of the end effector until the onset of muscular fatigue. The results demonstrate that the root mean square error (RMSE) of FCF estimation across participants was 20.8 ± 4.3%, 23.3 ± 3.8%, 24.8 ± 4.5%, and 26.9 ± 6.1% for the CNN, Random Forest, XGBoost, and Linear Regression models, respectively. To examine cross-task generalization, additional experiments were performed with one participant who executed vertical and circular repetitive movements. Models trained solely on the lateral-movement data were directly tested on these unseen tasks. The results indicate that the proposed models are robust to variations in movement direction, arm kinematics, and muscle recruitment patterns, while the Linear Regression model performed poorly.

Abstract:
Visual-inertial fusion is crucial for a large number of intelligent and autonomous applications, such as robot navigation and augmented reality. To bootstrap and achieve optimal state estimation, the spatial-temporal displacements between IMU and cameras must be calibrated in advance. Most existing calibration methods adopt a continuous-time state representation, more specifically the B-spline. Although these methods achieve precise spatial-temporal calibration, they suffer from the high computational cost of the continuous-time state representation. To this end, we propose a novel and extremely efficient calibration method that unleashes the power of discrete-time state representation. Moreover, the weakness of discrete-time state representation in temporal calibration is tackled in this paper. With the increasing production of drones, cellphones and other visual-inertial platforms, if one million devices need calibration around the world, saving one minute for the calibration of each device means saving 2083 work days in total. To benefit both the research and industry communities, the open-source implementation is released at https://github.com/JunlinSong/DT-VI-Calib.
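
The 2083-work-day figure can be reproduced with a short calculation. The 8-hour work day is an assumption here; the abstract does not state the length of a work day, but that value matches the quoted number.

```python
devices = 1_000_000
minutes_saved_per_device = 1
hours_per_work_day = 8  # assumption: not stated in the abstract

total_hours_saved = devices * minutes_saved_per_device / 60
work_days_saved = total_hours_saved / hours_per_work_day

print(f"{work_days_saved:.1f} work days saved")
```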

Abstract:
Autonomously controlling and handling a vehicle at and beyond its stability limit is a mathematically and computationally demanding task. Prior demonstrations of automated drifting have been limited to research platforms with instantaneous torque delivery and independently actuated wheels, leaving their applicability to production vehicles with actuator latencies and mechanically coupled axles uncertain. To overcome these issues, we design a predictor to compensate for powertrain delays, develop a revised control formulation to accommodate higher actuation latencies as well as a differential coupling, and introduce brake-based velocity stabilization. This paper presents the controller framework, the model extensions, and real-world experimental results. We observe that our controller enables a production sports car with a combustion engine to robustly sustain circular and figure-eight drifts, limiting lateral error to 1.1 m and sideslip overshoot to 0.06 rad despite actuator delays exceeding 250 ms, while mitigating oscillations and maintaining stable path and sideslip tracking. In conclusion, our results establish that autonomous drifting is feasible on production-ready vehicles, opening pathways to advanced safety systems capable of stabilizing cars in scenarios where traditional control fails.

Abstract:
Accurate pedestrian navigation on edge devices is a critical problem. While artificial neural networks (ANNs) have been shown to solve this problem with acceptable accuracy, their energy consumption limits applications on low-power computation platforms. Spiking neural networks (SNNs) are a promising alternative, but their applicability to noisy, high-frequency IMU data is hindered by two key issues: information loss during spike encoding and simplistic neuron dynamics that fail to capture complex motion. This paper introduces Spike-IMU, an SNN-based velocity estimation network designed to overcome these issues for the pedestrian navigation problem. In particular, a dynamic spiking neuron (DSN) is introduced based on the integer firing mechanism. In addition, a temporal feature fusion spike encoder (TFFSE) and a dynamic spiking long short-term memory network (DSLSTM) are proposed to encode and process IMU data into spike sequences. Our experiments on the RoNIN dataset show that Spike-IMU surpasses classical ANNs, reducing positioning error by 20% while consuming 70.3% less energy. This work demonstrates a novel pipeline for designing SNNs that achieve both superior accuracy and energy efficiency, pushing applications of IMU-based pedestrian navigation to real-world low-power edge devices.

Abstract:
Cardiopulmonary resuscitation (CPR) is a critical life-saving procedure, and effective training benefits from self-directed practice beyond instructor-led sessions. In this paper, we propose a closed-loop CPR training glove that integrates a high-resolution tactile sensing array and vibrotactile actuators for self-directed practice. The tactile sensing array measures distributed pressures across the palm and dorsum to enable real-time estimation of compression rate, force, and hand pose. Based on these estimates, the glove delivers immediate haptic feedback to guide the user toward proper CPR, reducing reliance on external audio-visual displays. We quantified the tactile sensor performance by measuring wide-range sensitivity (�?.85 over 0-600 N), computing hysteresis (56.04%), testing stability (11.05% drift over 300 cycles), and estimating global signal-to-noise ratio (18.90 ± 2.41 dB at 600 N). Our closed-loop pipeline provides continuous modeling and feedback of key performance metrics essential for high-quality CPR. Our lightweight statistical models achieve >92% accuracy for force estimation and hand pose classification within sub-millisecond inference time. Our user study (N=8) showed that haptic feedback reduced visual distraction compared to audio-visual cues, though simplified patterns were required for reliable perception under dynamic load. These results highlight the feasibility of the proposed system and offer design insights for future haptic CPR self-training systems.

Abstract:
Motion planning in unstructured environments remains a challenging task, particularly in scenarios with dense obstacles and discontinuous free space, due to the need to ensure both safety and real-time performance for robots. To address these challenges, this paper proposes a multimodal fusion-guided diffusion policy framework, abbreviated as M-DP, synergistically guided by images, LiDAR points, and goal targets. A multimodal early-fusion mechanism is designed to combine visual and LiDAR data, leveraging the complementary nature of sensor observations to enhance obstacle perception. The fused feature vectors are utilized to guide the diffusion policy to generate multiple trajectories, and the Denoising Diffusion Implicit Model (DDIM) is employed for inference to improve real-time performance. Semantic and geometric constraints are incorporated to determine the optimal trajectory, enabling the selection of collision-free paths that balance safety, goal reaching, and bumpiness. Additionally, dynamic constraints are introduced to ensure the safety of robots in rugged and obstacle-dense environments. Real-world experimental evaluations demonstrate the safety and effectiveness of the framework compared to baseline methods, with ablation studies validating the contributions of key components. Code and our self-collected dataset are available at https://github.com/xhy1599/M-DP.

Abstract:
Visual challenges in underwater environments significantly hinder the accuracy of vision-based localisation and high-fidelity dense reconstruction. In this paper, we propose VISO, a robust underwater SLAM system that fuses a stereo camera, an inertial measurement unit (IMU), and a 3D sonar to achieve accurate 6-DoF localisation and enable efficient dense 3D reconstruction with high photometric fidelity. We introduce a coarse-to-fine online calibration approach for estimating the extrinsic parameters between the 3D sonar and the camera. Additionally, a photometric rendering strategy is proposed for the 3D sonar point cloud to enrich the sonar map with visual information. Extensive experiments in a laboratory tank and an open lake demonstrate that VISO surpasses current state-of-the-art underwater and visual-based SLAM algorithms in terms of localisation robustness and accuracy, while also exhibiting real-time dense 3D reconstruction performance comparable to the offline dense mapping method.

Abstract:
Deploying autonomous agents in the real world is complicated, especially when it comes to navigation, where systems must adapt to situations they haven't encountered before. Traditional learning approaches require a substantial amount of data, constant tweaking, and sometimes starting over for every new task. That makes them hard to scale and not very flexible. Recent breakthroughs in foundation models, such as large language models and vision-language models, enable systems to attempt new navigation tasks without requiring additional training. However, many of these methods only work with specific types of inputs, employ relatively basic reasoning, and fail to fully utilize the details they observe or the structure of the spaces. Here, we introduce T2-Nav, a zero-shot navigation system that combines various types of data and employs graph-based reasoning. By integrating visual information directly into the graph and matching it to the environment, our approach enables the system to find a good balance between exploration and reaching its goal. This strategy allows robust obstacle avoidance, reliable loop closure detection, and efficient path planning while eliminating redundant exploration patterns. The system demonstrates flexibility by handling goals specified through reference images of target object instances, making it particularly suitable for real-world deployment scenarios where agents must navigate to visually similar but spatially distinct instances. Experiments demonstrate that our approach works efficiently and adapts well in complex, unfamiliar settings, moving toward practical zero-shot instance-image navigation capabilities.

Abstract:
Reinforcement learning has enabled significant progress in complex domains such as coordinating and navigating multiple quadrotors. However, even well-trained policies remain vulnerable to collisions in obstacle-rich environments. Addressing these infrequent but critical safety failures through retraining or fine-tuning is costly and risks degrading previously learned skills. Inspired by activation steering in large language models and latent editing in computer vision, we introduce a framework for inference-time Latent Activation Editing (LAE) that refines the behavior of pre-trained policies without modifying their weights or architecture. The framework operates in two stages: (i) an online classifier monitors intermediate activations to detect states associated with undesired behaviors, and (ii) an activation editing module selectively modifies flagged activations to shift the policy towards safer regimes. In this work, we focus on improving safety in multi-quadrotor navigation. We hypothesize that amplifying a policy's internal perception of risk can induce safer behaviors. We instantiate this idea through a latent collision world model trained to predict future pre-collision activations, thereby prompting earlier and more cautious avoidance responses. Extensive simulations and real-world Crazyflie experiments demonstrate that LAE achieves a statistically significant reduction in collisions (nearly 90% fewer cumulative collisions compared to the unedited baseline) and substantially increases the fraction of collision-free trajectories, while preserving task completion. More broadly, our results establish LAE as a lightweight paradigm, feasible on resource-constrained hardware, for post-deployment refinement of learned robot policies. Our project page with videos and code is available at https://lae-robotics.github.io/.

Affiliations: The Chinese University of Hong Kong; Sun Yat-sen University; Alibaba DAMO Academy; University of Strasbourg; Monash University; Centre for Artificial Intelligence and Robotics (CAIR) Hong Kong Institute of Science & Innovation Chinese Academy of Sciences; Shanghai AI Laboratory; Institute of Automation, Chinese Academy of Sciences; Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences; TU Munich; Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS)

Abstract:
Surgical scene understanding is critical for surgical training and robotic decision-making in robot-assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for advancing scene perception in the medical domain, helping surgeons understand surgical scenes and procedures. However, these methods are primarily oriented towards image-based analysis or global video understanding, overlooking the fine-grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train SurgVidLM, we construct SVU-31K, a large-scale dataset with over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Building on this resource, SurgVidLM incorporates a two-stage StageFocus mechanism: the first stage extracts global procedural context, while the second stage performs high-frequency local analysis guided by temporal cues. We also develop the Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the preservation of critical task-specific details. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing the context of complex robot-assisted surgeries. Our code and dataset will be publicly accessible soon.

Abstract:
Deep reinforcement learning has emerged as the dominant paradigm for training legged robots to locomote. However, when deployed in unstructured, dynamically varying real-world environments, the safety of neural-network-based controllers remains insufficiently guaranteed. Prior studies have demonstrated that sequential adversarial attacks, formulated via reinforcement learning, can effectively expose latent vulnerabilities in controllers and thus serve as a valuable complement to Domain Randomization techniques. These methods, however, are inherently constrained by the assumption that both the adversary and the locomotion policy share identical state space inputs. In contrast, our approach overcomes this limitation by incorporating privileged information into the adversarial network's observation input, thereby more than doubling the attack success rate. Furthermore, we mitigate the controller's tendency toward overly conservative behavior under attacks by introducing stochastic termination criteria. We validate the proposed method in real-world deployments, showing that it not only significantly enhances robustness but also preserves original task performance.

Abstract:
Hybrid leg-wheel robots offer exceptional mobility, but their complex mechanics and extended contact surfaces challenge modern control frameworks that rely on simple point-foot models. Accurately estimating both the contact state and the precise contact location using only proprioceptive sensors is a critical and unresolved problem for these platforms. To address this, we present a complete, proprioception-only framework that provides both contact state and contact point information for this class of robot. The framework is executed on a computationally efficient and simplified dynamic model of the complex 11-bar leg mechanism as an example, which enables a discrete-time Generalized Momentum Observer (GMO) to accurately estimate external wrenches. An optimization-based algorithm then precisely localizes the contact point by finding the location along the rim that best explains the full-body dynamics. The framework's performance was validated in high-fidelity simulations across diverse gaits. For contact state validation, the detector demonstrates over 97% single-leg accuracy during a dynamic 0.4 m/s trot. For contact point validation, the localization stage confirms accurate estimation throughout the stance phase, with an RMS error of 0.0173 m. Our work provides the essential contact information required to enable advanced model-based control for these challenging platforms.

Abstract:
We build a low-level reflex control layer driven by fast tactile feedback for multifinger grasp stabilization. Our hybrid approach combines learned tactile slip detection with model-based internal-force control to halt in-hand slip while preserving the object-level wrench. The multimodal tactile stack integrates piezoelectric sensing (PzE) for fast slip cues and piezoresistive arrays (PzR) for contact localization, enabling online construction of a contact-centric grasp representation without prior object knowledge. Experiments demonstrate reactive stabilization of multifingered grasps under external perturbations, without explicit friction models or direct force sensing. In controlled trials, slip onset is detected after 20.4 ± 6 ms. The framework yields a theoretical grasp response latency on the order of 30 ms, with grasp-model updates in less than 5 ms and internal-force selection in about 4 ms. The analysis supports the feasibility of sub-50 ms tactile-driven grasp responses, aligned with human reflex baselines.
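
The timing figures quoted above can be combined into a rough latency budget. The sketch below assumes the three stages run sequentially and that their durations simply add; the abstract does not state how the "on the order of 30 ms" figure was derived, so this composition is an assumption, and the variable names are hypothetical.

```python
# Stage durations in milliseconds, as quoted in the abstract.
slip_detection_mean_ms = 20.4
slip_detection_spread_ms = 6.0      # the "± 6 ms" term
grasp_model_update_ms = 5.0         # upper bound ("less than 5 ms")
internal_force_selection_ms = 4.0   # "about 4 ms"

# Assumption: the stages are sequential, so their latencies add.
downstream_ms = grasp_model_update_ms + internal_force_selection_ms
nominal_total_ms = slip_detection_mean_ms + downstream_ms
late_detection_total_ms = nominal_total_ms + slip_detection_spread_ms

print(f"nominal: {nominal_total_ms:.1f} ms")
print(f"late detection: {late_detection_total_ms:.1f} ms")
```

Under these assumptions the nominal total is about 29.4 ms, and even a detection that arrives one spread later stays near 35.4 ms, which is consistent with the sub-50 ms claim.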

Abstract:
Accurate lane topology perception is crucial for safe autonomous driving, yet vision-based models such as BEVFormer and TopoNet degrade under heavy occlusion and other visibility degradations (e.g., ambiguous road markings). Existing approaches augment vision with global priors like Standard Definition (SD) maps, but these rely on precise GNSS localization and global alignment, which can be unreliable in urban canyons, tunnels, or GNSS-denied areas. Fiducial markers provide a complementary alternative: compact infrastructure-embedded tags that encode structurally complete local lane graphs, mitigating blind spots in topology reasoning where visual pipelines fail. However, marker detections are not always reliable: pose estimates may degrade with distance, and detections may be intermittent under occlusion. To address these challenges, we propose a Confidence-Gated Marker Fusion framework that integrates marker-derived priors into BEV features through a dynamic gating mechanism, regulating the contribution of noisy long-range inputs. In addition, we introduce a temporal marker memory that caches and decays reliable priors across frames, propagating topology guidance during short-term detection gaps. Evaluated on a marker-augmented OpenLane-V2 benchmark, our method outperforms both vision-only and SD map-augmented baselines, achieving notable gains (27%) in lane graph completeness and occlusion robustness. These results demonstrate that fiducial marker priors, when fused with vision-based reasoning, provide a practical and reliable pathway toward resilient lane topology prediction in GNSS-denied urban scenarios.

Abstract:
Robotic vehicles (RVs) have been increasingly deployed in critical missions. Yet, RV control software is prone to logic bugs that cause unexpected physical behaviors, deviating from the developers' intentions. For instance, the Hakuto-R Mission 1 lunar lander crashed on the lunar surface due to a misinterpretation of sensor data. To discover such bugs, developers leverage bug-finding tools, from formal methods to fuzzing. To use these tools, human experts first need to manually create formal specifications (e.g., temporal logic) as bug oracles. Yet, such manual efforts are time-consuming and error-prone. Previous efforts to automatically generate such specifications merely translate natural-language documentation into formal specifications. In turn, they overlook the cyber-physical interplay inherent in RVs, which is often absent from the documentation, e.g., altitude changes caused by air pressure and servo lag. To tackle this limitation, we introduce RVSpec, an automatic specification generation framework. It first constructs a cyber-physical interplay graph (CPG), which quantifies how much internal factors (control software-dependent and hardware-specific properties intrinsic to an RV) and external factors (environmental conditions) influence the RV's physical states. Then, RVSpec uses the CPG to guide large language model agents, enabling the generation of cyber-physical interplay aware formal specifications. We evaluated RVSpec on four popular RV control software packages, including ArduPilot and PX4 for aerial vehicles, openpilot for autonomous vehicles, and cFS for spacecraft. The evaluation showed that specifications created by RVSpec achieved an accuracy of 80.7%, whereas the baseline ones attained 51.6%. When applying the specifications for fuzzing, those generated by RVSpec reduced the number of false positives from 4,790 (baseline) to 964 (79.9% reduction) while preserving the bug-finding capability.
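
The reported false-positive reduction follows directly from the two counts in the abstract. The short check below uses only those two numbers; the variable names are illustrative.

```python
baseline_false_positives = 4790
rvspec_false_positives = 964

removed = baseline_false_positives - rvspec_false_positives
reduction_percent = 100.0 * removed / baseline_false_positives

print(f"{removed} false positives removed ({reduction_percent:.1f}% reduction)")
```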

Abstract:
Direct physical guidance is a natural means of teaching and interacting with robots, and robotic skins make a key contribution by enabling sensitive contact sensing and localization. This paper presents a tactile-proprioceptive sensor fusion framework for natural physical human-robot interaction. Tactile cues from pneumatic skin pads serve as contact indicators that bypass the ambiguity between frictional residues and applied external forces, enabling highly sensitive contact detection without explicit friction identification. We fuse these cues with motor-current-based proprioception to reconstruct multi-axis contact forces on the robot surface. To maintain accuracy during motion, we employ a temporal convolutional network (TCN) to mitigate friction hysteresis during stick-slip transitions, reducing uncertainty at contact onset and yielding smooth, responsive guidance. We validate the approach on a skin-integrated robot arm: (i) multi-axis forces are reconstructed in stationary contacts, and (ii) simultaneous force estimation and kinesthetic teaching are demonstrated. Results indicate improved sensitivity and responsiveness across diverse contact conditions compared with tactile-only and proprioceptive-only baselines, supporting tactile-proprioceptive fusion as a reliable pathway to safe, intuitive physical human-robot interaction.

Abstract:
Collision avoidance and navigation in dynamic and dense environments remain highly challenging for swarm robotics. To address this, we propose SwarmNav, a novel goal-region amplification navigation policy that leverages LiDAR-based position data to generate velocity commands guiding robots toward their goals while actively avoiding obstacles. SwarmNav is trained within a deep reinforcement learning actor-critic framework. In this framework, the reward function integrates a goal-region amplification term with the reciprocal velocity obstacles formulation, enabling goal-directed navigation under dynamic obstacle uncertainty. Extensive simulations demonstrate that SwarmNav significantly outperforms state-of-the-art approaches, including both reinforcement learning-based and traditional velocity obstacle-based methods, in terms of success rate and computational efficiency. Real-world experiments across diverse scenarios further confirm its effectiveness in dynamic and dense environments.

Abstract:
In this paper, we study the problem of methodically obtaining a sufficient set of kinesthetic demonstrations, one at a time, such that a robot can be confident of its ability to perform a complex manipulation task in a given region of its workspace. Although programming by demonstration has been an active area of research, the problems of checking whether a set of demonstrations is sufficient and systematically seeking additional demonstrations have remained open. We present an approach for the robot to incrementally and actively ask for new demonstration examples, one at a time, until the robot can assess with high confidence that it can perform the task successfully. Our approach uses (i) a screw geometric representation of motion to generate manipulation plans from demonstrations, which makes the sufficiency of a set of demonstrations measurable; (ii) a sampling strategy based on PAC-learning from multi-armed bandit optimization to evaluate the robot's ability to generate manipulation plans in a subregion of its task space; and (iii) a heuristic to seek additional demonstrations from areas of weakness. We present results of a user study conducted with 22 participants (without any background in robotics) on two example manipulation tasks, namely pouring and scooping, to assess the utility and usability of our approach. The results show that a handful of examples (fewer than 10) were needed to successfully teach the robot to plan tasks. A video supplement is available on YouTube: https://youtu.be/ncsb_m6CCNY

Abstract:
Ensuring safe real-time control of ship-mounted cranes in unstructured transportation environments requires handling multiple safety constraints while maintaining effective payload transfer performance. Unlike traditional crane systems, ship-mounted cranes are consistently subjected to significant external disturbances affecting underactuated crane dynamics due to the ship's dynamic motion response to harsh sea conditions, which can lead to robustness issues. To tackle these challenges, we propose a robust and safe model predictive control (MPC) framework and demonstrate it on a 5-DOF crane system, where a Stewart platform simulates the external disturbances that ocean surface motions would have on the supporting ship. The crane payload transfer operation must avoid obstacles and accurately place the payload within a designated target area. We use a robust zero-order control barrier function (R-ZOCBF)-based safety constraint in the nonlinear MPC to ensure safe payload positioning, while time-varying bounding boxes are utilized for collision avoidance. We introduce a new optimization-based online robustness parameter adaptation scheme to reduce the conservativeness of R-ZOCBFs. Experimental trials on a crane prototype demonstrate the overall performance of our safe control approach under significant perturbing motions of the crane base. While our focus is on crane-facilitated transfer, the methods more generally apply to safe robotically-assisted parts mating and parts insertion.

Abstract:
Efficient 3D LiDAR point cloud compression (LPCC) and streaming are critical for edge server-assisted robotic systems, enabling real-time communication with compact data representations. A widely adopted approach represents LiDAR point clouds as range images, enabling the direct use of mature image and video compression codecs. However, because these codecs are designed with human visual perception in mind, they often compromise geometric details, which degrades the performance of downstream robotic tasks such as mapping and object detection. Furthermore, rate-distortion optimization (RDO)-based rate control remains largely underexplored for range image compression (RIC) under dynamic bandwidth conditions. To address these limitations, we propose D-Compress, a new detail-preserving and fast RIC framework tailored for real-time streaming. D-Compress integrates both intra- and inter-frame prediction with an adaptive discrete wavelet transform approach for precise residual compression. Additionally, we introduce a new RDO-based rate control algorithm for RIC through new rate-distortion modeling. Extensive evaluations on various datasets demonstrate the superiority of D-Compress, which outperforms state-of-the-art (SOTA) compression methods in both geometric accuracy and downstream task performance, particularly at compression ratios exceeding 100x, while maintaining real-time execution on resource-constrained hardware. Moreover, evaluations under dynamic bandwidth conditions validate the robustness of its rate control mechanism.

Abstract:
Adaptive Compressive Sensing (ACS) has attracted increasing attention for its ability to progressively improve image reconstruction quality by dynamically adjusting sampling allocation. Multi-stage sampling is a promising strategy that leverages intermediate reconstructions to guide sampling without relying on image prior information. However, existing multi-stage methods often struggle to capture global structural information, resulting in biased sampling and suboptimal performance. Furthermore, the strong dependency between intermediate reconstruction for sampling guidance and the final reconstruction can hinder targeted optimization. To address these issues, we propose LGDR-Net, a Lightweight Guidance Sampling and Deep Refinement Reconstruction Network. Specifically, the Gradient-Fused Cross-Attention (GFCA) module, embedded within a lightweight guidance network, leverages globally fused information to compensate for incomplete content during multi-stage sampling. Then, sampling resource allocation is driven by inter-stage reconstruction differences, effectively exploiting image sparsity information. Finally, the Deep Refinement Network incorporates a Decoder Dense Feedback Mechanism (DDFM) to reduce cross-layer structural bias and a Multi-Branch Attention Fusion (MBAF) module for improved fine-texture representation. Extensive experiments demonstrate that our proposed LGDR-Net outperforms state-of-the-art methods, achieving an excellent trade-off between computational cost and reconstruction quality.

Abstract:
This paper addresses the Pitch Variation Problem in Two-Wheeled Self-Balancing (TWSB) robots that use 2D LiDAR for Simultaneous Localization And Mapping (SLAM). The issue arises from sudden accelerations or decelerations, leading to abrupt pitch variations that cause the 2D LiDAR to capture data from unintended surfaces, such as the ground or ceiling, destabilizing the robot's position estimation. To mitigate this, we propose a novel preprocessing method that efficiently removes point clusters affected by pitch variation by leveraging their distinct characteristics, without the need for an alignment process. Experimental results demonstrate that our method reduces errors by at least 24.31% across various scan matching algorithms. Furthermore, as the proposed method operates independently of SLAM, it can be seamlessly integrated into a wide range of systems and has been shown to substantially enhance SLAM performance when used alongside existing algorithms.
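
The geometry behind the Pitch Variation Problem can be sketched in a few lines. This illustrates the problem only, not the proposed preprocessing method, which the abstract does not detail. It assumes flat ground, a scan plane that is horizontal when the robot is upright, and pure pitch about the lateral axis; the function name, sensor height, and pitch angle are hypothetical.

```python
import math


def ground_hit_range(sensor_height_m: float,
                     pitch_down_rad: float,
                     azimuth_rad: float) -> float:
    """Range at which a beam of a forward-pitched 2D LiDAR meets flat ground.

    azimuth_rad is measured in the scan plane from the robot's forward axis.
    Returns math.inf when the beam does not descend toward the ground.
    """
    # Height lost per meter travelled along the beam after pitching the
    # scan plane down by pitch_down_rad about the lateral axis.
    descent_per_meter = math.sin(pitch_down_rad) * math.cos(azimuth_rad)
    if descent_per_meter <= 0.0:
        return math.inf
    return sensor_height_m / descent_per_meter


# Hypothetical numbers: LiDAR mounted 0.3 m above the floor, 5 degree pitch.
forward_range = ground_hit_range(0.3, math.radians(5.0), 0.0)
print(f"forward beam hits the floor at {forward_range:.2f} m")
```

With these hypothetical values the forward beam strikes the floor at roughly 3.4 m, so a brief pitch transient can place spurious returns well inside a typical indoor mapping range.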

Abstract:
Aerial transportation using quadrotors with cable-suspended payloads holds great potential for applications in disaster response, logistics, and infrastructure maintenance. However, their hybrid and underactuated dynamics pose significant control and perception challenges. Traditional approaches often assume a taut cable condition, limiting their effectiveness in real-world applications where slack-to-taut transitions occur due to disturbances. We introduce ES-HPC-MPC, a model predictive control framework that enforces exponential stability and perception-constrained control under hybrid dynamics. Our method leverages Exponentially Stabilizing Control Lyapunov Functions (ES-CLFs) to enforce stability during the tasks and Control Barrier Functions (CBFs) to maintain the payload within the onboard camera's field of view (FoV). We validate our method through both simulation and real-world experiments, demonstrating stable trajectory tracking and reliable payload perception. We further show that our method maintains stability and satisfies perception constraints while tracking dynamically infeasible trajectories and when the system is subjected to hybrid mode transitions caused by unexpected disturbances.

Abstract:
This paper presents a new approach for 6DoF Direct LiDAR-Inertial Odometry (D-LIO) based on the simultaneous mapping of truncated distance fields on CPU. Such a continuous representation (in the vicinity of the points) enables working with raw 3D LiDAR data online, avoiding the need for LiDAR feature selection and tracking, simplifying the odometry pipeline and easily generalizing to many scenarios. The method is based on the proposed Fast Truncated Distance Field (Fast-TDF) method as a convenient tool to represent the environment, employing binary masks that encode the L1 distance. Such a representation enables i) solving the LiDAR point-cloud registration as a nonlinear optimization process without the need for selecting or tracking LiDAR features in the input data, ii) simultaneously producing an accurate truncated distance field map of the environment, and iii) updating such a map in constant time independently of its size. The approach is tested using open datasets, aerial and ground. It is also benchmarked against other state-of-the-art odometry approaches, demonstrating the same or better level of accuracy with the added value of an online-generated TDF representation of the environment, which can be used for other robotics tasks such as planning or collision avoidance. The source code is publicly available at https://github.com/robotics-upo/D-LIO.git
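
To make the notion of a truncated distance field under the L1 metric concrete, the sketch below builds one by brute force on a small 2D grid. It is not the Fast-TDF method: the binary-mask encoding and the constant-time update are not reproduced, and the grid layout and function name are assumptions made for illustration.

```python
def truncated_l1_field(width: int, height: int,
                       occupied: list[tuple[int, int]],
                       truncation: int) -> list[list[int]]:
    """Brute-force truncated L1 (Manhattan) distance field on a 2D grid.

    Each cell stores its L1 distance to the nearest occupied cell, clamped
    to `truncation`, so cells far from any surface all hold the same value.
    """
    field = [[truncation] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for ox, oy in occupied:
                distance = abs(x - ox) + abs(y - oy)
                if distance < field[y][x]:
                    field[y][x] = distance
    return field


# One occupied cell at the left end of a 5x1 strip, truncated at distance 3.
strip = truncated_l1_field(5, 1, [(0, 0)], truncation=3)
print(strip)
```

The field grows with distance from the occupied cell and saturates at the truncation value, which is why such a map only carries information near measured surfaces.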

Abstract:
Quadruped robots have received increasing attention in recent years. Most existing trajectory planning algorithms for quadruped robots focus on avoiding obstacles and achieving the shortest trajectory or time, which is similar to the planning algorithms for mobile robots. These algorithms cannot take full advantage of the agility and flexibility of quadruped robots. This letter designs a trajectory planner that exploits this agility and flexibility. With our trajectories, quadruped robots can navigate through complex terrains with more stability (e.g., smaller momentum variations along the Z-axis). To achieve this goal, we use ground features at the landing points of the foot ends to construct the objective function, rather than using the center point of the robot body. Current discrete map representations, such as grid maps or cost maps, make it difficult for optimization algorithms to incorporate environment constraints. We therefore use the Sparse Variational Gaussian Process (SVGP) to predict terrain features with point-cloud data as input, so that the environment constraints can be introduced into the optimization problem. Experimental results in both simulation and real-world environments demonstrate the effectiveness of our method.

Abstract:
Existing polar robots are constrained by limited energy supply, making it difficult to carry out long-term scientific exploration missions, which highlights an urgent demand for energy conservation. An energy-efficient multi-mode motion polar robot is proposed to address this challenge. Both increasing external assistance and reducing the driving force are critical for lowering energy consumption. A foldable sail is designed to provide external assistance. When unfolded, the sail generates assistive force. When folded, it maintains stability in extreme polar climates. The sail shape is designed based on a symmetrically extended NACA0018 airfoil, and the influence of different sail parameters on performance is discussed. The transformable tracks realize switching between traction and sliding modes through the separation of the track and teeth chain, using the sliding mode to reduce driving force. The effect of teeth parameter variations on traction performance is analyzed. The system kinematics and dynamics are modeled, and stability conditions are determined. Based on this, an energy-saving motion control framework for multi-mode motion is proposed. Finally, experiments are conducted to evaluate the energy-saving contribution of each independent mode under different configurations. Comprehensive experiments in multi-mode motion demonstrate an overall energy-saving rate of approximately 24%, verifying the effectiveness of the energy-saving motion control strategy. With its energy-saving advantages, this robot shows strong potential for enabling long-term scientific exploration in polar regions.

Abstract:
Enabling robots to walk on yielding terrain is vital for applications ranging from disaster response to planetary exploration. While bipedal robots hold immense potential, their locomotion on deformable surfaces remains limited as current simulators fail to capture the spatiotemporal heterogeneity of such yielding substrates. We present MILD, featuring a physics-grounded discrete-element contact solver that accurately simulates spatially varying foot-terrain interactions. Complementing this model, we train a terrain-aware locomotion controller via deep reinforcement learning with latent modulation and proprioceptive estimation. Quantitative comparisons against state-of-the-art methods show our approach generates more diverse and realistic contact scenarios during training, resulting in controllers that exhibit natural adaptation on real deformable surfaces. Through hardware experiments, we demonstrate the system's capability for online terrain identification and adaptation across a wide range of surface stiffness.

Abstract:
Liquid crystal elastomer (LCE) is a promising material for developing thermally-driven soft actuators due to its high force density, large elastic strain limit, and mechanically programmable nature. However, the complex trade-off between the force generated and the response speed (i.e., cooling rate), along with the lack of systematic design guidelines necessary to build actuators using LCE, has significantly limited its widespread adoption, especially for soft robotic applications at the meso-scale (i.e., cm-scale). In this work, we developed thermally-driven soft actuators by bundling liquid crystal elastomer units with integrated cooling that increased the response speed by over 400% when compared to relying only on passive cooling. We developed and experimentally validated an electro-thermo-mechanical model to predict the forces and cooling rates of our actuator and established systematic design guidelines to build our actuators for different soft robotic applications. Using our proposed guidelines, we present an inchworm-inspired locomotion robot that can achieve a top speed of 6 body lengths per minute. We also present a textile forearm cuff with integrated haptic feedback that can provide over 4 mm of skin stretch feedback with a cooling rate of 1 second. Overall, the presented actuator, experimental results, and design guidelines expand the potential use cases for thermally-driven actuators in soft robotic applications at the meso-scale.

Abstract:
We present a novel framework that integrates Large Language Models (LLMs) with automated planning and formal verification to streamline the creation and use of Markov Decision Processes (MDPs). Our system leverages LLMs to extract structured knowledge in the form of a Prolog knowledge base from natural language (NL) descriptions. It then automatically constructs an MDP through reachability analysis, and synthesizes optimal policies using the Storm model checker. The resulting policy is exported as a state-action table for execution. We validate the framework in two human-robot interaction scenarios, demonstrating its ability to produce executable policies with minimal manual effort. This work highlights the potential of combining language models with formal methods to enable more accessible and scalable probabilistic planning in robotics.
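
To make the last step of this pipeline concrete, the sketch below solves a tiny MDP and exports the result as a state-action table. It uses plain value iteration rather than the Storm model checker named in the abstract, and the two-state model, its action names, its probabilities, and the discount factor are all hypothetical values chosen for illustration.

```python
def q_value(outcomes, values, gamma):
    """Expected discounted return of one action, given (prob, next_state, reward) outcomes."""
    return sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)


def value_iteration(transitions, gamma=0.95, tol=1e-10):
    """Solve an MDP given as transitions[state][action] = [(prob, next_state, reward), ...].

    States with an empty action dict are treated as terminal (value 0).
    Returns the state values and a state-action table (the greedy policy).
    """
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:
                continue
            best = max(q_value(outs, values, gamma) for outs in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    policy = {
        s: max(actions, key=lambda a: q_value(actions[a], values, gamma))
        for s, actions in transitions.items()
        if actions
    }
    return values, policy


# Hypothetical human-robot interaction toy model: asking succeeds 80% of the time.
toy_mdp = {
    "idle": {
        "ask": [(0.8, "done", 1.0), (0.2, "idle", 0.0)],
        "wait": [(1.0, "idle", 0.0)],
    },
    "done": {},
}
values, policy = value_iteration(toy_mdp)
print(policy)  # state-action table for the toy model
```

In this toy model the value of `idle` satisfies V = 0.8 + 0.2 · 0.95 · V, so the iteration converges to 0.8 / 0.81.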

Abstract:
This paper presents a single-motor underactuated gripper with variable stiffness, designed for food bin-picking tasks. The gripper employs highly compliant fingers that can passively adapt to cluttered environments and enclose target objects. Upon grasping, tendon-driven actuation increases structural stiffness, enabling low-damage, error-tolerant manipulation. Experiments on a variety of food items demonstrate robust grasping performance and stable object handling.

Abstract:
Robotic agri-food manipulation remains challenging because food items vary substantially in geometry, compliance, mass distribution, and surface properties, while their fragile nature makes grasping sensitive to small pose errors. This work presents a compact simulation-based study of how grasp topology affects robustness and mechanics-level interaction behavior in a reconfigurable four-finger gripper. Using AGX Dynamics, we evaluate three grasp configurations across representative agri-food objects under controlled yaw and planar-offset perturbations. The results show that spherical grasping is most robust to planar misplacement, torque is more perturbation-sensitive than force, and friction demand is governed more by object geometry than by grasp configuration. These findings provide an interpretable basis for robust and damage-aware configuration selection in agri-food manipulation.

Abstract:
This study addresses the challenge of estimating the three-dimensional (3D) position of a needle tip from two-dimensional (2D) X-ray images. We propose a classical image processing-based framework for needle tip localization and 3D reconstruction. The method first detects a circular marker attached to the robotic end-effector that controls the needle insertion and identifies the needle head position within the marker. Preprocessing steps, including bilateral filtering, thresholding, and iterative morphological operations, are applied to improve image quality and ensure the continuity of the needle shaft. A flood-fill algorithm is then used to segment the needle body, after which the needle trajectory is extracted using the A* algorithm. Finally, the 3D position of the needle tip is reconstructed by triangulation from multiple X-ray images acquired at different viewing angles.
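
The final triangulation step can be illustrated with a standard linear (DLT) triangulation, sketched below. This is a generic technique and not necessarily the authors' implementation: it assumes each X-ray view is described by a known 3x4 projection matrix, and the intrinsics, view geometry, and tip position used here are hypothetical synthetic values.

```python
import numpy as np


def triangulate_point(projections, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views.

    projections: sequence of 3x4 projection matrices, one per view.
    pixels: sequence of (u, v) image coordinates of the same point, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        P = np.asarray(P, dtype=float)
        # Each view contributes two linear equations in the homogeneous point X.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def project(P, point):
    """Project a 3D point with projection matrix P and return (u, v)."""
    x = P @ np.append(point, 1.0)
    return x[:2] / x[2]


# Hypothetical two-view setup: second view rotated 30 degrees about the vertical axis.
K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
theta = np.deg2rad(30.0)
R2 = np.array([
    [np.cos(theta), 0.0, np.sin(theta)],
    [0.0, 1.0, 0.0],
    [-np.sin(theta), 0.0, np.cos(theta)],
])
t2 = np.array([[-0.2], [0.0], [0.05]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R2, t2])

tip_true = np.array([0.02, -0.01, 0.5])
tip_est = triangulate_point([P1, P2], [project(P1, tip_true), project(P2, tip_true)])
print(tip_est)
```

With noise-free detections the estimate matches the true point up to numerical precision; with real detections, more than two views help average out localization error.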

Abstract:
Bipedal robots are inherently prone to falling due to their higher center of mass and narrower support polygon, making automatic fall recovery a long-standing challenge. Existing approaches often rely on posture-specific strategies or exhibit limited robustness and generalization, restricting their real-world applicability. We present a unified Mixture-of-Experts (MoE) framework that trains a single policy capable of recovering from diverse fallen configurations. By leveraging base height estimation and proprioceptive history within a gating mechanism, the framework dynamically allocates recovery tasks to specialized experts, yielding smooth and stable motions. Extensive real-world experiments show that the policy transfers zero-shot to hardware and consistently achieves recovery not only under repeated disturbances, but also from highly challenging postures and even on inclined slopes, demonstrating robustness and generalization beyond prior methods.

Abstract:
With the growing employment of learning algorithms in robotic applications, research on reinforcement learning for bipedal locomotion has become a central topic for humanoid robotics. While recently published contributions achieve high success rates in locomotion tasks, scarce attention has been devoted to the development of methods that handle hardware faults that may occur during the locomotion process. However, in real-world settings, environmental disturbances or sudden occurrences of hardware faults might yield severe consequences. To address these issues, this paper presents TOLEBI, a fault-tolerant control framework for bipedal locomotion that handles faults and external disturbances on the robot during operation. Specifically, joint locking, power loss and external disturbances are injected in simulation to learn fault-tolerant locomotion strategies. In addition to transferring the learned policy to the real robot via sim-to-real transfer, an online joint status estimator is incorporated. This module classifies joint conditions by referring to the actual observations at runtime under real-world conditions. The validation experiments conducted both in the real world and in simulation with the humanoid robot TOCABI highlight the applicability of the proposed approach. To our knowledge, this manuscript provides the first learning-based fault-tolerant framework for bipedal locomotion, thereby fostering the development of efficient learning methods in this field.

Abstract:
Accurate trajectory tracking is crucial in aerial robotics. Optimal control methods such as Nonlinear Model Predictive Control (NMPC) are able to track trajectories exploiting the full nonlinear dynamics while respecting constraints. However, the model-based nature of NMPC makes it sensitive to mismatches between nominal and real models. A common workaround to mitigate the effects of model uncertainties is to implement an inner-loop controller which robustifies the NMPC outer-loop. However, this inner-loop is usually based on purely feedback-based controllers such as PID or Incremental Nonlinear Dynamic Inversion (INDI), which cannot account for constraints (such as limited actuation) or optimization criteria. In contrast, in this work we propose an optimization-based inner-loop controller inspired by Time Delay Control (TDC) that, thanks to a Quadratic Program (QP) formulation, is able to respect constraints and can thus preserve stability in the presence of input saturation and model mismatches. Furthermore, thanks to the use of acceleration feedback, the proposed inner-loop does not require knowledge of inertial parameters, which makes it even more robust against model uncertainties. The overall architecture is validated on a fully-actuated hexarotor under model mismatches and aggressive trajectories. The experiments clearly show that our QP-based inner-loop improves the NMPC tracking performance while preserving stability in conditions where a non-optimal (and more classical) inner-loop controller would fail.

Abstract:
This paper presents a novel approach that combines the advantages of both model-based and learning-based frameworks to achieve robust locomotion. The residual modules are integrated with each corresponding part of the model-based framework, a footstep planner and dynamic model designed using heuristics, to complement performance degradation caused by a model mismatch. By utilizing a modular structure and selecting the appropriate learning-based method for each residual module, our framework demonstrates improved control performance in environments with high uncertainty, while also achieving higher learning efficiency compared to baseline methods. Moreover, we observed that our proposed methodology not only enhances control performance but also provides additional benefits, such as making nominal controllers more robust to parameter tuning. To investigate the feasibility of our framework, we demonstrated residual modules combined with model predictive control in a real quadrupedal robot. Despite uncertainties beyond the simulation, the robot successfully maintains balance and tracks the commanded velocity.

Abstract:
Monocular 3D object detection has gained attention for its cost-efficiency and simpler setup compared to multi-sensor systems. In this task, accurate depth estimation is crucial for precise object localization, yet extracting sufficient depth cues from a single image remains inherently challenging. Moreover, when occlusions occur, structural cues become limited, making precise object localization even more difficult. To address these problems, we propose MonoKey, a keypoint-based monocular 3D object detection method robust to occlusion. MonoKey leverages 2D keypoints due to their suitability for recovering occluded regions. The occlusion-robust 2D keypoint detection approach estimates keypoints and reconstructs occluded ones by using prior information. The frequency-based global-local depth predictor estimates 3D cues using fast Fourier convolution to incorporate both global and local context. These 3D cues and keypoints are then fused in a 3D detection decoder. Additionally, relational graph refinement adjusts initial bounding boxes for improved localization. Experimental results indicate that MonoKey outperforms the existing monocular 3D object detection methods. The source code is available at https://anonymous.4open.science/r/MonoKey-B72B.

Abstract:
We present FilterDDP, a differential dynamic programming algorithm for solving discrete-time, optimal control problems (OCPs) with nonlinear equality constraints. Unlike prior methods based on merit functions or the augmented Lagrangian class of algorithms, FilterDDP uses a step filter in conjunction with a line search to handle equality constraints. We identify two important design choices for the step filter criteria which lead to robust numerical performance: 1) we use the Lagrangian instead of the cost in the step acceptance criterion and, 2) in the backward pass, we perturb the value function Hessian. Both choices are rigorously justified, for 2) in particular by a formal proof of local quadratic convergence. In addition to providing a primal-dual interior point extension for handling OCPs with both equality and inequality constraints, we validate FilterDDP on three contact implicit trajectory optimisation problems which arise in robotics.

Abstract:
Tactile sensing allows robots to gather detailed geometric information about objects through physical interaction, complementing vision-based approaches. However, efficiently acquiring useful tactile data remains challenging due to the time-consuming nature of physical contact and the need to strategically choose contact locations that maximize information gain while minimizing physical interactions. This paper studies how different contact modes affect object shape reconstruction using a tactile-enabled dexterous gripper. We compare three contact interaction modes: grasp-releasing, sliding induced by finger-grazing, and palm-rolling. These contact modes are combined with an information-theoretic exploration framework that guides subsequent sampling locations using a shape completion model. Our results show that the improved tactile sensing efficiency of finger-grazing and palm-rolling translates into faster convergence in shape reconstruction, requiring 34% fewer physical interactions while improving reconstruction accuracy by 55%. We validate our approach using a UR5e robot arm equipped with an Inspire-Robots Dexterous Hand, showing robust performance across primitive object geometries.

Abstract:
Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). While current deep learning methods have made significant advancements, challenges remain in achieving an optimal balance between accuracy, robustness, and inference speed. To address these challenges, we propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images. To effectively integrate multi-scale features from FE-Mamba, we then introduce a novel Multidimensional Feature Fusion (MFF) module. Experiments against the state-of-the-art on the ex-vivo SCARED benchmark demonstrate that StereoMamba achieves superior performance on EPE of 2.64 px and depth MAE of 2.55 mm, the second-best performance on Bad2 of 41.49% and Bad3 of 26.99%, while maintaining an inference speed of 21.28 FPS for a pair of high-resolution images (1280×1024), striking the optimum balance between accuracy, robustness, and efficiency. Furthermore, by comparing synthesized right images, generated from warping left images using the generated disparity maps, with the actual right image, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761), exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets.
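
For readers unfamiliar with the metrics quoted in this abstract, the sketch below computes them under their usual definitions: end-point error (EPE) as the mean absolute disparity error in pixels, and BadN as the percentage of valid pixels whose error exceeds N pixels. The benchmark's exact evaluation protocol (masking, cropping, thresholds) may differ, and the small arrays here are made-up values for illustration.

```python
import numpy as np


def disparity_metrics(pred, gt, valid=None):
    """Return EPE (px) and Bad2/Bad3 (%) over valid pixels.

    pred, gt: disparity maps of the same shape.
    valid: optional boolean mask; defaults to pixels where gt is finite.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if valid is None:
        valid = np.isfinite(gt)
    err = np.abs(pred - gt)[valid]
    return {
        "EPE": float(err.mean()),
        "Bad2": float((err > 2.0).mean() * 100.0),
        "Bad3": float((err > 3.0).mean() * 100.0),
    }


# Made-up 2x2 example: per-pixel errors are 0.5, 2.5, 0.0, and 4.0 px.
gt = np.array([[10.0, 20.0], [30.0, 40.0]])
pred = np.array([[10.5, 22.5], [30.0, 44.0]])
metrics = disparity_metrics(pred, gt)
print(metrics)  # {'EPE': 1.75, 'Bad2': 50.0, 'Bad3': 25.0}
```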

Abstract:
Motion profile optimization is a powerful technique that reduces the energy consumption of robotic systems by changing the temporal profile of the joint position setpoints. However, despite the extensive exploration of these techniques for robotic systems following constrained paths, many existing methodologies rely on complex optimization processes or a larger number of design parameters. This paper introduces a novel approach that leverages the Chebyshev basis to optimize the motion along a fixed geometric path, thereby achieving a measured torque difference of -15% while requiring only a limited number of design parameters. By employing the Chebyshev basis, the formulation leads to a smooth objective function and enables the definition of linear inequality constraints that accurately enclose the feasible design space. This unique combination of features not only simplifies the optimization problem but also enhances the probability of locating the global optimum, particularly illustrated in the two-dimensional case. The methodology is established in a generic and model-independent manner, setting a promising direction for future research in motion profile optimization for constrained-path robotic systems.

Abstract:
Object-Goal Navigation in dynamic environments remains challenging because many existing approaches rely primarily on reactive mapping and lack the ability to retain historical experience or establish structured memory associations. To address this limitation, we introduce MemClaw-RAG, an embodied multimodal framework. MemClaw-RAG includes three main components: (1) a Memory Graph Retrieval (MGR) module that leverages multimodal knowledge graphs to support semantic association and target retrieval; (2) a SelfClaw cognitive module that manages skill scheduling and task execution through memory-aware reasoning; and (3) a Hybrid Adaptive Locomotion Policy (HALP) based on deep reinforcement learning that enables efficient locomotion for wheeled-legged robots across different terrain conditions. On Habitat benchmarks, MemClaw-RAG achieves a Success Rate (SR) of 0.81 and a Success-weighted Path Length (SPL) of 0.51 on the Gibson and HM3D datasets. In the more challenging multi-layer environments of MP3D, the proposed method achieves an SR of 0.76 and an SPL of 0.48, outperforming several representative memory-based and end-to-end navigation approaches. Real-world deployment on a Unitree wheeled-legged robot further demonstrates the practicality of the system, achieving an average per-step inference latency of 55 ms on a Jetson Orin platform while maintaining stable navigation behavior in dynamic indoor environments.
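
As a reminder of how the two navigation metrics in this abstract are usually defined, the sketch below computes SR as the fraction of successful episodes and SPL as the mean of S_i · l_i / max(p_i, l_i), where l_i is the shortest-path length and p_i the length of the path actually taken. This follows the common definition of SPL; the paper's evaluation code may differ in detail, and the three episodes are made-up values.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success."""
    return sum(1 for s in successes if s) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes."""
    total = 0.0
    for s, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if s:
            # An optimal path scores 1; longer paths are penalized proportionally.
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Made-up episodes: optimal success, success with a 2x detour, and a failure.
successes = [True, True, False]
shortest_lengths = [5.0, 4.0, 6.0]
path_lengths = [5.0, 8.0, 10.0]

sr_value = success_rate(successes)
spl_value = spl(successes, shortest_lengths, path_lengths)
print(sr_value, spl_value)
```

Here the per-episode terms are 1.0, 0.5, and 0.0, so SPL is 0.5 while SR is 2/3, which shows why SPL is always at most SR.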

Abstract:
This paper presents an Adaptive Gain Nonlinear Observer (AGNO) for estimating the external interaction wrench (forces and torques) in human-UAV physical interaction for assistive payload transportation. The proposed AGNO uses the full nonlinear dynamic model to achieve an accurate and robust wrench estimation without relying on dedicated force-torque sensors. A key feature of this approach is the explicit consideration of the non-constant inertia matrix, which is essential for aerial systems with asymmetric mass distribution or shifting payloads. A comprehensive dynamic model of a cooperative transportation system composed of two quadrotors and a shared payload is derived, and the stability of the observer is rigorously established using Lyapunov-based analysis. Simulation results validate the effectiveness of the proposed observer in enabling intuitive and safe human-UAV interaction. Comparative evaluations demonstrate that the proposed AGNO outperforms an Extended Kalman Filter (EKF) in terms of estimation root mean square errors (RMSE), particularly for torque estimation under nonlinear interaction conditions. This approach reduces system weight and cost by eliminating additional sensing hardware, enhancing practical feasibility.

Abstract:
This paper introduces DynaFlow, a novel framework that embeds a differentiable simulator directly into a flow matching model. By generating trajectories in the action space and mapping them to dynamically feasible state trajectories via the simulator, DynaFlow ensures all outputs are physically consistent by construction. This end-to-end differentiable architecture enables training on state-only demonstrations, allowing the model to simultaneously generate physically consistent state trajectories while inferring the underlying action sequences required to produce them. We demonstrate the effectiveness of our approach through quantitative evaluations and showcase its real-world applicability by deploying the generated actions onto a physical Go1 quadruped robot. The robot successfully reproduces the diverse gaits present in the dataset, executes long-horizon motions in open-loop control, and translates infeasible kinematic demonstrations into dynamically executable, stylistic behaviors. These hardware experiments validate that DynaFlow produces deployable, highly effective motions on real-world hardware from state-only demonstrations, effectively bridging the gap between kinematic data and real-world execution.

Abstract:
Pipeline inspection is essential for maintaining the safety of critical infrastructure, but manual inspection is dangerous and inefficient, and existing robotic solutions struggle to handle curved and constrained surfaces. Traditional planning methods are either computationally expensive or prone to redundancy and discretization artifacts. To address these challenges, this paper proposes a centerline-aligned Frenet graph framework for surface-based path planning in pipeline environments. By embedding the pipeline surface into a structured two-dimensional manifold passing through the pipeline's central axis, the framework enables efficient heuristic search while maintaining geometric consistency. By combining quadratic programming with kinematic limits, an initial geodesic constrained path is generated and optimized, resulting in a smooth and executable trajectory. Extensive experiments on pipelines with sharp bends, intersections, and real-world pipeline environments demonstrate significant improvements in computational efficiency, path quality, and robustness compared to traditional methods.

Abstract:
Mass-casualty incidents demand rapid and accurate triage, but the scale and acuity of injuries often overwhelm available medical personnel. To address this, we present a system that enables ground and aerial robots to localize and assess casualties using non-contact sensors, including color and thermal cameras, millimeter wave radar, and microphones. Injury and vital sign measurements from modality-specific classifiers are fused using a probabilistic model that captures correlations between injury states and supports distributed, asynchronous evidence accumulation. We validate the system through a series of timed mass-casualty field experiments using custom-built drones and Boston Dynamics Spot ground robots customized for robotic medical triage, demonstrating reliable estimation of casualty states and robustness to noisy conditions and sensor drop out.

Abstract:
Robotic grasping requires flexible reconfiguration to handle diverse objects and tasks. This paper proposes a monorail-like reconfiguration framework for robotic grippers, inspired by train-rail relationships, that generates diverse finger layouts. The proposed framework unifies two complementary forms: dynamic reconfiguration, in which finger units move along an arbitrary non-circular track defined by the palm shape (palm track) to change the finger layout, and modular reconfiguration, in which the palm track shape and the number of fingers are modified to alter the achievable finger layout space. We developed a prototype gripper system that embodies the proposed framework and experimentally validated its unified reconfiguration capability. Dynamic reconfiguration with the S-shaped palm achieved seven distinct finger layouts with successful object grasping, while on-the-fly modular reconfiguration expanded the achievable finger layout space, enabling rapid adaptation to different grasping tasks. This work establishes a new design principle for reconfigurable grippers toward highly adaptive and versatile robotic grasping.

Abstract:
In warehouse environments, corridor conflicts often lead to traffic congestion, which results in many Multi-Agent Path Finding (MAPF) algorithms failing to find solutions within a reasonable time limit. Previous works have studied using corridor reasoning techniques or incorporating guidance, such as highways, to address this problem. However, these approaches often either encounter timeouts or yield low-quality solutions. In this work, based on Conflict-Based Search (CBS), we propose a technique called Reversible Lanes, specifically designed to address corridor conflicts by imposing a new constraint that forces agents to use bypasses for conflict resolution. Our approach is motivated by three key observations from prior research: (1) in warehouse maps, the overhead associated with maintaining solution optimality via corridor reasoning techniques is often disproportionate to the benefits gained; (2) the fixed nature of manually designed highways exhibits a lack of adaptability, leading to poor solution quality on certain instances; and (3) the structural properties of warehouse layouts render direct bypass usage feasible and incur minimal additional costs. Theoretically, we demonstrate the feasibility of our algorithm by analyzing its relationship to both corridor reasoning techniques and highways. Experimentally, the results show that our algorithm provides a more effective approach for resolving corridor conflicts compared to these existing methods, achieving a superior trade-off between solution quality and computational efficiency by finding near-optimal solutions with reduced runtime.

Abstract:
Visual SLAM algorithms achieve significant improvements through the exploration of 3D Gaussian Splatting (3DGS) representations, particularly in generating high-fidelity dense maps. However, they depend on a static environment assumption and experience significant performance degradation in dynamic environments. This paper presents GGD-SLAM, a framework that employs a generalizable motion model to address the challenges of localization and dense mapping in dynamic environments without predefined semantic annotations or depth input. Specifically, the proposed system employs a First-In-First-Out (FIFO) queue to manage incoming frames, facilitating dynamic semantic feature extraction through a sequential attention mechanism. This is integrated with a dynamic feature enhancer to separate static and dynamic components. Additionally, to minimize dynamic distractors' impact on the static components, we devise a method to fill occluded areas via static information sampling and design a distractor-adaptive Structure Similarity Index Measure (SSIM) loss tailored for dynamic environments, significantly enhancing the system's resilience. Experiments conducted on real-world dynamic datasets demonstrate that the proposed system achieves state-of-the-art performance in camera pose estimation and dense reconstruction in dynamic scenes.

Abstract:
Most robot manipulation focuses on changing the kinematic state of objects: picking, placing, opening, or rotating them. However, a wide range of real-world manipulation tasks involve a different class of object state change (such as mashing, spreading, or slicing) where the object's physical and visual state evolve progressively without necessarily changing its position. We present SPARTA, the first unified framework for the family of object state change manipulation tasks. Our key insight is that these tasks share a common structural pattern: they involve spatially-progressing, object-centric changes that can be represented as regions transitioning from an actionable to a transformed state. Building on this insight, SPARTA integrates spatially progressing object change segmentation maps, a visual skill to perceive actionable vs. transformed regions for specific object state change tasks, to generate a) structured policy observations that strip away appearance variability, and b) dense rewards that capture incremental progress over time. These are leveraged in two SPARTA policy variants: reinforcement learning for fine-grained control without demonstrations or simulation; and greedy control for fast, lightweight deployment. We validate SPARTA on a real robot for three challenging tasks across 10 diverse real-world objects, achieving significant improvements in training time and accuracy over sparse rewards and visual goal-conditioned baselines. Our results highlight progress-aware visual representations as a versatile foundation for the broader family of object state manipulation tasks. More information at https://vision.cs.utexas.edu/projects/sparta-robot

Abstract:
In whole-body mobile manipulation, existing teleoperation systems often suffer from high complexity and cost, while imitation learning approaches are frequently limited by insufficient modeling of long-horizon action sequences and inadequate fusion of multi-receptive-field visual features. These constraints significantly hinder the collection of high-quality demonstration data and the effective transfer of complex robotic skills. To address these challenges, this paper proposes an integrated exoskeleton-VR teleoperation system that enables single-operator whole-body control of mobile manipulators with basic force feedback, substantially reducing the cost of data collection while improving demonstration quality. Furthermore, we introduce MIMO, an encoder-decoder imitation learning framework, which incorporates an Efficient Context Modeling Network (ECM-Net) based on linear-complexity temporal modeling to mitigate error accumulation in long-horizon tasks, and a Multi-Receptive Field Fusion Network (MRF-Net) that employs dual-path attention to achieve precise alignment between multi-scale visual cues and motion phases. Real-world experiments on a mobile manipulator demonstrate that MIMO consistently outperforms state-of-the-art baselines across multiple whole-body mobile manipulation tasks, confirming its effectiveness in long-horizon, fine-grained robotic control.

Abstract:
The paper presents a new approach for constructing a library of optimal trajectories for two robotic manipulators, the Two-Arm Optimal Control and Avoidance Library (TOCALib). The optimization takes into account kinodynamic and other constraints within the FROST framework. The novelty of the method lies in handling collisions with the DCOL method, which yields symbolic expressions for assessing the presence of collisions that can be used in gradient-based optimal control methods. The proposed approach is applicable to complex bimanual manipulations that require precision. In this paper, we test TOCALib on the Mobile Aloha robot as an example. The approach can be extended to other bimanual robots, as well as to gait control of bipedal robots. It can also be used to construct training data for machine learning tasks in manipulation.

Abstract:
Wheeled-legged hybrid robots have generated growing interest in the research community due to the need for more efficient and versatile locomotion. Most recent research has focused on active wheels, but passive wheeled systems have great potential in improving energy efficiency. However, skating remains highly complex due to the difficulties of balancing dynamic motion, managing wheel-ground interactions, achieving precise torque control for smooth rolling, and adapting to unpredictable terrain while maintaining stability. We present an end-to-end model-free reinforcement learning approach that enables quadrupedal robots to skate efficiently, achieving agile and robust locomotion on both flat and rough terrain. Our skating-specific policy and sim-to-real pipeline are validated on a physical quadruped across diverse terrains with varying roughness, slopes, and features, consistently demonstrating controlled and efficient traversal. The robot achieves velocities up to 1.5 m/s with a cost of transport 40.9% lower than the skating state of the art and 70.9% lower than standard legged locomotion. These results establish skating as a feasible and efficient alternative mode of urban locomotion for quadrupedal robots, setting a foundation for future wheeled-legged research.
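
The cost-of-transport comparison above can be made concrete with a small sketch. It assumes the standard dimensionless definition CoT = P / (m g v); the abstract does not state how the authors measured power, and the robot mass and power values below are hypothetical placeholders, not numbers from the paper.

```python
G = 9.81  # gravitational acceleration in m/s^2


def cost_of_transport(power_w, mass_kg, speed_mps):
    """Dimensionless cost of transport, P / (m * g * v)."""
    if mass_kg <= 0 or speed_mps <= 0:
        raise ValueError("mass and speed must be positive")
    return power_w / (mass_kg * G * speed_mps)


def relative_reduction(baseline, new):
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Hypothetical measurements for a 15 kg quadruped (illustration only).
walking = cost_of_transport(power_w=300.0, mass_kg=15.0, speed_mps=1.0)
skating = cost_of_transport(power_w=130.0, mass_kg=15.0, speed_mps=1.5)
print(f"walking CoT: {walking:.2f}, skating CoT: {skating:.2f}")
print(f"reduction: {relative_reduction(walking, skating):.1%}")
```

Because the metric normalizes by weight and speed, it allows gaits measured at different velocities to be compared directly.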

Abstract:
Modeling environmental forces remains a critical challenge in the design and control of robots operating on granular terrain. In pushing locomotion, propulsion is generated by displacing a large number of particles; however, the resulting terrain deformation complicates accurate real-time force prediction. Most existing resistive force models do not explicitly account for these deformation effects. To address this limitation, we develop a force model that incorporates motion-induced terrain deformation for pushing motions in granular media. A wheel lug is adopted as a representative element. We first investigate translational motion using the discrete element method (DEM) to characterize terrain deformation under different velocity directions. The analysis identifies dominant deformation patterns, which are embedded in the force model. Building on this analysis, we examine the rotational motion of a single lug through experiments, DEM simulations, and model predictions. The results demonstrate that the proposed model accurately captures force responses across varying velocity directions, exhibiting closer agreement with DEM and experiments than conventional approaches. This work advances real-time force modeling for robot-granular terrain interactions and highlights the potential of deformation-integrated models in deformable environments.

Abstract:
Robots operating in everyday environments must understand fine-grained human actions, intentions, and contextual cues from broad views where people occupy only small regions, a capability unmet by current systems. While open-vocabulary action recognition methods remain limited to assigning predefined labels, and vision-language models (VLMs) face an inherent trade-off between informational richness and factual fidelity in their outputs, neither approach achieves the deep semantic interpretation required for reliable human-robot interaction. We propose Gold Points Sniper (GPS), a novel framework that empowers lightweight VLMs with self-guided multimodal reasoning capabilities for fine-grained human action understanding. Our approach comprises three key modules: Gold Points Extractor trains VLMs to identify critical action-relevant details, Selective Socratic Questioner validates and refines these details through selective self-questioning, and Semantic Entailment Evaluator quantitatively assesses factual consistency using semantic entailment classification. Extensive experiments on our curated instruction-tuning dataset based on the CAP benchmark demonstrate that GPS-enhanced lightweight VLMs achieve substantial performance improvements, with some models reaching performance comparable to proprietary GPT-4o while maintaining superior factual accuracy. Our work establishes a reliable foundation for fine-grained action understanding in domestic robotics, enabling robots to safely interpret human behavior through information-dense yet factually grounded descriptions. Source code, training configurations, annotation prompts, and dataset details are released at https://github.com/Haodi-Liu/GPS-Gold-Point-Sniper.

Abstract:
Cyber-physical robotic systems are vulnerable to false data injection attacks (FDIAs), in which an adversary corrupts sensor signals while evading residual-based passive anomaly detectors such as the chi-squared test. Such stealthy attacks can induce substantial end-effector deviations without triggering alarms. This paper studies the resilience of redundant manipulators to stealthy FDIAs and advances the architecture from passive monitoring to active defence. We formulate a closed-loop model comprising a feedback-linearized manipulator, a steady-state Kalman filter, and a chi-squared-based anomaly detector. Building on this passive monitoring layer, we propose an active control-level defence that attenuates the control input through a monotone function of an anomaly score generated by a novel actuation-projected, measurement-free state predictor. The proposed design provides probabilistic guarantees on nominal actuation loss and preserves closed-loop stability. From the attacker perspective, we derive a convex QCQP for computing one-step optimal stealthy attacks. Simulations on a 6-DOF planar manipulator show that the proposed defence significantly reduces attack-induced end-effector deviation while preserving nominal task performance in the absence of attacks.
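
As a minimal sketch of the passive monitoring layer mentioned above, the following implements a generic chi-squared test on a two-dimensional filter residual. It is not the authors' detector or their active defence: the residual dimension, the covariance values, and the 5% false-alarm rate are assumptions chosen for illustration. For two degrees of freedom the chi-squared quantile has the closed form -2 ln(alpha), which keeps the example free of external libraries.

```python
import math


def chi2_threshold_2dof(false_alarm_rate):
    """(1 - alpha) quantile of a chi-squared distribution with 2 DoF.

    For 2 DoF the CDF is 1 - exp(-x / 2), so the quantile is -2 ln(alpha).
    """
    if not 0.0 < false_alarm_rate < 1.0:
        raise ValueError("false_alarm_rate must lie in (0, 1)")
    return -2.0 * math.log(false_alarm_rate)


def chi2_statistic(residual, cov):
    """Compute r^T S^{-1} r for a 2-D residual r and 2x2 covariance S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    r0, r1 = residual
    # Apply the closed-form inverse of a 2x2 matrix to the residual.
    x0 = (d * r0 - b * r1) / det
    x1 = (-c * r0 + a * r1) / det
    return r0 * x0 + r1 * x1


def is_anomalous(residual, cov, false_alarm_rate=0.05):
    """Raise an alarm when the statistic exceeds the chi-squared threshold."""
    return chi2_statistic(residual, cov) > chi2_threshold_2dof(false_alarm_rate)


identity = ((1.0, 0.0), (0.0, 1.0))
print(is_anomalous((3.0, 0.0), identity))  # large residual
print(is_anomalous((1.0, 1.0), identity))  # small residual
```

A stealthy attack in the sense of the abstract is one that keeps this statistic below the threshold at every step, which is why a purely residual-based detector can miss it.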

Abstract:
Continuum robots are promising for assistive manipulation, but often lack the stiffness and payload capacity required for real-world tasks. This paper investigates the feasibility of a novel dual-mode, gravity-assisted ceiling-mounted articulated discrete serial robot that transitions between passive and active states using a friction-based shape-locking mechanism. In passive mode, joints are unlocked, allowing for chain-like flexibility similar to that of ceiling hoists. In active mode, joints are locked, allowing for rigid and accurate manipulation. To evaluate feasibility, we implemented a reduced-scale prototype with two passive joints and one active joint. We tracked its accuracy across 300 iterations of point-to-point motion in a 2D plane. Results show high repeatability and robustness, highlighting the potential of this architecture for ceiling-mounted manipulation. Beyond healthcare tasks such as patient handling, this approach contributes a scalable actuation and shape-locking strategy for articulated discrete serial robots in constrained environments.

Abstract:
Imitation from videos often fails when expert demonstrations and learner environments exhibit domain shifts, such as discrepancies in lighting, color, or texture. While visual randomization partially addresses this problem by augmenting training data, it remains computationally intensive and inherently reactive, struggling with unseen scenarios. We propose a different approach: instead of randomizing appearances, we eliminate their influence entirely by rethinking the sensory representation itself. Inspired by biological vision systems that prioritize temporal transients (e.g., retinal ganglion cells) and by recent sensor advancements, we introduce event-inspired perception for visually robust imitation. Our method converts standard RGB videos into a sparse, event-based representation that encodes temporal intensity gradients, discarding static appearance features. This biologically grounded approach disentangles motion dynamics from visual style, enabling robust visual imitation from observations even in the presence of visual mismatches between expert and agent environments. By training policies on event streams, we achieve invariance to appearance-based distractors without requiring computationally expensive and environment-specific data augmentation techniques. Experiments across the DeepMind Control Suite and the Adroit platform for dynamic dexterous manipulation show the efficacy of our method.
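
One common way to approximate an event representation from ordinary frames is to threshold the per-pixel change in log intensity between consecutive frames. The sketch below follows that convention; the abstract does not specify the authors' conversion, so the function name `to_events`, the threshold, and the `eps` offset are assumptions made for illustration.

```python
import math


def to_events(prev_frame, next_frame, threshold=0.2, eps=1e-3):
    """Map two grayscale frames to a grid of polarities in {-1, 0, +1}.

    Frames are lists of rows with intensities in [0, 1]. A pixel fires when
    its log intensity changes by more than `threshold` between the frames.
    """
    events = []
    for row_prev, row_next in zip(prev_frame, next_frame):
        out_row = []
        for p, n in zip(row_prev, row_next):
            delta = math.log(n + eps) - math.log(p + eps)
            if delta > threshold:
                out_row.append(1)    # brightness increased
            elif delta < -threshold:
                out_row.append(-1)   # brightness decreased
            else:
                out_row.append(0)    # static pixel, discarded
        events.append(out_row)
    return events


frame_a = [[0.2, 0.5], [0.9, 0.4]]
frame_b = [[0.2, 0.9], [0.5, 0.4]]
print(to_events(frame_a, frame_b))
```

Static pixels produce no events regardless of their color or texture, and a uniform brightness gain applied to both frames leaves the log difference nearly unchanged, which is the intuition behind the appearance invariance claimed above.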

Abstract:
We present a framework that integrates EEG-based visual and motor imagery (VI/MI) with robotic control to enable real-time, intention-driven grasping and placement. Motivated by the promise of BCI-driven robotics to enhance human-robot interaction, this system bridges neural signals with physical control by deploying offline-pretrained decoders in a zero-shot manner within an online streaming pipeline. This establishes a dual-channel intent interface that translates visual intent into robotic actions, with VI identifying objects for grasping and MI determining placement poses, enabling intuitive control over both what to grasp and where to place. The system operates solely on EEG via a cue-free imagery protocol, achieving integration and online validation. Implemented on a Base robotic platform and evaluated across diverse scenarios, including occluded targets or varying participant postures, the system achieves online decoding accuracies of 40.23% (VI) and 62.59% (MI), with an end-to-end task success rate of 20.88%. These results demonstrate that high-level visual cognition can be decoded in real time and translated into executable robot commands, bridging the gap between neural signals and physical interaction, and validating the flexibility of a purely imagery-based BCI paradigm for practical human-robot collaboration.

Abstract:
Autonomous driving in complex traffic requires reasoning under uncertainty. Common approaches rely on prediction-based planning or risk-aware control, but these are typically treated in isolation, limiting their ability to capture the coupled nature of action and inference in interactive settings. This gap becomes especially critical in uncertain scenarios, where simply reacting to predictions can lead to unsafe maneuvers or overly conservative behavior. Our central insight is that safe interaction requires not only estimating human behavior but also shaping it when ambiguity poses risks. To this end, we introduce a hierarchical belief model that structures human behavior across coarse discrete intents and fine motion modes, updated via Bayesian inference for interpretable multi-resolution reasoning. On top of this, we develop an active probing strategy that identifies when multimodal ambiguity in human predictions may compromise safety and plans disambiguating actions that both reveal intent and gently steer human decisions toward safer outcomes. Finally, a runtime risk-evaluation layer based on Conditional Value-at-Risk (CVaR) ensures that all probing actions remain within human risk tolerance during influence. Our simulations in lane-merging and unsignaled intersection scenarios demonstrate that our approach achieves higher success rates and shorter completion times compared to existing methods. These results highlight the benefit of coupling belief inference, probing, and risk monitoring, yielding a principled and interpretable framework for planning under uncertainty.
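
The CVaR-based risk check can be illustrated with an empirical estimator: CVaR at level alpha is the mean of the worst (1 - alpha) fraction of sampled losses. This is a generic sketch rather than the authors' risk layer; the helper `within_risk_tolerance`, the loss samples, and the tolerance value are hypothetical.

```python
import math


def cvar(losses, alpha=0.9):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of losses."""
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)
    # Round before ceil to guard against floating-point noise in (1 - alpha) * n.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:k]) / k


def within_risk_tolerance(losses, tolerance, alpha=0.9):
    """Hypothetical gate: accept a probing action only if its CVaR is tolerable."""
    return cvar(losses, alpha) <= tolerance


# Hypothetical sampled losses for one candidate probing action.
samples = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.9, 0.2, 0.1, 1.5]
print(cvar(samples, alpha=0.8))
print(within_risk_tolerance(samples, tolerance=1.0, alpha=0.8))
```

Unlike an expected-loss check, this criterion is driven by the tail of the loss distribution, so a probing action with rare but severe outcomes is rejected even when its average loss is low.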

Abstract:
Achieving safe and reliable automated driving in real-world conditions requires the ability to handle rare and unpredictable situations, commonly known as long-tail scenarios. These cases are often underrepresented in training data and remain a major challenge for conventional motion planning systems. In this work, we present VisuaLLMPlanner, a maneuver planning framework that integrates a multimodal large language model (MLLM) into the high-level decision-making loop of an automated driving pipeline. The system is triggered when the ego vehicle encounters a situation with an obstacle that cannot be resolved by a standard lane-following planner. At this point, a structured input comprising a bird's-eye view image and a textual scene description is generated and passed to the MLLM. Rather than generating plans directly, the model selects from a discrete set of pre-generated and validated maneuver options, allowing for interpretable and structured decision-making. We evaluate our approach on the interPlan benchmark, which focuses explicitly on long-tail scenarios, and demonstrate that VisuaLLMPlanner achieves strong performance in comparison to prior LLM-based planners. The results highlight both the potential and current limitations of foundation models for high-level reasoning in automated vehicle planning.

Abstract:
Physical Human-Robot Interaction (pHRI) requires control frameworks that balance accuracy, compliance, and safety under variable human behaviors. This paper proposes a novel Model Predictive Variable Admittance (MPVA) framework that integrates trajectory tracking, interaction force directionality, and passivity constraints into an online real-time optimization scheme. The proposed architecture is implemented on a 7-DoF Kinova Jaco-2 robot and validated experimentally through mixed assistive and resistive modes with multiple subjects performing pHRI tasks. Results supported by both objective metrics and subjective evaluation through a NASA TLX survey show that the MPVA achieves competitive tracking accuracy, reducing physical effort with minimal passivity violations compared to other algorithmic baselines such as fixed-gain admittance and fuzzy-based adaptive admittance. This demonstrates safe and effective human-robot physical interaction across diverse modes.

Abstract:
We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Experiments on public benchmarks and real-world datasets demonstrate that our method produces higher-quality videos and more accurate actions, significantly outperforming existing baselines and offering a scalable framework for leveraging large-scale video data for robotic learning.

Abstract:
Modern manufacturing demands robotic assembly systems with enhanced flexibility and reliability. However, traditional approaches often rely on programming tailored to each product by experts for fixed settings, which are inherently inflexible to product changes and lack the robustness to handle variations. As Behavior Trees (BTs) are increasingly used in robotics for their modularity and reactivity, we propose a novel hierarchical framework, Video-to-BT, that seamlessly integrates high-level cognitive planning with low-level reactive control, with BTs serving both as the structured output of planning and as the governing structure for execution. Our approach leverages a Vision-Language Model (VLM) to decompose human demonstration videos into subtasks, from which Behavior Trees are generated. During the execution, the planned BTs combined with real-time scene interpretation enable the system to operate reactively in the dynamic environment, while VLM-driven replanning is triggered upon execution failure. This closed-loop architecture ensures stability and adaptivity. We validate our framework on real-world assembly tasks through a series of experiments, demonstrating high planning reliability, robust performance in long-horizon assembly tasks, and strong generalization across diverse and perturbed conditions. Project website: https://video2bt.github.io/video2bt_page/

Abstract:
Estimating a device's 6DoF pose (i.e., location and orientation) within the environment is a fundamental problem in robotics and beyond. Millimeter-wave (mmWave) radars have emerged as an attractive alternative to optical sensors (e.g., RGB cameras) in these tasks due to their ability to operate in poor lighting and adverse conditions such as smoke and fog. This paper presents mmNeRF, a view synthesis and 6DoF pose estimation system based on neural radiance fields (NeRF) designed specifically for mmWave radars. mmNeRF requires only radar range-angle heatmaps collected in a given environment to construct its implicit neural representation, ensuring multi-view consistency and producing high-quality view synthesis. It then builds a 6DoF pose estimation framework that queries the neural model with particle filters to perform scan & matching operations to yield an accurate 6DoF pose. We evaluate mmNeRF using over 50K radar frames collected in six different indoor environments on a handheld rig equipped with a radar. Our results show that mmNeRF achieves median translation and rotation errors of 0.34m and 17.15° for single-spectrum 6DoF pose estimation and Absolute Trajectory Error (ATE) of 0.66m and 0.81 radians for continuous 6DoF pose tracking, considerably outperforming state-of-the-art solutions.

Abstract:
Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55m by 7m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8cm with respect to ground-truth.

Abstract:
Tactile sensing is a widely-studied means of implicit communication between robot and human. In this paper, we investigate how tactile sensing can help bridge differences between robotic embodiments in the context of collaborative manipulation. For a robot, learning and executing force-rich collaboration require compliance to human interaction. While compliance is often achieved with admittance control, many commercial robots lack the joint torque monitoring needed for such control. To address this challenge, we present an approach that uses tactile sensors and behavior cloning to transfer policies from robots with these capabilities to those without. We train a single policy that demonstrates positive transfer across embodiments, including robots without torque sensing. We demonstrate this positive transfer on four different tactile-enabled embodiments using the same policy trained on force-controlled robot data. Across multiple proposed metrics, the best performance came from a decomposed tactile shear-field representation combined with a pre-trained encoder, which improved success rates over alternative representations.

Abstract:
We propose PSALM-V, the first autonomous neuro-symbolic learning system able to induce symbolic action semantics (i.e., pre- and post-conditions) in visual environments through interaction. PSALM-V bootstraps reliable symbolic planning without expert action definitions, using LLMs to generate heuristic plans and candidate symbolic semantics. Previous work has explored using large language models to generate action semantics for Planning Domain Definition Language (PDDL)-based symbolic planners. However, these approaches have primarily focused on text-based domains or relied on unrealistic assumptions, such as access to a predefined problem file, full observability, or explicit error messages. By contrast, PSALM-V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations. The system iteratively generates and executes plans while maintaining a tree-structured belief over possible action semantics for each action, iteratively refining these beliefs until a goal state is reached. Simulated experiments of task completion in ALFRED demonstrate that PSALM-V increases the plan success rate from 37% (Claude-3.7) to 74% in partially observed setups. Results on two 2D game environments, RTFM and OverCooked-AI, show that PSALM-V improves step efficiency and succeeds in domain induction in multi-agent settings. PSALM-V correctly induces PDDL pre- and post-conditions for real-world robot BlocksWorld tasks, despite low-level manipulation failures from the robot. Videos and resources at https://psalmv.github.io/.

Abstract:
Reinforcement learning has facilitated agile locomotion in quadrupedal robots. However, most works remain highly dependent on the accuracy of simulation models in describing real-world robot dynamics. Consequently, policy transfer from simulation to hardware is still hindered by the well-known sim-to-real gap, which typically arises from modeling errors and the challenges of efficiently obtaining informative data in large state-action spaces. To address these challenges, this work proposes an innovative framework U2E that integrates Uncertainty-aware actuator modeling with an Uncertainty-guided Exploration policy. The actuator model leverages a deep ensemble of neural networks to provide both precise predictions and uncertainty estimates, allowing for the assessment of model confidence and the identification of regions with inadequate data coverage. The exploration strategy then actively guides data collection to autonomously acquire informative real-world samples and refine actuator models, thereby enhancing compensation for simulation discrepancies. Experiments on the quadrupedal locomotion tasks, including jumping and trajectory tracking, demonstrate that our approach reduces the sim-to-real gap and improves performance without the dependence on manually designed trajectories.
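
The idea of using ensemble disagreement as an uncertainty signal, and steering data collection toward where it is largest, can be sketched as follows. Plain Python callables stand in for the neural network ensemble described above; `ensemble_predict`, `most_uncertain`, and the toy models are hypothetical and are not taken from U2E.

```python
from statistics import fmean, pvariance


def ensemble_predict(models, x):
    """Return (mean, variance) of the ensemble members' predictions at x."""
    predictions = [model(x) for model in models]
    return fmean(predictions), pvariance(predictions)


def most_uncertain(models, candidates):
    """Pick the candidate input where the ensemble disagrees the most."""
    return max(candidates, key=lambda x: ensemble_predict(models, x)[1])


# Toy stand-ins for learned actuator models: they agree near zero command
# and diverge for large commands, mimicking poor data coverage there.
toy_models = [lambda u: 1.0 * u, lambda u: 1.5 * u, lambda u: 2.0 * u]

mean, variance = ensemble_predict(toy_models, 2.0)
print(mean, variance)
print(most_uncertain(toy_models, [0.0, 1.0, 3.0]))
```

In an exploration loop, the selected input would be executed on hardware, the new sample added to the training set, and the ensemble refit, so that the variance shrinks in regions that have been visited.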

Abstract:
Assembly action understanding is a key enabler for effective human-robot collaborative assembly, yet it remains challenging due to subtle motions and fine-grained hand-object interactions. We adapt vision-language models (VLMs) to this challenging domain with Compositional Context Fine-Tuning (CCFT), a method that decomposes assembly actions into semantic elements (Verb, Object, Tool) and fine-tunes VLMs to recognize each action element using templated question-answering pairs. This approach ensures near-deterministic outputs. To enable efficient and effective multi-task learning under limited data, a Layer-Partitioned Alternating Training (LP-AT) method is presented, which assigns distinct model layers to recognize specific action elements through element-specific low-rank adapters. LP-AT alternates weight updates across element-specific adapters, reducing cross-task interference while enabling per-adapter hyperparameter optimization. Furthermore, we create HA-ViD-VQA and IKEA-ASM-VQA datasets from existing assembly video datasets. Extensive experiments on these datasets demonstrate that our method consistently outperforms strong action recognition baselines while providing interpretable element-level predictions that can support diverse downstream applications. Code and dataset are released at https://github.com/x-labs-xyz/CCFT.

Abstract:
Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% in the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success. Project website: https://phys2real.github.io/.

Abstract:
Small-scale robots are rapidly advancing in diverse fields such as industry and medicine. To be effective, they must be capable of accessing narrow, tortuous, or otherwise hard-to-reach environments and performing precise manipulation. This paper presents a vision-based closed-loop motion control scheme for a developed fiber-driven continuum robot for cross-scale motion. Function-multiplexed optical fibers are employed to achieve macro motion through fiber actuation and micro motion through light transmission within the fibers. An external eye-to-hand camera system observes a fiducial tag to estimate its 3D pose relative to the camera frame. The coordinate transformation between the tag and the end-effector is calibrated, along with the mapping between input laser power and light-induced joint contractions. A two-stage image-based visual servoing strategy is then implemented to guide the tag toward the target image position, thereby realizing closed-loop hybrid macro-micro motion through the developed kinematics and visual feedback. Point-tracking experiments demonstrate that the small-scale continuum robot, with an outer diameter of approximately 1.2 mm, can achieve precise cross-scale motion across workspaces ranging from tens of microns to the millimeter scale under the proposed control scheme. This work highlights the potential of hybrid macro-micro motion with visual servoing for deep access and high-precision operation in endoluminal interventions.

Abstract:
Climbing robots face significant challenges when navigating unstructured environments, where reliable attachment to irregular surfaces is critical. We present a novel mobile climbing robot equipped with compliant pin-array structured grippers that passively conform to surface irregularities, ensuring stable ground gripping without the need for complicated sensing or control. Each pin features a vertically split design, combining an elastic element with a metal spine to enable mechanical interlocking with microscale surface features. Statistical modeling and experimental validation indicate that variability in individual pin forces and contact numbers are the primary sources of grasping uncertainty. The robot demonstrated robust and stable locomotion in indoor tests on inclined walls (10 to 30 degrees) and in outdoor tests on natural rocky terrain. This work highlights that a design emphasizing passive compliance and mechanical redundancy provides a practical and robust solution for real-world climbing robots while minimizing control complexity.

Abstract:
Robot person following (RPF) is a core capability in human-robot interaction, enabling robots to assist users in daily activities, collaborative work, and other service scenarios. However, achieving practical RPF remains challenging due to frequent occlusions, particularly in dynamic and crowded environments. Existing approaches often rely on fixed-point following or sparse candidate-point selection with oversimplified heuristics, which cannot adequately handle complex occlusions caused by moving obstacles such as pedestrians. To address these limitations, we propose an adaptive trajectory sampling method that generates dense candidate points within socially aware zones and evaluates them using a multi-objective cost function. Based on the optimal point, a person-following trajectory is estimated relative to the predicted motion of the target. We further design a prediction-aware model predictive path integral (MPPI) controller that simultaneously tracks this trajectory and proactively avoids collisions using predicted pedestrian motions. Extensive experiments show that our method outperforms state-of-the-art baselines in smoothness, safety, robustness, and human comfort, with its effectiveness further demonstrated on a mobile robot in real-world scenarios.
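
The core MPPI update, on which a prediction-aware controller of this kind would build, weights sampled control perturbations by a softmin of their rollout costs. The sketch below shows only that generic step for a scalar control sequence; the rollout costs are taken as given inputs, and the tracking and pedestrian-avoidance terms of the authors' cost function are not modeled. The function names and the `temperature` parameter are assumptions for illustration.

```python
import math


def mppi_weights(costs, temperature=1.0):
    """Softmin weights over sampled rollouts (lower cost, higher weight)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    best = min(costs)  # subtract the minimum for numerical stability
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mppi_update(nominal, perturbations, costs, temperature=1.0):
    """Shift the nominal control sequence by the weighted mean perturbation.

    nominal: list of controls over the horizon.
    perturbations: one noise sequence per rollout, same length as nominal.
    costs: total cost of each perturbed rollout.
    """
    weights = mppi_weights(costs, temperature)
    return [
        u + sum(w * noise[t] for w, noise in zip(weights, perturbations))
        for t, u in enumerate(nominal)
    ]


nominal = [0.0, 0.0, 0.0]
perturbations = [[0.5, 0.5, 0.5], [-0.5, -0.5, -0.5], [0.1, 0.0, -0.1]]
costs = [4.0, 1.0, 2.0]  # hypothetical rollout costs
print(mppi_update(nominal, perturbations, costs, temperature=0.5))
```

A lower temperature concentrates the weight on the cheapest rollouts, while a higher one averages more broadly across samples.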

Abstract:
Robust spatial understanding is crucial for Visual Question Answering (VQA) in autonomous driving that aims to enhance decision-making, reduce positional risks, and ensure road safety by providing answers based on the perception, prediction, and planning of driving scenarios. Despite remarkable success in semantic understanding of images and videos, existing Vision-Language Models (VLMs), as the prevailing paradigms for VQA, are limited in spatial understanding for multi-view scenes due to the lack of latent unified 3D reconstruction capability. They usually resort to additional spatial modalities such as point clouds or prior detection frameworks to enhance spatial understanding ability, but are still challenged by modality misalignment and degraded scalability. To overcome these limitations, in this paper, we propose a Geometrically-Guided Spatial Understanding Chain Framework (GSUC-VLM) for autonomous driving that leverages pretrained VLMs to jointly exploit semantic and spatial information in multi-view images. Specifically, we first design a dual-encoder architecture to fuse the semantic and spatial features separately extracted from multi-view images with a lightweight connector rather than introducing external spatial modalities. Subsequently, we align semantic and spatial features via a distillation loss to generate semantic tokens enriched with the spatial information at the latent layer. Furthermore, we develop a projective feature conditioning method that incorporates camera intrinsic and extrinsic parameters to embed projection matrix encoding into the input vectors and introduce 3D position embeddings into the fusion layer for capturing complex spatial relationships across multiple views in autonomous driving. Experimental results show that the proposed GSUC-VLM achieves state-of-the-art performance in VQA tasks while providing Chain-of-Thought (CoT) understanding. Remarkably, GSUC-VLM demonstrates strong generalization on zero-shot VQA tasks.

Abstract:
Robot simulation is a highly efficient approach for scaling data collection for robot learning, but scaling for most household tasks remains bottlenecked by a shortage of simulation-ready 3D assets. While modern robot simulators can model complex phenomena like temperature and fluids, most in-the-wild 3D models lack "simulation affordances" (specialized annotations such as fluid source and heat emitter positions) that are required for these features. As a result, costly manual annotation is required, severely limiting asset scale and variety. We introduce Simulation Affordance Grids (SAGrid), a method that automates the annotation of simulation affordances on in-the-wild 3D meshes. SAGrid leverages pretrained representations (DINOv2, TRELLIS) to predict a dense 3D distance field to the nearest affordance. Our approach operates effectively in a low-data regime, requiring as few as 10 training objects per affordance type to accurately locate these features. We validate our method by processing Objaverse-XL models and integrating them into the BEHAVIOR-1K simulator. Training robot policies on this automatically expanded asset suite significantly improves generalization to unseen objects in complex tasks, demonstrating that automated affordance annotation is crucial for scaling robot learning.

Abstract:
Soft robots are inherently compliant and safe, making them suitable for human-interactive applications such as surgery. However, their nonlinear and hysteretic behavior poses significant challenges for accurate modeling and control. We present a soft robotic system and propose a hysteresis-aware whole-body neural network model that accurately captures and predicts the soft robot's whole-body motion, including hysteresis effects. Based on this model, we construct a highly parallel simulation environment for soft robot control and apply an on-policy reinforcement learning algorithm to efficiently train whole-body motion control policies. The trained policy is deployed on a real soft robot to evaluate its control performance, and it exhibits high precision in trajectory tracking tasks. Furthermore, we develop a soft robotic system for surgical applications and validate it through phantom-based laser ablation experiments. The results demonstrate that the proposed model significantly reduces prediction error compared to conventional methods. The overall framework shows strong performance in phantom-based surgical experiments and demonstrates its potential for complex scenarios, including future real-world clinical applications.

Abstract:
Shaping thermoplastic sheets into three-dimensional products is challenging since overheating results in failed manufactured parts and wasted material. To this end, we propose an indirect data-driven predictive control approach using Model Predictive Control capable of handling temperature constraints and heating-power saturation while delivering enhanced precision and overshoot control compared to state-of-the-art methods. We employ a Non-linear Auto-Regressive with Exogenous inputs model, which is linearized to define a linear control-oriented model at each operating point. Using a high-fidelity simulator, several simulation studies have been conducted to evaluate the proposed method's robustness and performance under parametric uncertainty, indicating overshoot and average steady-state error less than 2 °C and 0.9 °C (7 °C and 2 °C) for the nominal (worst-case) scenario. Finally, we applied the proposed method to a lab-scale thermoforming platform, resulting in a close response to the simulation analysis with overshoot and average steady-state error metrics of 5.3 °C and 1 °C, respectively.

Abstract:
In this work, we study robotic manipulation of deformable objects based on demonstration-enhanced reinforcement learning (RL). We present FADERL (Fuzzy-Augmented Demonstration-Embedded Reinforcement Learning), a novel framework for robotic manipulation of deformable objects that significantly improves reinforcement learning efficiency through synergistic unification of High-Dimensional Takagi-Sugeno-Kang (HTSK) fuzzy systems, Generative Adversarial Behavior Cloning (GABC), and Conditional Policy Learning (CPL). Compared to the Rainbow-DDPG baseline, FADERL achieves 2.01× higher global average reward and reduces standard deviation to 45% while requiring fewer computational resources. To address the high cost of human demonstration collection, we introduce a Nonlinear Model Predictive Control (NMPC)-based data augmentation method that generates high-quality demonstrations at minimal cost. Simulation results demonstrate that NMPC-generated demonstrations enable FADERL to achieve performance comparable to human demonstrations. Physical experiments on fabric manipulation tasks (diagonal folding, central-axis folding, and flattening) achieve success rates of 83.3%, 80.0%, and 96.7%, respectively, validating our approach's effectiveness in real-world scenarios. Unlike computationally intensive large-model approaches, FADERL provides a lightweight, task-specific solution with efficient adaptability, making it suitable for practical robotic applications in manufacturing.

Abstract:
Achieving human-like dexterous robotic manipulation remains a central goal and a pivotal challenge in robotics. The development of Artificial Intelligence (AI) has allowed rapid progress in robotic manipulation. This survey summarizes the evolution of robotic manipulation from mechanical programming to embodied intelligence, alongside the transition from simple grippers to multi-fingered dexterous hands, outlining key characteristics and main challenges. Focusing on the current stage of embodied dexterous manipulation, we highlight recent advances in two critical areas: dexterous manipulation data collection (via simulation, human demonstrations, and teleoperation) and skill-learning frameworks (imitation and reinforcement learning). Then, based on the overview of the existing data collection paradigm and learning framework, three key challenges restricting the development of dexterous robotic manipulation are summarized and discussed.

Abstract:
This paper develops algorithms for planning time-optimal pick-and-place trajectories for multiple cylindrical containers filled with liquid and simultaneously transported by a robot. The considered trajectories comprise 3D translations combined with a 1D rotation about the vertical direction, i.e. SCARA motions. The presented approach minimizes the execution time, while ensuring that the liquid surface within each container remains below an imposed threshold throughout the motion. Two types of optimal trajectories are studied: one optimizes the motion law along a given path, the other optimizes both the path and the motion law. Extensive simulations identify the most efficient optimization setup, whereas experiments validate the approach. The data sets of all simulated and experimental motions are distributed through an external repository.

Abstract:
Colonoscopy is a medical procedure used to examine the inside of the colon for abnormalities, such as polyps or cancer. Traditionally, this is done by manually inserting a long, flexible tube called a colonoscope into the colon. However, this method can cause pain, discomfort, and even the risk of perforation. To address these shortcomings, advancements in technology are needed to develop safer, more intelligent colonoscopes. This paper presents the design, control, and evaluation of a self-growing soft robotic colonoscope, leveraging the eversion principle. The device features a tube with an 18 mm diameter, constructed from stretchable fabric, which grows 1.6 m at the tip under pressurization. A pneumatically driven, elastomer-based manipulator enables omni-directional steering over 180 degrees at the tip. An airtight base houses motors and spools that control the material and regulate growth speed. The robot operates in two modes: teleoperation via joysticks and autonomous navigation using sensor inputs, such as a tip-mounted camera. Thorough in-vitro experiments are conducted to assess the system's functionality and performance. Results illustrate that the robot can achieve locom

Abstract:
Autonomous navigation of magnetic microswarms in dynamic and unstructured environments is essential for biomedical applications, such as targeted therapy and minimally invasive interventions. However, existing path planning methods struggle to simultaneously achieve real-time adaptability and path smoothness in dynamic obstacle environments. To address this, we propose a hierarchical Dynamic Rapidly-exploring Random Tree Star (D-RRT*) path planning framework that integrates dynamic step size adjustment, local target selection, and local planning that considers microswarms' turning capabilities and energy optimization. Comparative simulations and experiments validate the effectiveness of the proposed planning framework, and results show that it can significantly improve the planning efficiency, path smoothness, and collision avoidance in complex dynamic scenarios.

Abstract:
While fully autonomous systems still face challenges due to patients' anatomical variability, teleoperated systems appear to be more practical in current healthcare settings. This paper presents an anatomy-aware control framework for teleoperated lung ultrasound. Leveraging biomechanically accurate 3D modelling, the system applies virtual constraints on the ultrasound probe pose and provides real-time visual feedback to assist in precise probe placement tasks. A twofold evaluation, one with 5 naïve operators on a single volunteer and the second with a single experienced operator on 6 volunteers, compared our method with a standard teleoperation baseline. The first evaluation characterised the accuracy of the anatomical model and the improved performance perceived by the naïve operators, while the second focused on the efficiency of the system in improving probe placement and reducing procedure time compared to traditional teleoperation. The results demonstrate that the proposed framework enhances the physician's capabilities in executing remote lung ultrasound, reducing execution time by more than 20% on 4-point acquisitions, towards faster, more objective and repeatable exams.

Abstract:
This paper proposes a novel anti-parallelogram mechanism (APM)-based cable-driven joint that achieves both the kinematic decoupling characteristic of rolling joints and the surface-contact robustness of link structures. Through the optimization of internal idlers, the cable path length remains nearly constant without complex gears or ligaments. Experimental validation using a 2-DOF prototype demonstrated negligible distal joint interference (0.0815 deg RMSE) during actuation. Furthermore, the system exhibited sub-millimeter positional repeatability (maximum RMSE of 0.2786 mm), establishing the proposed design as a robust, high-precision decoupling solution.

Abstract:
This work presents a hard-constrained optimal control based motion planner for robotic manipulators operating in cluttered industrial environments. The method targets object picking tasks where cycle time, accuracy of the final pose, and collision safety are critical. The planner formulates a time-optimal trajectory generation problem with explicit constraints on collision avoidance, robot kinematics, and the accuracy with which the end-effector reaches its grasp pose. Environments, grippers, and objects are modelled using capsules, cuboids, and planes - geometric primitives commonly available in industrial robotic software - allowing flexible reconfiguration of workcells without modifying the underlying problem transcription. Two complementary initialisation strategies are proposed to reduce the computational complexity of the non-linear, non-convex optimal control problem: a geometric best-guess initialisation and a near-optimal warm-starting approach that leverages previously computed trajectories during repeated task execution. Compared to cuRobo, the proposed CPU-only planner exhibits higher computational times but produces trajectories with 0.7x lower execution time and guarantees constraint satisfaction due to its hard-constrained formulation. The near-optimal initialisation method is shown to reduce computation times by up to 2.3x relative to the best-guess approach while simultaneously improving the success rates.

Abstract:
Zero-Shot Object Navigation (ZSON) requires agents to navigate to objects specified via open-ended natural language without predefined categories or prior environmental knowledge. While recent methods leverage foundation models or multi-modal maps, they often rely on 2D representations and greedy strategies or require additional training or modules with high computation load, limiting performance in complex environments and real applications. We propose WTRP-Searcher, a novel framework that formulates ZSON as a Weighted Traveling Repairman Problem (WTRP), minimizing the weighted waiting time of viewpoints. Using a Vision-Language Model (VLM), we score viewpoints based on object-description similarity, projected onto a 2D map with depth information. An open-vocabulary detector identifies targets, dynamically updating goals, while a 3D embedding feature map enhances spatial awareness and environmental recall. WTRP-Searcher outperforms existing methods, offering efficient global planning and improved performance in complex ZSON tasks. Code and demos will be available at https://github.com/lrm20011/WTRP_Searcher.
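
To make the weighted waiting-time objective concrete, the following sketch evaluates a generic Weighted Traveling Repairman cost, in which each viewpoint contributes its weight multiplied by the time at which it is reached. It assumes viewpoints on a 1D line, unit travel speed, and brute-force search over visiting orders; the names `weighted_latency` and `best_order` are hypothetical, and the paper's actual formulation and solver may differ.

```python
from itertools import permutations

def weighted_latency(order, positions, weights, start):
    """Sum of weight * arrival time over viewpoints visited in `order`."""
    elapsed, here, total = 0.0, start, 0.0
    for i in order:
        elapsed += abs(positions[i] - here)  # 1D travel time at unit speed
        here = positions[i]
        total += weights[i] * elapsed
    return total

def best_order(positions, weights, start):
    """Exhaustive search over visiting orders; only sensible for a few viewpoints."""
    return min(permutations(range(len(positions))),
               key=lambda o: weighted_latency(o, positions, weights, start))
```

Unlike a shortest-tour objective, this cost rewards visiting highly weighted viewpoints early, even when doing so lengthens the overall route.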

Abstract:
Recent robot task planners utilize large language models (LLMs) or vision-language models (VLMs) as failure detectors. These methods perform well by leveraging their semantic reasoning capabilities but often assume full understanding, which can lead to unreliable planning in complex scenes lacking explicit structural modeling. To address these limitations, we propose a novel multi-view scene understanding framework that explicitly models object-level relationships, enabling failure detection and effective task replanning. Our approach first captures multi-view images for comprehensive coverage and generates local 2D scene graphs encoding object identities and relational information. Building on this, we introduce a model based on a graph neural network that merges the local 2D scene graphs into a unified representation. The resulting unified scene graph is used to detect task success and identify failure causes. For each sub-task, our framework compares the unified scene graph with the expected scene graph predicted by the LLM during the task planning stage, identifying potential failure causes based on their deviations. These causes are then fed back into the LLM to facilitate effective replanning, thereby reducing repetitive failures and enhancing adaptability. We evaluate our framework on five real-world benchmark tasks to demonstrate its applicability. Separately, we compare failure detection and reasoning performance with other methods, showing the benefits of combining multi-view perception with explicit graph-based reasoning. More information can be found at https://sites.google.com/view/scrutinize-robot-manipulation

Abstract:
Robot-assisted minimally invasive surgery (RMIS) provides superior visualization, precision, and flexibility, and it has gained recognition as a technology that enhances therapeutic outcomes, particularly in tumor resection. However, this technology has a limitation in that it predominantly relies on visual feedback, making it challenging for surgeons to accurately detect the location and edges of tumors during surgery. To address this issue, robotic palpation methods have been actively studied. Among these, the sweeping palpation method has the advantage of rapidly exploring a broad region. Nevertheless, conventional sweeping palpation methods can only roughly identify the tumor's location and are limited in detecting tumor edges with precision. In this study, we introduce a novel sweeping palpation method to overcome the limitations of conventional sweeping palpation in RMIS and propose a precise tumor localization method based on this approach. The proposed method involves performing sweeping palpation on the tissue surface using the tip of the robotic end effector while utilizing a Laplacian edge detection algorithm to detect abrupt changes in contact force. This method reduces the reliance on preoperative imaging and enables tumor localization to be performed within a single robotic system. To validate the proposed tumor localization method, we conducted three phantom experiments and an ex vivo experiment. These validations demonstrate the potential of our proposed method to contribute to precise tumor resection and the establishment of effective treatment plans.
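
The idea of applying a Laplacian to contact-force readings can be sketched in one dimension as follows. This is a minimal illustration only: it assumes a single sweep line sampled at uniform spacing, a plain second-difference kernel, and a hand-picked threshold, none of which are specified in the abstract. The function names `force_laplacian` and `detect_force_edges` are hypothetical.

```python
def force_laplacian(force):
    """Discrete 1D Laplacian (second difference) with edge-replicated borders."""
    padded = [force[0]] + list(force) + [force[-1]]
    return [padded[i - 1] - 2 * padded[i] + padded[i + 1]
            for i in range(1, len(padded) - 1)]

def detect_force_edges(force, threshold):
    """Indices where the Laplacian magnitude exceeds `threshold`."""
    return [i for i, v in enumerate(force_laplacian(force)) if abs(v) > threshold]
```

A step in the force profile yields a positive and a negative Laplacian response on either side of the transition, so the flagged indices bracket the candidate edge. Real force data would likely need smoothing before this step.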

Abstract:
Suturing is a high-precision task performed at the end of procedures when surgeon fatigue may increase errors, highlighting the need for robot assistance. Previous autonomous suturing works, such as STITCH 1.0 [1], struggle to fully close wounds due to inaccurate needle tracking, thread tangling, and poor insertion placement. To address these challenges, we present STITCH 2.0, an improved augmented dexterity pipeline over STITCH 1.0 using the da Vinci Research Kit (dVRK) [2] with seven improvements including an improved EKF needle pose estimation pipeline, new thread untangling methods, and an automated 3D suture alignment algorithm. Experimental results over 15 trials (maximum 450 individual suture trials) find that STITCH 2.0 achieves 74.4% wound closure with an average of 4.87 sutures per trial, representing 66% more completed sutures in 38% less time compared to STITCH 1.0 [1]. When two human interventions are allowed, STITCH 2.0 averages six sutures with a 100% wound closure rate.

Abstract:
Human motion in human-robot interaction (HRI) is inherently uncertain, even when performing the same task repeatedly. This variability poses a significant challenge for prediction, as models must capture a distribution of plausible futures rather than a single deterministic trajectory. Traditional graph convolutional network based models, while effective at capturing spatial-temporal dependencies, are fundamentally limited by their deterministic nature and struggle to represent this inherent motion uncertainty. To address this, diffusion models have emerged as a powerful framework for modeling uncertainty. However, their direct application to HRI is hindered by two key limitations: they often prioritize motion diversity over prediction accuracy, potentially generating physically implausible results, and they fail to adequately model the complex, multi-scale spatial-temporal coupling between human and robot motions. To overcome these challenges, we propose HRI-DGDM, an HRI motion prediction framework based on a dual-graph guided diffusion model. Our method introduces a dual-graph structure, comprising a structural graph for kinematic priors and a collaboration graph learned from motion dynamics, to guide the denoising process with strong structural priors. A dedicated spatial-temporal denoising network (STDN) fuses multi-scale features from both graphs through adaptive fusion and hierarchical spatial-temporal modeling. Furthermore, a masking-based conditioning mechanism anchors the observed history during denoising, ensuring temporal consistency and preventing drift. Experiments on HRI scenarios demonstrate that HRI-DGDM outperforms baselines in prediction accuracy.

Abstract:
Deep Learning (DL) has become essential in various robotics applications due to excelling at processing raw sensory data to extract task-specific information from semantic objects. For example, vision-based object-relative navigation relies on a DL-based 6D object pose predictor to provide the relative pose between the object and the robot as measurements to the robot's state estimator. Accurately knowing the uncertainty inherent in such Deep Neural Network (DNN) based measurements is essential for probabilistic state estimators subsequently guiding the robot's tasks. Thus, in this letter, we show that we can extend any existing DL-based object-relative pose predictor for aleatoric uncertainty inference simply by including two multi-layer perceptrons detached from the translational and rotational part of the DL predictor. This allows for efficient training while freezing the existing pre-trained predictor. We then use the inferred 6D pose and its uncertainty as a measurement and corresponding noise covariance matrix in an extended Kalman filter (EKF). Our approach induces minimal computational overhead such that the state estimator can be deployed on edge devices while benefiting from the dynamically inferred measurement uncertainty. This increases the performance of the object-relative state estimation task compared to a fixed-covariance approach. We conduct evaluations on synthetic data and real-world data to underline the benefits of aleatoric uncertainty inference for the object-relative state estimation task.
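
The effect of feeding a per-measurement uncertainty into the filter can be seen in a scalar Kalman measurement update, sketched below. This is a deliberate simplification of the EKF described above: it assumes a one-dimensional state, the identity measurement model h(x) = x, and a variance `R` that in the paper's setting would come from the uncertainty heads. The function name `kalman_update` is hypothetical.

```python
def kalman_update(x, P, z, R):
    """Scalar measurement update with measurement model h(x) = x.

    x, P: prior state estimate and its variance
    z, R: measurement and its (here: externally inferred) noise variance
    """
    K = P / (P + R)          # gain shrinks as the measurement variance R grows
    x_new = x + K * (z - x)  # correct the state with the weighted innovation
    P_new = (1.0 - K) * P    # posterior variance
    return x_new, P_new
```

For the same prior and measurement, a larger inferred `R` pulls the estimate less strongly toward the measurement, which is the behaviour a fixed covariance cannot provide when measurement quality varies from frame to frame.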

Abstract:
The multi-agent path finding (MAPF) problem in warehouse automation consists of optimal task assignment and path planning, where small runtime is necessary. In this paper, we present a new MAPF algorithm that handles dynamic start and end positions of the robots, called Position-Selection Enhanced Conflict-Based Search (PS-ECBS). Conflict-Based Search (CBS) is a well-known framework that has been used to find collision-free paths for a given fixed task assignment, while ECBS is a bounded-suboptimal variant of CBS that uses focal search to speed up CBS. Mixed integer linear programming (MILP) is introduced to formulate the dynamic model for optimal task assignment, and the combination of MILP and ECBS results in the PS-ECBS algorithm. The solving process of PS-ECBS consists of multiple iterations, and in each iteration an additional constraint is added to modify the model. In the computational experiments, PS-ECBS allows the processes of picking up and putting back shelves in the warehouse to occur at the same time. We also analyze the iterative principle of PS-ECBS and compare its performance with that of ECBS-TA. The computational results demonstrate that PS-ECBS runs significantly faster and has an obvious advantage in jointly optimizing task assignment and path planning for large-scale warehouses.
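
The focal-search idea behind ECBS can be illustrated with a small node-selection routine. The sketch below is a generic focal selection, not the PS-ECBS implementation: it assumes each open node is a tuple of (f-value, number of conflicts, label), a suboptimality factor `w` of at least 1, and conflict count as the secondary heuristic. The name `focal_select` is hypothetical.

```python
def focal_select(open_list, w):
    """Pick the next node to expand under focal search.

    open_list: list of (f_value, num_conflicts, label) tuples
    w:         suboptimality factor, w >= 1
    """
    f_min = min(f for f, _, _ in open_list)
    # The focal list holds every node whose f-value is within a factor w of the best.
    focal = [n for n in open_list if n[0] <= w * f_min]
    # Secondary heuristic: fewest conflicts, ties broken by lower f-value.
    return min(focal, key=lambda n: (n[1], n[0]))
```

With `w = 1` this reduces to best-first selection on the f-value; larger values trade solution quality, bounded by the factor `w`, for fewer conflicts to resolve.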

Abstract:
Brain-Machine Interfaces (BMIs) provide a direct communication pathway between the brain and external devices, enabling humans to control assistive and robotic technologies, with potential applications in rehabilitation, human motor augmentation, and human-centered robotics. However, due to neural drift, the performance of BMIs decreases over time, posing challenges for long-term viability, particularly for invasive BMIs (iBMIs). Existing solutions suffer from two main drawbacks: (i) difficulty in learning robust neural representations, and (ii) neglecting that neural drift varies across motor parameters (e.g., velocity, direction, and speed). To overcome these limitations, we propose Self-Supervised Consistency enhanced Disentangled Learning (SSCDL), a neural decoding generalization framework built on two key innovations. We first design a backbone model named Consistency enhanced Neural Decoder (CND), using a novel teacher-student consistency constraint with simulated neural signal perturbations to learn robust representations invariant to neural drift. Then, we employ three dedicated CNDs under a Complementary-Disentangled Generalization (CDG) mechanism, which disentangles motor signals into velocity, direction, and speed, inspired by neural preference theory. This disentangled learning enables SSCDL to capture invariant neural representations from diverse neural preference perspectives, significantly enhancing cross-day generalization. Extensive experimental results show that SSCDL delivers state-of-the-art decoding performance, exhibiting high robustness and cross-day stability. These capabilities underscore its strong potential for long-term interaction in human-centric robotic and fine-grained assistive applications.

Abstract:
Autonomous field robots operating in unstructured environments require robust perception to ensure safe and reliable operations. Recent advances in monocular depth estimation have demonstrated the potential of low-cost cameras as depth sensors; however, their adoption in field robotics remains limited due to the absence of reliable scale cues, ambiguous or low-texture conditions, and the scarcity of large-scale datasets. To address these challenges, we propose a depth completion model that trains on synthetic data and uses extremely sparse measurements from depth sensors to predict dense metric depth in unseen field robotics environments. A synthetic dataset generation pipeline tailored to field robotics enables the creation of multiple realistic datasets for training purposes. This dataset generation approach utilizes textured 3D meshes from Structure from Motion and photorealistic rendering with novel viewpoint synthesis to simulate diverse field robotics scenarios. Our approach achieves an end-to-end latency of 53 ms per frame on an Nvidia Jetson AGX Orin, enabling real-time deployment on embedded platforms. Extensive evaluation demonstrates competitive performance across diverse real-world field robotics scenarios.

Abstract:
Object detection for unmanned aerial vehicles (UAVs) has become a research highlight at the intersection of computer vision and robotics, with increasingly widespread applications in security inspection, agricultural monitoring, disaster relief, and other areas. The key to autonomous UAV perception and decision-making lies in precise, real-time object detection. However, objects seen from a UAV's perspective are often small and densely distributed, which, coupled with limited onboard computing resources, poses significant challenges to traditional detection algorithms. To address these trade-offs, this paper proposes LH-DETR, a lightweight hybrid architecture for end-to-end object detection built on three specialized innovations. We first introduce the Wavelet-Mamba Hybrid Block (WMHB), a novel backbone component that synergistically combines the linear complexity of the Mamba state-space model for capturing long-range dependencies with the multi-scale feature extraction capabilities of wavelet transforms. To better identify small objects, a Frequency-Aware Dynamic FFN (FAD-FFN) is designed to selectively amplify critical high-frequency components, such as edges and textures, by analyzing features in the frequency domain. Additionally, AutoSliding Varifocal Loss (ASVLoss), an adaptive loss function that dynamically shifts its focus from medium-quality to high-quality predictions as training progresses, is defined to stabilize the model's optimization. Experiments on public aerial datasets demonstrate that LH-DETR achieves an outstanding balance between accuracy and speed, significantly improving detection performance for small objects while greatly reducing the computational complexity.

Abstract:
Ensuring safety and motion consistency for robot navigation in occluded, obstacle-dense environments is a critical challenge. In this context, this study presents an occlusion-aware Consistent Model Predictive Control (CMPC) strategy. To account for the occluded obstacles, it incorporates adjustable risk regions that represent their potential future locations. Subsequently, dynamic risk boundary constraints are developed online to ensure safety. The CMPC then constructs multiple locally optimal trajectory branches (each tailored to different risk regions) to strike a balance between safety and performance. A shared consensus segment is generated to ensure smooth transitions between branches without significant velocity fluctuations, further preserving motion consistency. To facilitate high computational efficiency and ensure coordination across local trajectories, we use the alternating direction method of multipliers (ADMM) to decompose the CMPC into manageable sub-problems for parallel solving. The proposed strategy is validated through simulations and real-world experiments on an Ackermann-steering robot platform. The results demonstrate the effectiveness of the proposed CMPC strategy through comparisons with baseline approaches in occluded, obstacle-dense environments.

Abstract:
In partially known environments, robots must combine exploration to gather information with task planning for efficient execution. To address this challenge, we propose EPoG, an Exploration-based sequential manipulation Planning framework on Graph-based representations. EPoG integrates a graph-based global planner with a Large Language Model (LLM)-based situated local planner, continuously updating a belief graph using observations and the LLM predictions to represent known and unknown objects. Action sequences are generated by computing graph edit operations between the goal and belief graphs, ordered by temporal dependencies and movement costs. This approach seamlessly combines exploration and sequential manipulation planning. In ablation studies across 46 realistic household scenes and 5 long-horizon daily object transportation tasks, EPoG achieved a success rate of 91.3%, reducing travel distance by 36.1% on average. Furthermore, a physical mobile manipulator successfully executed complex tasks in unknown and dynamic environments, demonstrating EPoG's potential for real-world applications.

Abstract:
The emerging integration of robots into everyday life brings several major challenges. Compared to classical industrial applications, more flexibility is needed in combination with real-time reactivity. Learning-based methods can train powerful policies based on demonstrated trajectories, such that the robot generalizes a task to similar situations. However, these black-box models lack interpretability and rigorous safety guarantees. Optimization-based methods provide these guarantees but lack the required flexibility and generalization capabilities. This work proposes SafeFlowMPC, a combination of flow matching and online optimization to combine the strengths of learning and optimization. This method guarantees safety at all times and is designed to meet the demands of real-time execution by using a suboptimal model-predictive control formulation. SafeFlowMPC achieves strong performance in three real-world experiments on a KUKA 7-DoF manipulator, namely two grasping experiments and a dynamic human-robot object handover experiment. A video of the experiments is available at https://ww.acin.tuwien.ac.at/en/42d6. The code is available at https://github.com/TU-Wien-ACIN-CDS/SafeFlowMPC.

Abstract:
This work introduces a novel method for surface normal estimation from rectified stereo image pairs, leveraging affine transformations derived from disparity values to achieve fast and accurate results. We demonstrate how the rectification of stereo image pairs simplifies the process of surface normal estimation by reducing computational complexity. To address noise reduction, we develop a custom algorithm inspired by convolutional operations, tailored to process disparity data efficiently. We also introduce adaptive heuristic techniques for efficiently detecting connected surface components within the images, further improving the robustness of the method. By integrating these methods, we construct a surface normal estimator that is both fast and accurate, producing a dense, oriented point cloud as the final output. Our method is validated using both simulated environments and real-world stereo images from the Middlebury (for which disparity values are published) and Cityscapes datasets, demonstrating significant improvements in real-time performance and accuracy when implemented on a GPU. The source code is available at https://github.com/mrafifaisal/Surface-Normal-Estimation/.

Abstract:
Robots are increasingly used in diverse application areas, where autonomous navigation plays a central role. As these systems become more widespread, improving their energy efficiency is critical to extending operational time and reducing environmental impact. The Robot Operating System (ROS) is a widely adopted middleware for robotics, offering a rich set of configurable packages. However, this flexibility can result in suboptimal software configurations in dynamic environments, negatively affecting both performance and energy consumption. This paper investigates the impact of ROS 2 package re-configurations on the energy efficiency of mobile robot navigation. We conduct a controlled experiment in two warehouse-like scenarios (small and large) with varying obstacle layouts and Costmap 2D configurations (essential to the Nav2 stack). Through repeated trials, we measure energy usage, power profile, CPU load, memory consumption, and navigation performance. Results show that configurations must be carefully chosen for the specific robotic environment, and we were able to identify critical settings that lead to good and poor performance and energy consumption.

Abstract:
Achieving human-like dexterous manipulation through the collaboration of multi-fingered hands with robotic arms remains a longstanding challenge in robotics, primarily due to the scarcity of high-quality demonstrations and the complexity of high-dimensional action spaces. To address these challenges, we propose FAR-Dex, a hierarchical framework that integrates few-shot data augmentation with adaptive residual refinement to enable robust and precise arm-hand coordination in dexterous tasks. First, FAR-DexGen leverages the IsaacLab simulator to generate diverse and physically constrained trajectories from a few demonstrations, providing a data foundation for policy training. Second, FAR-DexRes introduces an adaptive residual module that refines policies by combining multi-step trajectory segments with observation features, thereby enhancing accuracy and robustness in manipulation scenarios. Experiments in both simulation and the real world demonstrate that FAR-Dex improves data quality by 13.4% and task success rates by 7% over state-of-the-art methods. It further achieves over 80% success in real-world tasks, enabling fine-grained dexterous manipulation with strong positional generalization.

Abstract:
With the rapid progress of foundation models and robotics, vision-language navigation (VLN) has emerged as a key task for embodied agents with broad practical applications. We address VLN in continuous environments, a particularly challenging setting where an agent must jointly interpret natural language instructions, perceive its surroundings, and plan low-level actions. We propose a zero-shot framework that integrates a simplified yet effective waypoint predictor with a multimodal large language model (MLLM). The predictor operates on an abstract obstacle map, producing linearly reachable waypoints, which are incorporated into a dynamically updated topological graph with explicit visitation records. The graph and visitation information are encoded into the prompt, enabling reasoning over both spatial structure and exploration history to encourage exploration and equip MLLM with local path planning for error correction. Extensive experiments on R2R-CE and RxR-CE show that our method achieves state-of-the-art zero-shot performance, with success rates of 41% and 36%, respectively, outperforming prior state-of-the-art methods. The source code is available at: https://github.com/michigan-traffic-lab/OMAP-VLN.

Abstract:
Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging. This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame). The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data.

Abstract:
Forward-looking sonar is essential for underwater perception, especially in turbid waters, yet its images are often strongly degraded by various types of noise, including speckle, sidelobe, and structural noise, which severely hinder downstream tasks such as underwater reconstruction, positioning, and navigation. Most conventional sonar denoising methods reduce the noise at the expense of lost fine image features or blurred images, while modern supervised learning methods demand large paired datasets that are impractical to obtain in real underwater conditions. In this paper, we propose SonarGAN, a progressive Generative Adversarial Network (GAN)-based framework that denoises sonar images under multiple noise types in one go. Unlike traditional supervised methods, SonarGAN avoids the need for costly paired datasets by combining unpaired real and simulated images, synthetic noisy-clean pairs, and joint refinement for comprehensive denoising. Extensive experiments across multiple types of sonar and underwater environments demonstrate the effectiveness of SonarGAN and its generalization in real-world conditions.

Abstract:
Safety is a fundamental requirement for autonomous systems operating in critical domains. Control barrier functions (CBFs) have been used to design safety filters that minimally alter nominal controls for such systems to maintain their safety. Learning neural CBFs has been proposed as a data-driven alternative to their computationally expensive optimization-based synthesis. However, it is often the case that the failure set of states that should be avoided is non-obvious or hard to specify formally, e.g., tailgating in autonomous driving, while a set of expert demonstrations that achieve the task and avoid the failure set is easier to generate. We use inverse constraint learning (ICL) to train a constraint function that classifies the states of the system under consideration as safe, i.e., belonging to a controlled forward invariant set that is disjoint from the unspecified failure set, or unsafe, i.e., belonging to the complement of that set. We then use that function to label a new set of simulated trajectories to train our neural CBF. We empirically evaluate our approach in four different environments, demonstrating that it outperforms existing baselines and achieves comparable performance to a neural CBF trained with the same data but annotated with ground-truth safety labels.

Abstract:
Playful deception, a common feature in human social interactions, remains underexplored in Human-Robot Interaction (HRI). Inspired by the Turkish Ice Cream (TIC) vendor routine, we investigate how bounded, culturally familiar forms of deception influence user trust, enjoyment, engagement, and willingness-to-pay during robotic handovers. We design a robotic manipulator equipped with a custom end-effector and implement five TIC-inspired trick policies that deceptively delay the handover of an ice cream-shaped object. Through a mixed-design user study with 91 participants, we evaluate the effects of playful deception and interaction duration on user experience. Results reveal that TIC-inspired deception significantly enhances enjoyment and engagement, though reduces perceived safety and trust, suggesting a structured trade-off across the multi-dimensional aspects. Our findings demonstrate that playful deception can be a valuable design strategy for interactive robots in entertainment and engagement-focused contexts, while underscoring the importance of deliberate consideration of its complex trade-offs. Videos and user study snapshots are available on https://hyeonseong-kim98.github.io/turkish-ice-cream-robot/

Abstract:
Vision-based policies are widely applied in robotics for tasks such as manipulation and locomotion. On lightweight mobile robots, however, they face a trilemma of limited scene transferability, restricted onboard computation resources, and sensor hardware cost. To address these issues, we propose a knowledge distillation approach that transfers knowledge from an information-rich, appearance-invariant omni-view depth policy to a lightweight monocular policy. The key idea is to train the student not only to mimic the expert's actions but also to align with the latent embeddings of the omni-view depth teacher. Experiments demonstrate that omni-view and depth inputs improve the scene transfer and navigation performance, and that the proposed distillation method enhances the performance of a single-view monocular policy, compared with policies solely imitating actions. Real-world experiments further validate the effectiveness and practicality of our approach. Code will be released publicly.

Abstract:
Autonomous Mobile Robots are typically limited to structured environments, as conventional wheeled propulsion often fails on deformable terrains like sand due to excessive wheel slip and sinkage. To address this mobility challenge, this paper introduces a novel locomotion strategy for a high-degree-of-freedom wheel-legged robot. The proposed method is a gait based on an asymmetric "dynamic compact-and-push" cycle, where the robot's limbs perform a paddling-like motion to actively remodel the granular media. This active terrain remodeling allows the robot to generate net forward thrust where conventional wheeled locomotion is ineffective. We systematically designed and experimentally validated four distinct gaits founded on this principle. The results demonstrate that this approach enables sustained forward motion in an environment where wheeled propulsion is verified to fail, with the asynchronous paddling gait proving most effective. This work contributes a new, validated locomotion mechanism for sand terrains and provides a quantitative comparison of different limb coordination strategies.

Abstract:
Tightly coupled LiDAR-inertial odometry (LIO) systems are critical for autonomous navigation, yet their performance often degrades due to insufficient adaptability to diverse environments and limitations in map representation. To address these limitations, this paper presents GMM-LIO, a robust and adaptive LIO framework that integrates a novel information-theoretic scan processing module and a high-fidelity Gaussian Mixture Model (GMM) voxel map structure. At its core, GMM-LIO features a two-level adaptive front-end that dynamically modulates voxel resolution based on state uncertainty and adjusts surface covariance estimation according to local point density on a standard voxel grid. Furthermore, GMM-LIO employs a dynamic Gaussian Mixture Model voxel map to accurately model intersecting surfaces. The entire system is formulated as a robust Maximum a Posteriori (MAP)-based estimator, which employs an Iteratively Reweighted Least Squares (IRLS) solver together with a principled anisotropic information matrix to handle measurement outliers. Extensive evaluations on diverse public and self-collected datasets demonstrate that GMM-LIO achieves state-of-the-art accuracy and robustness, with a 36% relative improvement over leading LIO baselines.

Abstract:
Learning from demonstrations has emerged as a promising paradigm for end-to-end robot control, particularly when scaled to diverse and large datasets. However, the quality of demonstration data, often collected through human teleoperation, remains a critical bottleneck for effective data-driven robot learning. Human errors, operational constraints, and teleoperator variability introduce noise and suboptimal behaviors, making data curation essential yet largely manual and heuristic-driven. In this work, we propose Quality over Quantity (QoQ), a grounded and systematic approach to identifying high-quality data by defining data quality as the contribution of each training sample to reducing loss on validation demonstrations. To efficiently estimate this contribution, we leverage influence functions, which quantify the impact of individual training samples on model performance. We further introduce two key techniques to adapt influence functions for robot demonstrations: (i) using maximum influence across validation samples to capture the most relevant state-action pairs, and (ii) aggregating influence scores of state-action pairs within the same trajectory to reduce noise and improve data coverage. Experiments in both simulated and real-world settings show that QoQ consistently improves policy performances over prior data selection methods.
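
The two adaptations named in the abstract above can be illustrated with a minimal sketch. It assumes the per-pair influence values have already been computed by some influence-function estimator (not shown), that a larger value means a training pair helps reduce validation loss more, and that the within-trajectory aggregation is a mean; the abstract does not specify the aggregation, so that choice is an assumption. The names `trajectory_scores` and `select_top_k` are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def trajectory_scores(influence, traj_ids):
    """Score each trajectory from per-pair influence values.

    influence[i][j]: influence of training state-action pair i on
                     validation sample j (assumed precomputed elsewhere).
    traj_ids[i]:     identifier of the trajectory that pair i belongs to.
    """
    per_traj = defaultdict(list)
    for row, tid in zip(influence, traj_ids):
        # (i) keep the maximum influence across validation samples
        per_traj[tid].append(max(row))
    # (ii) aggregate pair scores within a trajectory (mean is an assumption)
    return {tid: sum(vals) / len(vals) for tid, vals in per_traj.items()}


def select_top_k(scores, k):
    """Return the k trajectory ids with the highest aggregated score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: four state-action pairs, two validation samples, two trajectories.
influence = [[0.1, 0.9], [0.2, 0.3], [-0.5, 0.0], [0.4, 0.1]]
traj_ids = ["a", "a", "b", "b"]
scores = trajectory_scores(influence, traj_ids)
best = select_top_k(scores, 1)
```

In this toy example trajectory "a" receives the higher score, so it is the one kept when `k = 1`.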

Abstract:
Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M2GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy leverages observation history to capture long-horizon temporal dependencies and exploits attention-based relational features to encode inter-agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability, the group-relative advantages are obtained by normalizing rewards across agents within each episode and optimized through a multi-agent extension of GRPO, significantly reducing the demand for training resources while enabling stable and scalable policy updates. Extensive simulations and real-world pool experiments across team scales and evader strategies demonstrate that M2GRPO consistently outperforms MAPPO and recurrent baselines in both pursuit success rate and capture efficiency. Overall, the proposed framework provides a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.
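
The group-relative advantage described above, obtained by normalizing rewards across agents within an episode, can be sketched as follows. The abstract does not give the exact formula; this sketch assumes the usual GRPO-style standardization (subtract the group mean, divide by the group standard deviation), and the function name and the `eps` stabilizer are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize one episode's per-agent rewards against the group.

    rewards: one scalar reward per agent for a single episode.
    eps:     small constant avoiding division by zero when all agents
             receive the same reward (an assumed implementation detail).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three pursuers in one episode: the best-rewarded agent gets a positive
# advantage, the worst a negative one, and the advantages sum to zero.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because the baseline is the group mean rather than a learned value function, no critic network is needed for this step, which is consistent with the reduced training-resource demand the abstract mentions.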

Abstract:
This paper presents an innovative and practical method for robotic needle steering in radio-frequency ablation (RFA) to treat cancer. One of the main challenges in this process is that tissue shifts and deforms during needle insertion, making it difficult to accurately predict the needle's path in real time. Inverse finite element (iFE) simulations have been used to address this problem. While these methods are accurate, they often require further refinement for effective time performance in real-world robotic systems. This is because when the method is incorporated into a real robot, there can be a delay in command execution. To address this challenge, we propose a machine learning-based solution that learns from offline simulations, shifting the intensive calculations required by iFE methods to an offline training stage and enabling online prediction of tissue deformation with reduced computational time. Our network was trained on data from numerous simulated needle insertions to capture interactions among insertion forces, tissue properties, and resulting motion. Once trained, the model produces predictions almost instantaneously, making it suitable for real-time applications. We validated the approach by steering the needle in a simulated deformable, moving gel to compare it with numerical-based methods, and then performing needle steering within a reconstructed human body that involves multiple structures and integrates the robot's dynamics. The results demonstrated that the developed networks achieved slightly better accuracy in the first scenario while also running faster, resulting in improved performance under the robot's dynamics. These findings show that our method is a promising advancement toward real-time guidance systems for needle-based medical procedures.

Abstract:
While scaling laws for imitation learning have primarily focused on generalization in open-world settings, the relationship between data and precision in closed-world tasks like robotic assembly remains largely unexplored. This paper systematically investigates this relationship and introduces a novel scaling law. We find that, to achieve a fixed success rate, the required number of demonstrations N grows super-exponentially as the target precision P approaches a limit c. This relationship is accurately captured by the model log(N) ∝ 1/(P - c). Crucially, we reveal that the limit precision c is not a static physical constant of the task but an emergent property of the entire agent system, including its sensors and expert policy. Through experiments on canonical manipulation tasks, we validate this law and demonstrate that improving system components, such as adding a wrist camera or using a more effective expert, measurably lowers c, thus expanding the system's achievable precision. Our work provides a new theoretical framework for precision in robotics and a quantitative metric to evaluate system capabilities. Furthermore, these findings provide a practical methodology for guiding the development and debugging of high-precision manipulation systems.
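
To make the shape of this scaling law concrete, the sketch below evaluates N = exp(k / (P - c)). It assumes that precision P is expressed as a tolerance that stays above the limit c (so tighter targets mean smaller P), and that the proportionality is written with a single constant k; the abstract gives neither k nor c, so the numbers used here are purely illustrative and `demos_required` is a hypothetical name.

```python
import math


def demos_required(precision, c, k):
    """Demonstrations needed under the assumed law log(N) = k / (P - c).

    precision: target precision P (assumed to be a tolerance, P > c).
    c:         limit precision of the agent system.
    k:         proportionality constant (hypothetical, would be fitted).
    """
    if precision <= c:
        raise ValueError("target precision must stay above the limit c")
    return math.exp(k / (precision - c))


# Illustrative values only: tightening the target toward c = 1.0
# increases the required number of demonstrations very quickly.
targets = [3.0, 2.0, 1.5, 1.25]
needed = [demos_required(p, c=1.0, k=3.0) for p in targets]
```

Under the same assumptions, lowering c (for example from 1.0 to 0.5 at a fixed target) reduces the required N, which mirrors the abstract's point that better sensors or experts expand the achievable precision.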

Abstract:
Place recognition is crucial for loop closure detection and global localization in robotics. Although mainstream algorithms typically rely on cameras and LiDAR, these sensors are susceptible to adverse weather conditions. Fortunately, the recently developed 4D millimeter-wave radar (4D radar) offers a promising solution for all-weather place recognition. However, the inherent noise and sparsity in 4D radar data significantly limit its performance. Thus, in this paper, we propose a novel framework called 4DRaL that leverages knowledge distillation (KD) to enhance the place recognition performance of 4D radar. Its core is to adopt a high-performance LiDAR-to-LiDAR (L2L) place recognition model as a teacher to guide the training of a 4D radar-to-4D radar (R2R) place recognition model. 4DRaL comprises three key KD modules: a local image enhancement module to handle the sparsity of raw 4D radar points, a feature distribution distillation module that ensures the student model generates more discriminative features, and a response distillation module to maintain consistency in feature space between the teacher and student models. More importantly, 4DRaL can also be trained for 4D radar-to-LiDAR (R2L) place recognition through different module configurations. Experimental results prove that 4DRaL achieves state-of-the-art performance in both R2R and R2L tasks regardless of normal or adverse weather.

Abstract:
We present the iMETRO Dynamic Simulation, the first open-source dynamic simulation environment for research in the use of robot manipulators inside space vehicles for maintenance and logistics tasks, or intravehicular robotics (IVR). IVR has great potential to facilitate science and exploration on the Moon by saving crew time, but there are limited open-source resources that would enable researchers to identify the next set of challenges in manipulation for IVR. We provide a full-featured, high-fidelity dynamic simulation of the real-world iMETRO IVR test facility, which includes mockups representative of the interior of a future space vehicle as well as an 8-DoF manipulator that serves as an example robot platform for this research. Our modular simulator enables new software, hardware, and operational paradigms to be tested in a reconfigurable mockup environment. To improve the accessibility and extensibility of this simulation environment, we also provide ROS 2 hardware control interfaces to MuJoCo as well as a model conversion tool such that the same models may be used with ROS 2 and MuJoCo. To evaluate the sim-to-real transfer capabilities of this simulation, we present an open-source example application demonstration developed in the simulation and transfer it to the real-world iMETRO facility in less than a day. Finally, we identify the challenges and opportunities in modeling a real-world facility to aid future simulation efforts. The open-source simulation and application can be found at https://github.com/NASA-JSC-Robotics. The MuJoCo and ROS 2 integration tools have migrated to the ros-controls organization and can be found at https://github.com/ros-controls/mujoco_ros2_control.

Abstract:
In biomedical engineering, robotic implants have shown new methods to restore and improve bodily function and regenerate tissue. A significant challenge with the design of these devices is to safely actuate them for weeks or months while they reside in a patient's body. The application of a rotating magnetic field offers a solution to remotely transfer torque. However, this method will cause a net torque on the body within the field, which will cause rotational motion of the implant. Here we present a wirelessly driven magnetic motor that can be driven with an external magnetic field, generated by one or more electromagnetic coils, to control a robotic implant. Due to the magnetic torque canceling mechanism, this wireless motor is actuatable with a single coil and produces no net torque on the entire body. When physically tested, the motor was able to produce around 0.5 mNm of torque, which is comparable to conventional ungeared motors of the same size. The motor was demonstrated in a robotic implant and successfully applied force to stretch a porcine esophagus.

Abstract:
Carotid ultrasound is crucial for the assessment of cerebrovascular health, particularly the internal carotid artery (ICA). While previous research has explored automating carotid ultrasound, none has tackled the challenging ICA. This is primarily due to its deep location, tortuous course, and significant individual variations, which greatly increase scanning complexity. To address this, we propose a Hierarchical Transformer-based decision architecture, namely UltraHiT, that integrates high-level variation assessment with low-level action decision. Our motivation stems from conceptualizing individual vascular structures as morphological variations derived from a standard vascular model. The high-level module identifies variation and switches between two low-level modules: an adaptive corrector for variations, or a standard executor for normal cases. Specifically, both the high-level module and the adaptive corrector are implemented as causal transformers that generate predictions based on the historical scanning sequence. To ensure generalizability, we collected the first large-scale ICA scanning dataset comprising 164 trajectories and 72K samples from 28 subjects of both genders. Based on the above innovations, our approach achieves a 95% success rate in locating the ICA on unseen individuals, outperforming baselines and demonstrating its effectiveness. Project website: https://ultrahit-thu.github.io/UltraHiT/.

Abstract:
Rapid deployment of new tactile sensors is essential for scalable robotic manipulation, especially in multi-fingered hands equipped with vision-based tactile sensors. However, current methods for inferring contact properties rely heavily on convolutional neural networks (CNNs), which, while effective on known sensors, require large, sensor-specific datasets. Furthermore, they require retraining for each new sensor due to differences in lens properties, illumination, and sensor wear. Here we introduce TacViT, a novel tactile perception model based on Vision Transformers, designed to generalize to new sensor data. TacViT leverages global self-attention mechanisms to extract robust features from tactile images, enabling accurate contact property inference even on previously unseen sensors. This capability significantly reduces the need for data collection and retraining, accelerating the deployment of new sensors. We evaluate TacViT on sensors for a five-fingered robot hand and demonstrate its superior generalization performance compared to CNNs. Our results highlight TacViT's potential to make tactile sensing more scalable and practical for real-world robotic applications.

Abstract:
Robots interacting with humans must not only generate learned movements in real-time, but also infer the intent behind observed behaviors and estimate the confidence of their own inferences. This paper proposes a unified model that achieves all three capabilities within a single hierarchical predictive-coding recurrent neural network equipped with a class embedding vector, CERNet, which leverages a dynamically updated class embedding vector to unify motor generation and recognition. The model operates in two modes: generation and inference. In the generation mode, the class embedding constrains the hidden state dynamics to a class-specific subspace; in the inference mode, it is optimized online to minimize prediction error, enabling real-time recognition. Validated on a humanoid robot across 26 kinesthetically taught alphabets, our hierarchical model achieves 76% lower trajectory reproduction error than a parameter-matched single-layer baseline, maintains motion fidelity under external perturbations, and infers the demonstrated trajectory class online with 68% Top-1 and 81% Top-2 accuracy. Furthermore, internal prediction errors naturally reflect the model's confidence in its recognition. This integration of robust generation, real-time recognition, and intrinsic uncertainty estimation within a single neural network framework offers a compact and extensible approach to motor memory in physical robots, with potential applications in intent-sensitive human-robot collaboration.

Abstract:
Harvesting, gripping, and handling of fruit and vegetables require end-effectors that ensure grip stability while, at the same time, minimizing surface damage and bruising. Conventional rigid or partially compliant solutions often generate localized load concentrations and require high positioning accuracy, limiting their effectiveness in unstructured environments. This work presents a novel three-finger gripper with a self-centering closing mechanism and soft fingers. The system operates in two stages: first, the fingers slide (FS) along linear guides driven by a dedicated motor to adapt to the object size; second, a separate motor actuates the finger closure to establish the grasp. The finger design is inspired by the handed-shearing auxetic (HSA) actuator, enabling controlled pre-shaping (PS) by adapting to the object geometry. The proposed design was first validated through finite element simulations, comparing pre-shaping (PS) against passive compliance (PC) under matched load conditions. Results demonstrate that PS significantly improves pressure uniformity and grasp stability. A fully functional prototype was then fabricated via additive manufacturing.

Abstract:
While Model Predictive Control (MPC) delivers strong performance across robotics applications, solving the underlying (batches of) nonlinear trajectory optimization (TO) problems online remains computationally demanding. Existing GPU-accelerated approaches either parallelize single solves, handle large batches at sub-real-time rates, or sacrifice model generality for speed. This leaves a large gap in solver performance for many state-of-the-art MPC applications that require real-time batches of tens to low-hundreds of solves. As such, we present GATO, an open source, GPU-accelerated, batched TO solver co-designed across algorithm, software, and computational hardware to deliver real-time throughput for these moderate batch size regimes. Our approach leverages a combination of block-, warp-, and thread-level parallelism within and across solves for ultra-high performance. We demonstrate the effectiveness of our approach through a combination of: simulated benchmarks showing speedups of 18-21x over CPU baselines and 1.4-16x over GPU baselines as batch size increases; case studies highlighting improved disturbance rejection and convergence behavior; and finally a validation on hardware using an industrial manipulator. We open source GATO to support reproducibility and adoption.

Abstract:
The Flexible Job Shop Scheduling Problem (FJSP) originates from real production lines, yet some practical constraints are often ignored or idealized in current FJSP studies, among which the limited buffer problem has a particular impact on production efficiency. To this end, we study an extended problem that is closer to practical scenarios: the Flexible Job Shop Scheduling Problem with Limited Buffers and Material Kitting. In recent years, deep reinforcement learning (DRL) has demonstrated considerable potential in scheduling tasks. However, its capacity for state modeling remains limited when handling complex dependencies and long-term constraints. To address this, we leverage a heterogeneous graph network within the DRL framework to model the global state. By constructing efficient message passing among machines, operations, and buffers, the network focuses on avoiding decisions that may cause frequent pallet changes during long-sequence scheduling, thereby helping improve buffer utilization and overall decision quality. Experimental results on both synthetic and real production line datasets show that the proposed method outperforms traditional heuristics and advanced DRL methods in terms of makespan and pallet changes, and also achieves a good balance between solution quality and computational cost. Furthermore, a supplementary video is provided to showcase a simulation system that effectively visualizes the progression of the production line.

Abstract:
Autonomous robotic systems should reason about resource control and its impact on subsequent maneuvers, especially when operating with limited energy budgets or restricted sensing. Learning-based control is effective in handling complex dynamics and represents the problem as a hybrid action space unifying discrete resource usage and continuous maneuvers. However, prior works on hybrid action space have not sufficiently captured the causal dependencies between resource usage and maneuvers. They have also overlooked the multi-modal nature of tactical decisions, both of which are critical in fast-evolving scenarios. In this paper, we propose TART, a Temporal Action Representation learning framework for Tactical resource control and subsequent maneuver generation. TART leverages contrastive learning based on a mutual information objective, designed to capture inherent temporal dependencies in resource-maneuver interactions. These learned representations are quantized into discrete codebook entries that condition the policy, capturing recurring tactical patterns and enabling multi-modal and temporally coherent behaviors. We evaluate TART in two domains where resource deployment is critical: (i) a maze navigation task where a limited budget of discrete actions provides enhanced mobility, and (ii) a high-fidelity air combat simulator in which an F-16 agent operates weapons and defensive systems in coordination with flight maneuvers. Across both domains, TART consistently outperforms hybrid-action baselines, demonstrating its effectiveness in leveraging limited resources and producing context-aware subsequent maneuvers.

Abstract:
This paper presents a framework for multi-session mapping of underwater environments utilizing an affordable action camera. The Visual-Inertial data are augmented by water depth recordings from a dive computer. SVIn2, an open-source VI-SLAM framework, is utilized to generate a trajectory and a sparse reconstruction for each session. Utilizing the keyframes extracted from SVIn2, and the estimated camera poses, a Structure-from-Motion (SfM) framework -- COLMAP -- is employed for global optimization and to produce a dense reconstruction of the target environment. The presence of calibration targets at fixed locations, when available, is used to estimate the coordinate transformation between different data collection sessions, thus transforming the different sessions into the same coordinate frame. The proposed pipeline is employed for the mapping of a shipwreck off the coast of Barbados. For the first time, both the exterior and the accessible interior parts of the wreck were mapped in two sessions, while a third session employed two cameras with different fields of view.

Abstract:
Resource mapping and prospecting has become the focus of a number of proposed planetary exploration missions, particularly to locate water ice at the lunar south pole. Mobile robots, which are employed for exploration tasks in environments that are inaccessible to humans, collect the information in such missions. In these scenarios, intelligent and adaptive trajectory planning algorithms increase the accuracy of the resulting resource map, along with the efficiency with which information is gathered. In this work, we use ergodic search to generate a mobile robot trajectory that balances exploration and exploitation, while simultaneously mapping the spatial distribution of a resource by using Gaussian process regression with a spectral mixture kernel. The spatial correlation structure learned via Gaussian process regression informs the ergodic search about regions of high information, as well as the frequency components that appear in the map distribution. We call this method spectral mixture ergodic search (SM-ES) and demonstrate how it learns a map and updates the trajectory accordingly on three datasets: synthetic maps, an ice favorability index map for the lunar south polar region, and real mineral data from Cuprite, Nevada.
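
Illustrative sketch (not taken from the paper): the spectral mixture kernel referred to above has a standard one-dimensional form, k(tau) = sum_q w_q exp(-2 pi^2 tau^2 v_q) cos(2 pi tau mu_q), where each component has a weight w_q, a frequency mean mu_q, and a frequency variance v_q. The function name and the parameter values below are assumptions, and the one-dimensional input is a simplification of the two-dimensional maps in the abstract.

```python
import math
from typing import Sequence


def spectral_mixture_kernel(
    tau: float,
    weights: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
) -> float:
    """1-D spectral mixture kernel evaluated at input difference tau."""
    return sum(
        w * math.exp(-2.0 * math.pi ** 2 * tau ** 2 * v) * math.cos(2.0 * math.pi * tau * mu)
        for w, mu, v in zip(weights, means, variances)
    )


# Hypothetical two-component mixture (illustrative values).
weights, means, variances = [1.0, 0.5], [0.1, 0.8], [0.01, 0.05]
k_zero = spectral_mixture_kernel(0.0, weights, means, variances)
k_far = spectral_mixture_kernel(3.0, weights, means, variances)
```

The component means are the frequencies the kernel can represent, which is the sense in which the learned kernel exposes frequency components of the map.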

Abstract:
Online high-definition (HD) map construction is crucial for scaling autonomous driving systems. While Transformer-based methods have become prevalent in online HD map construction, most existing approaches overlook the inherent spatial dependencies and semantic relationships among map elements, which constrains their accuracy and generalization capabilities. To address this, we propose RelMap, an end-to-end framework that explicitly models both spatial relations and semantic priors to enhance online HD map construction. Specifically, we introduce a Class-aware Spatial Relation Prior, which explicitly encodes relative positional dependencies between map elements using a learnable class-aware relation encoder. Additionally, we design a Mixture-of-Experts-based Semantic Prior, which routes features to class-specific experts based on predicted class probabilities, refining instance feature decoding. RelMap is compatible with both single-frame and temporal perception backbones, achieving state-of-the-art performance on both the nuScenes and Argoverse 2 datasets.

Abstract:
Real-time dense scene reconstruction during unstable camera motions is crucial for robotics, yet current RGB-D SLAM systems fail when cameras experience large viewpoint changes, fast motions, or sudden shaking. Classical optimization-based methods deliver high accuracy but fail with poor initialization during large motions, while learning-based approaches provide robustness but lack sufficient accuracy for dense reconstruction. We address this challenge through a combination of learning-based initialization with optimization-based refinement. Our method employs a camera pose regression network to predict metric-aware relative poses from consecutive RGB-D frames, which serve as reliable starting points for a randomized optimization algorithm that further aligns depth images with the scene geometry. Extensive experiments demonstrate promising results: our approach outperforms the best competitor on challenging benchmarks, while maintaining comparable accuracy on stable motion sequences. The system operates in real-time, showcasing that combining simple and principled techniques can achieve both robustness for unstable motions and accuracy for dense reconstruction. Code released: https://github.com/siyandong/PROFusion.

Abstract:
Robots must understand their environment from raw sensory inputs and reason about the consequences of their actions in it to solve complex tasks. Behavior Cloning (BC) leverages task-specific human demonstrations to learn this knowledge as end-to-end policies. However, these policies are difficult to transfer to new tasks, and generating training data is challenging because it requires careful demonstrations and frequent environment resets. In contrast to such a policy-based view, in this paper we take a model-based approach where we collect a few hours of unstructured easy-to-collect play data to learn an action-conditioned visual world model, a diffusion-based action sampler, and optionally a reward model. The world model -- in combination with the action sampler and a reward model -- is then used to optimize long sequences of actions with a Monte Carlo Tree Search (MCTS) planner. The resulting plans are executed on the robot via a zeroth-order Model Predictive Controller (MPC). We show that the action sampler mitigates hallucinations of the world model during planning and validate our approach on 3 real-world robotic tasks with varying levels of planning and modeling complexity. Our experiments support the hypothesis that planning leads to a significant improvement over BC baselines on a standard manipulation test environment.

Abstract:
Robotic laser systems enable sub-millimeter, non-contact tissue resection, yet existing platforms lack volumetric planning and intraoperative feedback. We present RATS (Robot-Assisted Tissue Surgery), an intelligent optical coherence tomography (OCT)-guided robotic platform for autonomous volumetric soft tissue resection. RATS integrates macro-scale RGB-D imaging, micro-scale OCT, and a fiber-coupled surgical laser, calibrated through a novel multistage alignment pipeline that achieves OCT-to-laser calibration accuracy of 0.161 ± 0.031 mm. A super-Gaussian laser-tissue interaction (LTI) model characterizes ablation morphology with an average RMSE of 0.231 ± 0.121 mm, outperforming Gaussian baselines. A sampling-based model predictive control (MPC) framework operates directly on OCT voxel data to generate closed-loop, constraint-aware resection trajectories, achieving 0.842 mm RMSE (root-mean-square error) and improving intersection-over-union agreement by 64.8% compared to feedforward execution. RATS also detects and preserves subsurface structures. To our knowledge, this is the first demonstration of closed-loop autonomous volumetric robotic laser resection with OCT guidance, enabling precise, obstacle-aware tissue removal with potential in neurosurgery.
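
Illustrative sketch (not taken from the paper): a super-Gaussian profile generalizes a Gaussian by raising the exponent to a power, which flattens the top and steepens the walls. The radial form below, d(r) = A exp(-(r^2 / (2 sigma^2))^P), is the common textbook parameterization; the abstract does not state the exact LTI model, so the function name, parameters, and values are assumptions.

```python
import math


def super_gaussian(r: float, amplitude: float, sigma: float, order: float) -> float:
    """Super-Gaussian profile at radial distance r; order = 1 gives a plain Gaussian."""
    return amplitude * math.exp(-((r * r) / (2.0 * sigma * sigma)) ** order)


# Hypothetical crater parameters in millimetres (illustrative values).
depth_center = super_gaussian(0.0, amplitude=0.5, sigma=0.3, order=2.0)
depth_gauss = super_gaussian(0.15, amplitude=0.5, sigma=0.3, order=1.0)
depth_super = super_gaussian(0.15, amplitude=0.5, sigma=0.3, order=2.0)
```

Near the center the higher-order profile stays closer to the peak depth than the Gaussian does, which is the flat-bottomed shape a plain Gaussian cannot capture.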

Abstract:
Achieving quadruped robot locomotion across diverse and dynamic terrains presents significant challenges, primarily due to the discrepancies between simulation environments and real-world conditions. Traditional sim-to-real transfer methods often rely on manual feature design or costly real-world fine-tuning. To address these limitations, this paper proposes the DreamTIP framework, which incorporates Task-Invariant Properties learning within the Dreamer world model architecture to enhance sim-to-real transfer capabilities. Guided by large language models, DreamTIP identifies and leverages Task-Invariant Properties, such as contact stability and terrain clearance, which exhibit robustness to dynamic variations and strong transferability across tasks. These properties are integrated into the world model as auxiliary prediction targets, enabling the policy to learn representations that are insensitive to underlying dynamic changes. Furthermore, an efficient adaptation strategy is designed, employing a mixed replay buffer and regularization constraints to rapidly calibrate to real-world dynamics while effectively mitigating representation collapse and catastrophic forgetting. Extensive experiments on complex terrains, including Stair, Climb, Tilt, and Crawl, demonstrate that DreamTIP significantly outperforms state-of-the-art baselines in both simulated and real-world environments. Our method achieves an average performance improvement of 28.1% across eight distinct simulated transfer tasks. In the real-world Climb task, the baseline method achieved only a 10% success rate, whereas our method attained a 100% success rate. These results indicate that incorporating Task-Invariant Properties into Dreamer learning offers a novel solution for achieving robust and transferable robot locomotion.

Abstract:
Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.

Abstract:
This paper presents a novel framework designed to enhance key object identification in autonomous driving. Existing methods primarily focus on either detecting objects independently or leveraging visual relationships, but they do not explicitly consider the ego vehicle's perspective in determining object importance. To address this gap, we propose a structured approach that integrates a virtual ego-vehicle representation and a modular object state predictor, enabling a more accurate estimation of object behaviors relative to the ego-vehicle. Subsequently, our framework employs spatial-temporal reasoning to refine key object identification, prioritizing objects based on their states and relative spatial information rather than relying solely on visual relationships. Experimental results on real-world driving datasets demonstrate the effectiveness of our approach in accurately detecting critical objects in complex traffic environments.

Abstract:
Digital twins promise to enhance robotic manipulation by maintaining a consistent link between real-world perception and simulation. However, most existing systems struggle with the lack of a unified model, complex dynamic interactions, and the real-to-sim gap, which limits downstream applications such as model predictive control. Thus, we propose GaussTwin, a real-time digital twin that combines position-based dynamics with discrete Cosserat rod formulations for physically grounded simulation, and Gaussian splatting for efficient rendering and visual correction. By anchoring Gaussians to physical primitives and enforcing coherent SE(3) updates driven by photometric error and segmentation masks, GaussTwin achieves stable prediction-correction while preserving physical fidelity. Through experiments in both simulation and on a Franka Research 3 platform, we show that GaussTwin consistently improves tracking accuracy and robustness compared to shape-matching and rigid-only baselines, while also enabling downstream tasks such as push-based planning. These results highlight GaussTwin as a step toward unified, physically meaningful digital twins that can support closed-loop robotic interaction and learning.

Abstract:
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% → 97.1%). Code and visualizations are available at: jongwoopark7978.github.io/IVRA

Abstract:
Real-time sampling-based planners increasingly use learned sampling distributions for faster planning in autonomous vehicles. These planners employ a neural network to predict the optimal path and bias some samples toward the path. However, inherent prediction inaccuracies of the network often lead to suboptimal paths, especially in narrow spaces. Learned samples should be used carefully based on accuracy, as inaccurate samples can degrade planning performance. To address this problem, this paper proposes the Learned Adaptive Anytime TargetTree-RRT (LA3T) algorithm. The proposed planner introduces the adaptive biasing ratio. The approach learns to assess the reliability of the learned distribution using the network's confidence. This confidence approximates a proper ratio of learned samples used, thereby adaptively maximizing planning performance while considering a level of prediction accuracy. Furthermore, the LA3T algorithm incorporates the target tree algorithm. The goal pose is replaced with a set (target tree) of pre-defined optimal path segments, reducing computational efforts in narrow regions. Experiments in various driving tasks explore the benefits of each component through ablation studies. The proposed algorithm significantly increases the success rate and reduces the path length in simulated and real-world scenarios compared to other sampling-based methods.
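
Illustrative sketch (not taken from the paper): an adaptive biasing ratio can be pictured as the probability of drawing the next sample from the learned distribution rather than uniformly. Using the network confidence directly as that probability, and the names `sample_point`, `learned_sampler`, and `uniform_sampler`, are assumptions made for this example.

```python
import random
from typing import Callable, Tuple

Point = Tuple[float, float]


def sample_point(
    confidence: float,
    learned_sampler: Callable[[], Point],
    uniform_sampler: Callable[[], Point],
    rng: random.Random,
) -> Point:
    """Draw from the learned distribution with probability `confidence`, else uniformly."""
    ratio = min(1.0, max(0.0, confidence))  # clamp to a valid probability
    if rng.random() < ratio:
        return learned_sampler()
    return uniform_sampler()


rng = random.Random(0)
learned = lambda: (1.0, 1.0)                                     # stand-in for a network-guided sample
uniform = lambda: (rng.uniform(5.0, 10.0), rng.uniform(5.0, 10.0))  # stand-in for uniform sampling
trusted = sample_point(1.0, learned, uniform, rng)
distrusted = sample_point(0.0, learned, uniform, rng)
```

With a confidence of 1 every sample follows the prediction, and with 0 the planner falls back to plain uniform sampling; intermediate values mix the two.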

Abstract:
This paper tackles the challenges of automating battery removal from small electronic devices, such as heat cost allocators and smoke detectors. The process is critical for mitigating fire hazards caused by lithium batteries in recycling facilities and for supporting a circular economy. We focus on advanced methodologies and robotic technologies designed to overcome the significant hurdles posed by the diverse range of device designs, complex battery compartments, and varying states of damage. Our approach integrates Vision-Language Models (VLMs) for real-time, adaptive disassembly planning, computer vision, tactile skills, soft robotics, and reconfigurable robotic workcells to enhance perception, dexterity, and adaptability in handling diverse device designs and damage states. Additionally, a reconfigurable robotic workcell with modular hardware and standardized interfaces enables seamless adaptation to various devices. Laboratory testing demonstrates improved efficiency and reduced manual intervention, highlighting the potential of AI-driven, reconfigurable robotics for scalable and sustainable e-waste recycling.

Abstract:
The simulation-to-reality (sim-to-real) transfer of large-scale hydraulic robots presents a significant challenge in robotics because of the inherent slow control response and complex fluid dynamics. The complex dynamics result from the multiple interconnected cylinder structure and the difference in fluid rates of the cylinders. These characteristics complicate detailed simulation for all joints, making it unsuitable for reinforcement learning (RL) applications. In this work, we propose an analytical actuator model driven by hydraulic dynamics to represent the complicated actuators. The model predicts joint torques for all 12 actuators in under 1 microsecond, allowing rapid processing in RL environments. We compare our model with neural network-based actuator models and demonstrate the advantages of our model in data-limited scenarios. The locomotion policy trained in RL with our model is deployed on a hydraulic quadruped robot that weighs over 300 kg. This work is the first demonstration of a successful transfer of stable and robust command-tracking locomotion with RL on a heavy hydraulic quadruped robot, demonstrating advanced sim-to-real transferability.

Abstract:
Inspired by sled-pulling dogs in transportation, we present a cable-trailer integrated with a quadruped robot system. The motion planning of this system faces challenges due to the interactions between the cable's state transitions, the trailer's nonholonomic constraints, and the system's underactuation. To address these, we develop a hybrid dynamics model that captures the cable's taut and slack states. A search algorithm is then introduced to compute a suboptimal trajectory while incorporating mode transitions. Additionally, we propose a novel collision avoidance constraint based on geometric polygons to formulate the trajectory optimization problem for the hybrid system. The proposed method is implemented on a Unitree A1 quadruped robot with a customized cable-trailer and validated through experiments. The real system demonstrates both agile and safe motion with cable mode transitions.

Abstract:
Measurements and state estimates are often imperfect in control practice, posing challenges for safety-critical applications, where safety guarantees rely on accurate state information. In the presence of estimation errors, several prior robust control barrier function (R-CBF) formulations have imposed strict conditions on the input. These methods can be overly conservative and can introduce issues such as infeasibility, high control effort, etc. This work proposes a systematic method to improve R-CBFs, and demonstrates its advantages on a tracked vehicle that navigates among multiple obstacles. A primary contribution is a new optimization-based online parameter adaptation scheme that reduces the conservativeness of existing R-CBFs. In order to reduce the complexity of the parameter optimization, we merge several safety constraints into one unified numerical CBF via Poisson's equation. We further address the dual relative degree issue that typically causes difficulty in vehicle tracking. Experimental trials demonstrate the overall performance improvement of our approach over existing formulations.
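
Illustrative sketch (not taken from the paper): one simple way a bounded estimation error makes a CBF constraint more conservative. It assumes a one-dimensional single integrator (x' = u), the barrier h(x) = x - x_min, and a known bound on the estimation error; none of these choices, nor the name `robust_cbf_filter`, come from the abstract.

```python
def robust_cbf_filter(
    u_nominal: float,
    x_estimate: float,
    x_min: float,
    alpha: float,
    error_bound: float,
) -> float:
    """Minimally modify u_nominal so that u >= -alpha * h holds for the worst-case state.

    With |x - x_estimate| <= error_bound, the true barrier value satisfies
    h(x) >= (x_estimate - x_min) - error_bound, so enforcing the constraint
    on that lower bound keeps the true state safe.
    """
    h_worst_case = (x_estimate - x_min) - error_bound
    u_lower_bound = -alpha * h_worst_case
    return max(u_nominal, u_lower_bound)  # closed-form solution of the 1-D safety QP


# Hypothetical numbers: the nominal command pushes toward the boundary at x_min = 0.
u_exact = robust_cbf_filter(-5.0, x_estimate=1.0, x_min=0.0, alpha=2.0, error_bound=0.0)
u_robust = robust_cbf_filter(-5.0, x_estimate=1.0, x_min=0.0, alpha=2.0, error_bound=0.2)
```

The robust command is less aggressive than the one computed with perfect state knowledge, and the gap grows with `error_bound` and `alpha`; tuning such parameters online is the kind of adaptation the abstract describes at a high level.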

Abstract:
Safe and efficient obstacle avoidance in multi-robot navigation is a challenging problem, with deadlock being a key issue due to strict collision-free constraints. This paper proposes a novel Lateral Reciprocal Collision Avoidance (LRCA) strategy based on velocity obstacle theory to mitigate deadlock among multiple robots. Inspired by pedestrian collision avoidance, LRCA incorporates symmetric lateral displacement to resolve deadlock. Unlike the Optimal Reciprocal Collision Avoidance (ORCA) algorithm, which computes velocity constraints based on the minimal adjustment across the full velocity obstacle, our proposed method restricts the velocity changes of the agents to one randomly selected side of the relative velocity. This randomized directional selection strategy effectively prevents deadlock while preserving collision avoidance. This approach avoids conflicting velocity changes that could lead to mutual trapping. Theoretical analysis shows how ORCA causes deadlock using quadratic programming, Lagrangian functions, and KKT conditions, and how LRCA effectively prevents this. Extensive simulations across four benchmark multi-robot navigation scenarios show that LRCA outperforms existing algorithms in success rate, time to goal, path length, and computational efficiency.
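
Illustrative sketch (not taken from the paper): the velocity obstacle underlying both ORCA and LRCA is the set of relative velocities that lead to a collision within a time horizon. The test below checks membership for two disc-shaped robots moving at constant velocity; it does not implement LRCA's side selection, and the function name and numbers are assumptions.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def in_velocity_obstacle(
    p_a: Vec, v_a: Vec, p_b: Vec, v_b: Vec, combined_radius: float, horizon: float
) -> bool:
    """True if discs A and B collide within `horizon` seconds at constant velocities."""
    px, py = p_b[0] - p_a[0], p_b[1] - p_a[1]  # position of B relative to A
    vx, vy = v_a[0] - v_b[0], v_a[1] - v_b[1]  # velocity of A relative to B
    c = px * px + py * py - combined_radius ** 2
    if c <= 0.0:
        return True  # already overlapping
    a = vx * vx + vy * vy
    b = px * vx + py * vy
    if a == 0.0 or b <= 0.0:
        return False  # no relative motion, or moving apart
    disc = b * b - a * c
    if disc < 0.0:
        return False  # closest approach stays outside the combined radius
    t_enter = (b - math.sqrt(disc)) / a  # first time the distance equals the radius
    return t_enter <= horizon


# Two robots heading straight at each other, 4 m apart, closing at 2 m/s.
head_on = in_velocity_obstacle((0, 0), (1, 0), (4, 0), (-1, 0), combined_radius=1.0, horizon=5.0)
short_horizon = in_velocity_obstacle((0, 0), (1, 0), (4, 0), (-1, 0), combined_radius=1.0, horizon=1.0)
```

The perfectly symmetric head-on case is the classic situation in which reciprocal methods can stall, which is the deadlock the abstract targets.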

Abstract:
This study proposes a Stepwise Vision-Language-Action (VLA) framework for the robust detection and tracking of unknown objects in edge device environments (NVIDIA Jetson AGX Orin). Conventional end-to-end VLA models face challenges such as massive memory requirements and a "black-box" nature that complicates debugging. To address these issues, we adopt a modular architecture, specifically integrating Depth-Guided Gaussian Sampling with MobileSAM in the vision module. This approach achieves over 99% detection success for unlearned objects. Furthermore, we demonstrate real-time 6-DOF pose tracking at over 30 FPS through ORB feature matching and ROI-based localization following the initialization phase.

Abstract:
While joint torque sensors enable precise robot interactions, insufficient structural stiffness significantly limits control bandwidth and accuracy by reducing overall system rigidity. This study proposes a high-stiffness torque sensor based on a hybrid Scott-Russell (SR) and parallelogram (PL) flexure mechanism. The SR structure performs mechanical displacement amplification, ensuring high sensitivity even within a rigid design. By integrating the PL mechanism, the inherent parasitic rotation typically observed in conventional SR structures is effectively suppressed, ensuring pure translational motion between the capacitive electrodes. This hybrid flexure maximizes the capacitance change and achieves high sensing sensitivity while maintaining the high structural stiffness required for robust robotic joints. The proposed mechanism is validated through simulation, demonstrating its potential to ensure both system-level rigidity and high-resolution torque sensing.

Abstract:
Brain-Machine Interfaces (BMIs), which link the brain to external devices, hold great potential in rehabilitation, human performance augmentation, and human-centered robotics. However, invasive BMIs face a critical challenge for long-term deployment due to neural drift, which degrades decoding performance over time and necessitates frequent recalibration. Existing methods designed to mitigate neural drift typically rely on either domain adaptation (DA) or domain generalization (DG) alone and often fail to capture fine-grained distribution shifts across neural subdomains, resulting in limited performance. To overcome these limitations, we propose Uncertainty-guided Self-paced Cycling (UnSPC), a robust framework that synergizes DA and DG for target domain refining under an Uncertainty-guided Self-paced Pseudo-labeling (UnSPL) mechanism. To handle subdomain neural drift across domains, UnSPL is proposed to iteratively mine reliable pseudo-labeled samples with a noise-robust ranking strategy for further fine-tuning. Leveraging these high-quality samples, we introduce a novel Cycling Adaptation and Generalization (CycAG) strategy, which integrates DA and DG within an iterative cycle to progressively mitigate both global and subdomain drift. This cyclic process enables effective alignment to evolving target distributions while preserving robust and transferable representations, thereby mitigating performance degradation under long-term neural drifts. Extensive experiments on multiple neural decoding datasets demonstrate the effectiveness and robustness of UnSPC. To our knowledge, our proposed UnSPC is the first to cyclically integrate DA and DG with pseudo-labeling, paving the way toward stable long-term BMI controls.

Abstract:
In kinesthetic teaching, a robot is manually guided by a human operator to demonstrate a task. Most methods focus on replaying the recorded motion, but are agnostic to contact transitions, which can be critical when interacting with rigid environments. To overcome this limitation, we propose a framework that allows teaching motions in free space as well as in contact while preventing fast unintended contact transitions. This is accomplished by exploiting a projection-based unilateral damping force that increases close to contact. We derive an explicit analytical expression for the damping characteristics to ensure a safe stop before the contact when no further forces act on the robot. Furthermore, after the teaching, the recorded motion data is utilized to generate a time-optimized trajectory based on convex optimization, in which the contact transitions are explicitly considered. We validated our framework in experiments with a torque-controlled manipulator.

Abstract:
We address trajectory planning for tractor-trailer robots, where additional trailers increase transport capacity but introduce complex nonholonomic kinematics, high-dimensional states, and deformable structures. We propose a lightweight, compact, high-order smooth trajectory representation and an efficiently solvable spatiotemporal optimization formulation. To handle deformability and collision avoidance, we directly deform trajectories in continuous space by exploiting collision-free regions, eliminating the need to build convex safe sets from seed points before each optimization. This avoids loss of feasible space and reduces sensitivity to initial guesses. A multi-terminal fast path search further provides high-quality initialization. Extensive simulations show severalfold efficiency gains over existing methods, with lower curvature and shorter durations. Real-world indoor and outdoor experiments on transport, loading, and unloading validate effectiveness. Code: https://github.com/Tracailer/Tracailer.

Abstract:
This study introduces a visual scene understanding (VSU) pipeline that fuses scene graph generation (SGG) with task planning for agricultural robots. Mask R-CNN detects fruits, leaves, and stems; object features feed heads for predicates and attributes such as rigidity and ripeness. The resulting graph triggers a rule-based planner that chooses among harvesting, pruning, or thinning and decides on single- or dual-arm execution. Evaluated on a re-annotated custom dataset, the full pipeline reaches 38.9% relationship R@50, 70.1% attribute R@50, 72.3% task-decision accuracy, and 53.7% cooperative-control accuracy. Results show dual-arm selection is twice as sensitive to perception errors as task type assignment. The work provides an agriculture-specific task planning approach that distinguishes flexible from rigid obstacles, demonstrating that relational and attribute information improves perception in agricultural scenes.

Abstract:
Acoustic inspection is crucial for infrastructure maintenance, but its effectiveness is often hampered by environmental noise. Conventional denoising methods rely on prior knowledge or training data, limiting their practicability. This paper presents Zero-Shot Denoiser, a novel approach achieving noise reduction without pre-collected target sound samples or noise knowledge. Our method synergistically combines Blind Signal Separation (BSS) for unsupervised audio decomposition and Artifact-Resilient Attention (AR-Attention) for text-guided audio reconstruction. AR-Attention leverages pre-trained audio-language models and dual normalization to mitigate BSS artifacts and identify target sounds semantically. We introduce pseudo Signal-to-Noise Ratio, derived from the audio-language model, for automatic BSS hyperparameter optimization. In experiments using public datasets, our method, operating in a true zero-shot setting, achieved performance comparable to that of state-of-the-art supervised denoising methods, and experiments targeting hammering tests confirmed the effectiveness of our approach for real-world acoustic inspections. Our approach overcomes the limitations of data-dependent techniques and offers a versatile noise reduction solution for acoustic inspection and broader acoustic tasks.

Abstract:
This paper presents a novel calibration system for odometric sensors using an Adaptive Particle Filter (Adaptive-PF) to achieve high-precision pose estimation and improve localization in wheeled mobile robots. The system compensates for intrinsic systematic errors in the odometric sensor by adjusting its parameters in real time. Likewise, a comparative analysis of resampling methods (multinomial, stratified, systematic, and residual resampling) is conducted to evaluate their impact on calibration performance. The system validation is demonstrated by its implementation in an autonomous wheelchair, where the localization module integrates wheel encoders, an Inertial Measurement Unit (IMU), and a LIDAR sensor, providing robust navigation in dynamic environments. Experimental results demonstrate that the systematic approach and resampling based on the effective number of particles, N(eff), yield the best performance. Additionally, the system dynamically adjusts prediction error based on the differences between LIDAR and odometry data. It also adapts the number of particles according to the dispersion and uncertainty, optimizing computational time without sacrificing accuracy. The proposed system outperforms another well-known method, namely the DKF (Dual Kalman Filter). Consequently, this research introduces a new Adaptive-PF for odometric parameter calibration under changing conditions.
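
Illustrative sketch (not taken from the paper): systematic resampling and the effective number of particles, N(eff) = 1 / sum(w_i^2) for normalized weights, in their standard textbook form. The rule of resampling when N(eff) drops below half the particle count is a common convention assumed here, not a value from the abstract.

```python
import random
from typing import List, Sequence


def effective_sample_size(weights: Sequence[float]) -> float:
    """N_eff = 1 / sum(w_i^2), computed after normalizing the weights."""
    total = sum(weights)
    return 1.0 / sum((w / total) ** 2 for w in weights)


def systematic_resample(weights: Sequence[float], rng: random.Random) -> List[int]:
    """Return particle indices selected by systematic resampling."""
    n = len(weights)
    total = sum(weights)
    offset = rng.random() / n  # one random offset shared by all n evenly spaced pointers
    indices: List[int] = []
    i = 0
    cumulative = weights[0] / total
    for k in range(n):
        position = offset + k / n
        # Advance to the particle whose cumulative weight interval contains the pointer.
        while position > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices


weights = [0.7, 0.1, 0.1, 0.1]
n_eff = effective_sample_size(weights)
if n_eff < 0.5 * len(weights):  # assumed threshold
    indices = systematic_resample(weights, random.Random(0))
else:
    indices = list(range(len(weights)))
```

Because all pointers share a single random offset, systematic resampling has lower variance than multinomial resampling, which is one plausible reason for the result reported above.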

Abstract:
An aircraft's airspeed, angle of attack, and angle of side slip are crucial to its safety, especially when flying close to the stall regime. Various solutions exist, including pitot tubes, angular vanes, and multihole pressure probes. However, current sensors are either too heavy (>30 g) or require large airspeeds (>20 m/s), making them unsuitable for small uncrewed aerial vehicles. We propose a novel multihole pressure probe, integrating sensing electronics in a single-component structure, resulting in a mechanically robust and lightweight sensor (9 g), which we released to the public domain. Since there is no consensus on two critical design parameters, tip shape (conical vs spherical) and hole spacing (distance between holes), we provide a study on measurement accuracy and noise generation using wind tunnel experiments. The sensor is calibrated using a multivariate polynomial regression model over an airspeed range of 3-27 m/s and an angle of attack/sideslip range of +/-35 deg, achieving a mean absolute error of 0.44 m/s and 0.16 deg. Finally, we validated the sensor in outdoor flights near the stall regime. Our probe enabled accurate estimations of airspeed, angle of attack and sideslip during different acrobatic manoeuvres. Due to its size and weight, this sensor will enable safe flight for lightweight, uncrewed aerial vehicles flying at low speeds close to the stall regime.

Abstract:
Multi-modal 3D object detection is pivotal for autonomous driving, integrating complementary sensors like LiDAR and cameras. However, its real-world reliability is challenged by transient data interruptions and missing data, where modalities can momentarily drop due to hardware glitches, adverse weather, or occlusions. This poses a critical risk, especially during a simultaneous modality drop, where the vehicle is momentarily blind. To address this problem, we introduce ModalPatch, the first plug-and-play module designed to enable robust detection under arbitrary modality-drop scenarios. Without requiring architectural changes or retraining, ModalPatch can be seamlessly integrated into diverse detection frameworks. Technically, ModalPatch leverages the temporal nature of sensor data for perceptual continuity, using a history-based module to predict and compensate for transiently unavailable features. To improve the fidelity of the predicted features, we further introduce an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of compensated features, suppressing biased signals while reinforcing informative ones. Extensive experiments show that ModalPatch consistently enhances both robustness and accuracy of state-of-the-art 3D object detectors under diverse modality-drop conditions. Code will be available at https://github.com/Castiel-Lee/MM3Det_MD.

Abstract:
Robotic grasping has been employed in various industrial, household, and medical applications. However, neglecting the final grasping state and objects' stiffness and affordance, prevailing strategies predominantly emphasize the gripper's initial state upon reaching grasp positions and often fail due to damage or grasp slippage. Here, we propose an AdapGrasp strategy with a dataset named AdapGraspDataset and a corresponding model named AdapGraspNet. The dataset focuses on the object stiffness and grasp affordance. Specifically, for objects with different stiffness properties, the corresponding final grasp width (FGW) is annotated to ensure the object's intactness. For objects' grasp affordance properties, higher grasp affordance weight (GAW) is typically annotated closer to the centroid, increasing grasp stability. Meanwhile, to output the set of grasping configurations (initial and final grasp states) more accurately, a denoising principle is introduced to build a corresponding transformer-based model. It enables more accurate convergence of FGW and GAW, achieving a precision of 98.04% and a mean absolute final width error of 2.71 pixels. Finally, extensive real-world experiments are conducted, where the AdapGrasp strategy ensures the intactness of fragile objects and thus enhances grasping stability without any additional sensors. It achieves a grasping accuracy of 95% and yields a 19.5% improvement compared with those without FGW and GAW. The AdapGrasp strategy is publicly available at https://embodied-soft-intelligence.github.io/AdapGrasp/.

Abstract:
Autonomous robot exploration (ARE) is the process of a robot autonomously navigating and mapping an unknown environment. Recent Reinforcement Learning (RL)-based approaches typically formulate ARE as a sequential decision-making problem defined on a collision-free informative graph. However, these methods often demonstrate limited reasoning ability over graph-structured data. Moreover, due to the insufficient consideration of robot motion, the resulting RL policies are generally optimized to minimize travel distance, while neglecting time efficiency. To overcome these limitations, we propose GRATE, a Deep Reinforcement Learning (DRL)-based approach that leverages a Graph Transformer to effectively capture both local structure patterns and global contextual dependencies of the informative graph, thereby enhancing the model's reasoning capability across the entire environment. In addition, we deploy a Kalman filter to smooth the waypoint outputs, ensuring that the resulting path is kinodynamically feasible for the robot to follow. Experimental results demonstrate that our method exhibits better exploration efficiency (up to 21.5% in distance and 21.3% in time to complete exploration) than state-of-the-art conventional and learning-based baselines in various simulation benchmarks. We also validate our planner in real-world scenarios.
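
To make the waypoint smoothing idea concrete, here is a minimal sketch of a Kalman filter applied to a sequence of 2D waypoints. The abstract does not specify the filter design, so the constant-velocity state model, the noise parameters `q` and `r`, and the function name `kalman_smooth_waypoints` are assumptions chosen for illustration, not the GRATE implementation.

```python
import numpy as np


def kalman_smooth_waypoints(waypoints, dt=1.0, q=0.01, r=0.25):
    """Filter 2D waypoints with an assumed constant-velocity model.

    State is [x, y, vx, vy]; each waypoint is treated as a noisy position
    measurement. q and r are hypothetical process and measurement variances.
    """
    waypoints = np.asarray(waypoints, dtype=float)
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # position-only measurement
    Q = q * np.eye(4)
    R = r * np.eye(2)

    # Start at the first waypoint with zero velocity.
    x = np.array([waypoints[0, 0], waypoints[0, 1], 0.0, 0.0])
    P = np.eye(4)
    smoothed = [x[:2].copy()]

    for z in waypoints[1:]:
        # Predict step.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step with the new waypoint as the measurement.
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        smoothed.append(x[:2].copy())

    return np.array(smoothed)


if __name__ == "__main__":
    zigzag = [[0, 0], [1, 1], [2, -1], [3, 1], [4, -1]]
    print(kalman_smooth_waypoints(zigzag))
```

Raising `r` relative to `q` makes the filter trust its motion model more and the raw waypoints less, which trades responsiveness for a smoother path.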

Abstract:
4D radars, which provide 3D point cloud data along with Doppler velocity, are attractive components of modern automated driving systems due to their low cost and robustness under adverse weather conditions. However, they provide a significantly lower point cloud density than LiDAR sensors. This makes it important to exploit not only local but also global contextual scene information. This paper proposes DRIFT, a model that effectively captures and fuses both local and global contexts through a dual-path architecture. The model incorporates a point path to aggregate fine-grained local features and a pillar path to encode coarse-grained global features. These two parallel paths are intertwined via novel feature-sharing layers at multiple stages, enabling full utilization of both representations. DRIFT is evaluated on the widely used View-of-Delft (VoD) dataset and a proprietary internal dataset. It outperforms the baselines on the tasks of object detection and/or free road estimation. For example, DRIFT achieves a mean average precision (mAP) of 52.6% (compared to, say, 45.4% of CenterPoint) on the VoD dataset.

Abstract:
Back injuries resulting from manual material handling have long constituted a prominent threat to occupational safety. While back-support exosuits offer the potential to augment human strength, their practical implementation is hindered by persistent challenges pertaining to comfort and safety. Drawing inspiration from human biomechanics and muscle behavior, we develop a lightweight assistive exosuit that synchronizes with natural load-handling rhythms. By integrating a deployable Kresling origami structure with a two-stage transmission mechanism, a single motor can sequentially assist both the waist and arms, achieving motion-conforming support with minimal complexity. An energy-aware compliance control strategy allows the system to yield passively during unassisted motion, avoiding interference with voluntary human behavior. We propose an event-triggered impedance control strategy based on an energy tank framework, which adaptively intervenes only when interaction energy exceeds safety thresholds. Experimental results demonstrate substantial reductions in muscle activation during load-handling tasks, with decreases of up to 22.8%, 15.4%, and 14.8% in the biceps, triceps, and erector spinae (MVC%), respectively.

Abstract:
Accurately assessing brain activity to modulate training parameters online is crucial for improving human-robot cognitive interaction (HRCI) performance in closed-loop brain training. The major challenge for this technique lies in how to accurately model and characterize the intrinsic behavior of brain activity in the HRCI process, which typically varies dynamically across spatial and temporal scales. In this study, we propose a dynamic perspective to visualize the spatiotemporal evolution of brain activity during the HRCI process, thus enabling assessment of brain states and adaptive modulation during rehabilitation. A novel framework is developed to model the spatiotemporal dynamics of brain activity by integrating deterministic learning with neural population theory. It demonstrates a remarkable capability to mine and visualize the complex nonlinear dynamics of brain activity, encompassing both temporal evolution and spatial connectivity patterns. The proposed model not only visualizes spatiotemporal brain dynamics but also enables online assessment of brain states, which can facilitate optimal modulation of the HRCI process and improve brain training efficiency. The method is validated using a panoramic virtual reality system. Results show that our method improves the accuracy of brain activity assessment by 8.86%, effectively demonstrating that it accurately visualizes spatiotemporal brain dynamics and enhances training outcomes when integrated with HRCI.

Abstract:
Recent advances in visuotactile sensors increasingly employ biomimetic curved surfaces to enhance sensorimotor capabilities. Although such curved visuotactile sensors enable more conformal object contact, their perceptual quality is often degraded by non-uniform illumination, which reduces reconstruction accuracy and typically necessitates calibration. Existing calibration methods commonly rely on customized indenters and specialized devices to collect large-scale photometric data, but these processes are expensive and labor-intensive. To overcome these calibration challenges, we present NLiPsCalib, a physics-consistent and efficient calibration framework for curved visuotactile sensors. NLiPsCalib integrates controllable near-field light sources and leverages Near-Light Photometric Stereo (NLiPs) to estimate contact geometry, simplifying calibration to just a few simple contacts with everyday objects. We further introduce NLiPsTac, a controllable-light-source tactile sensor developed to validate our framework. Experimental results demonstrate that our approach enables high-fidelity 3D reconstruction across diverse curved form factors with a simple calibration procedure. We emphasize that our approach lowers the barrier to developing customized visuotactile sensors of diverse geometries, thereby making visuotactile sensing more accessible to the broader community.

Abstract:
Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, it has been proposed to use large language models (LLMs) to synthesize mission plans in precision agriculture and other domains based on mission descriptions provided in natural language (NL). While these systems demonstrate impressive performance, they also suffer from the inherent ambiguities of NL. In this paper, we address this issue by introducing a planning architecture that combines LLMs with linear temporal logic (LTL) to ensure that, through formal verification, the mission planning system meets the specifications formulated by the user while still using NL. In our proposed system, the mission plan is seen as the implementation and the LTL formalization is seen as the specification. Both are automatically extracted from mission descriptions provided in NL. To mitigate potential bias, two separate LLMs are tasked with the implementation and specification generation. Through feedback loops, the system self-corrects when syntax or verification errors are encountered, thus offering a fully hands-off solution. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.

Abstract:
Learning-based model predictive control (MPC) can enhance control performance by correcting for model inaccuracies, enabling more precise state trajectory predictions than traditional MPC. A common approach is to model unknown residual dynamics as a Gaussian process (GP), which leverages data and also provides an estimate of the associated uncertainty. However, the high computational cost of online learning poses a major challenge for real-time GP-MPC applications. This work presents an efficient implementation of an approximate spatio-temporal GP model, offering online learning at constant computational complexity. It is optimized for GP-MPC, where it enables improved control performance by learning more accurate system dynamics online in real-time, even for time-varying systems. The performance of the proposed method is demonstrated by simulations and hardware experiments in the exemplary application of autonomous miniature racing.

Abstract:
Classical robotic systems typically rely on custom planners designed for constrained environments. While effective in restricted settings, these systems lack generalization capabilities, limiting the scalability of embodied AI and general‑purpose robots. Recent data‑driven Vision‑Language‑Action (VLA) approaches aim to learn policies from large‑scale simulation and real‑world data. However, the continuous action space of the physical world significantly exceeds the representational capacity of linguistic tokens, making it unclear if scaling data alone can yield general robotic intelligence. To address this gap, we propose ActionReasoning, an LLM-driven framework that performs explicit action reasoning to produce physics-consistent, prior-guided decisions for robotic manipulation. ActionReasoning leverages the physical priors and real-world knowledge already encoded in Large Language Models (LLMs) and structures them within a multi-agent architecture. We instantiate this framework on a tractable case study of brick stacking, where the environment states are assumed to be already accurately measured. The environmental states are then serialized and passed to a multi-agent LLM framework that generates physics-aware action plans. The experiments demonstrate that the proposed multi-agent LLM framework enables stable brick placement while shifting effort from low-level domain-specific coding to high-level tool invocation and prompting, highlighting its potential for broader generalization. This work introduces a promising approach to bridging perception and execution in robotic manipulation by integrating physical reasoning with LLMs.

Abstract:
We present a new causal transformer system consisting of Spatial-Attention Tokenization (SAT) with Multi-Resolution Causal Temporal Mixing (MRCTM) to perform online skeleton-based action recognition during human-robot interaction. The novel architecture uses SAT to generate soft tokens from human joint groups. MRCTM performs causal convolutions and self-attention operations to detect both detailed motion patterns and extended temporal relationships. We introduce the GoHAR-12 dataset for evaluation; it contains 12 gesture and posture classes that are recorded in human-robot interaction (HRI) settings and translate directly to high-level commands for the Unitree Go1 quadruped. The proposed model reaches 98.4% accuracy on the GoHAR-12 dataset, shows superior performance in distinguishing between actions with similar motion, and maintains strong results on public benchmarks such as NTU-RGB+D and NW-UCLA. We demonstrate that the causal transformer enables reliable real-time skeleton-based control of the Unitree Go1 robot.
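
The property that makes a temporal model usable online is causality: the output at frame t may depend only on frames up to t. The sketch below shows a plain causal 1D convolution with optional dilation, where different dilations give different temporal resolutions. It is a generic illustration only; the abstract does not describe the internals of MRCTM, and the function name `causal_conv1d` and its parameters are assumptions for this example.

```python
import numpy as np


def causal_conv1d(x, kernel, dilation=1):
    """Causal 1D convolution: y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before the start of the sequence are treated as zero, so y[t]
    never depends on any x[s] with s > t.
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    pad = (len(kernel) - 1) * dilation
    # Left-pad with zeros so that every tap has a (possibly zero) sample to read.
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for t in range(len(x)):
        for k, w in enumerate(kernel):
            y[t] += w * xp[pad + t - k * dilation]
    return y


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    print(causal_conv1d(signal, [1.0, 1.0]))              # fine temporal scale
    print(causal_conv1d(signal, [1.0, 1.0], dilation=2))  # coarser temporal scale
```

Stacking such layers with increasing dilation widens the temporal receptive field without ever looking at future frames, which is what allows frame-by-frame inference during interaction.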

Abstract:
Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision-language models. An object-centric and action-centric diffusion policy FuncDiffuser trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim-to-real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon-anonymous.

Abstract:
Spinning flexible objects, exemplified by traditional Chinese handkerchief performances, demands periodic steady-state motions under nonlinear dynamics with frictional contacts and boundary constraints. To address these challenges, we first design an intuitive dexterous wrist based on a parallel anti-parallelogram tendon-driven structure, which achieves 90 degrees omnidirectional rotation with low inertia and decoupled roll-pitch sensing, and implement a high-low level hierarchical control scheme. We then develop a particle-spring model of the handkerchief for control-oriented abstraction and strategy evaluation. Hardware experiments validate this framework, achieving an unfolding ratio of approximately 99% and fingertip tracking error of RMSE = 2.88 mm in high-dynamic spinning. These results demonstrate that integrating control-oriented modeling with a task-tailored dexterous wrist enables robust rest-to-steady-state transitions and precise periodic manipulation of highly flexible objects. More visualizations: https://slowly1113.github.io/icra2026-handkerchief/

Abstract:
This paper presents a unified decision-making framework that integrates Hybrid Markov Decision Processes (HMDPs) with Model Predictive Control (MPC), augmented by velocity-dependent safety margins and a prediction-aware hysteresis mechanism. Both the ego and surrounding vehicles are modeled as HMDPs, allowing discrete maneuver transition and kinematic evolution to be jointly considered within the MPC optimization. Safety margins derived from the Intelligent Driver Model (IDM) adapt to traffic context but vary with speed, which can cause oscillatory decisions and velocity fluctuations. To mitigate this, we propose a frozen-release hysteresis mechanism with distinct trigger and release thresholds, effectively enlarging the reaction buffer and suppressing oscillations. Decision continuity is further safeguarded by a two-layer recovery scheme: a global bounded relaxation tied to IDM margins and a deterministic fallback policy. The framework is evaluated through a case study, an ablation against a no-hysteresis baseline, and large-scale randomized experiments across 18 traffic settings. Across 8,050 trials, it achieves a collision rate of only 0.05%, with 98.77% of decisions resolved by nominal MPC and minimal reliance on relaxation or fallback. These results demonstrate the robustness and adaptability of the proposed decision-making framework in heterogeneous traffic conditions.
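
The effect of using distinct trigger and release thresholds can be shown with a generic two-threshold hysteresis switch. This is a sketch of the general idea only, not the paper's frozen-release mechanism: the class name `HysteresisSwitch`, the use of a scalar gap as the monitored quantity, and the threshold values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class HysteresisSwitch:
    """Two-threshold switch on a scalar gap (hypothetical units: metres).

    The switch activates when the gap falls below `trigger` and only
    deactivates once the gap rises above the larger `release` threshold.
    """
    trigger: float
    release: float
    active: bool = False

    def __post_init__(self):
        if self.release <= self.trigger:
            raise ValueError("release threshold must exceed trigger threshold")

    def update(self, gap: float) -> bool:
        if not self.active and gap < self.trigger:
            self.active = True
        elif self.active and gap > self.release:
            self.active = False
        return self.active


if __name__ == "__main__":
    gaps = [12.0, 9.5, 10.5, 9.8, 11.0, 13.5]
    switch = HysteresisSwitch(trigger=10.0, release=13.0)
    print([switch.update(g) for g in gaps])
    # A single threshold at 10.0 would flip on every sample that crosses it.
    print([g < 10.0 for g in gaps])
```

With the gap hovering around 10, the single-threshold rule toggles repeatedly, while the two-threshold switch stays active until the gap has clearly recovered, which is the oscillation-suppressing behaviour a hysteresis band is meant to provide.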

Abstract:
Generating collision-free motions in dynamic environments is a challenging problem for high-dimensional robotics, particularly under real-time constraints. Control Barrier Functions (CBFs), widely utilized in safety-critical control, have shown significant potential for motion generation. However, for high-dimensional robot manipulators, existing QP formulations and CBF-based methods rely on positional information, overlooking higher-order derivatives such as velocities. This limitation may lead to reduced success rates, decreased performance, and inadequate safety constraints. To address this, we construct time-varying CBFs (TVCBFs) that consider dynamic obstacles. Our approach leverages recent developments on distance fields for articulated manipulators, a differentiable representation that enables the mapping of objects' position and velocity into the robot's joint space, offering a comprehensive understanding of the system's interactions. This allows the manipulator to be treated as a point-mass system, thus simplifying motion generation tasks. Additionally, we introduce a time-varying control Lyapunov function (TVCLF) to enable whole-body contact motions. Our approach integrates the TVCBFs, TVCLF, and manipulator physical constraints within a unified QP framework. We validate our method through simulations and comparisons with state-of-the-art approaches, demonstrating its effectiveness on a 7-axis Franka arm in real-world experiments. Source code, experimental data and videos are available on the project webpage: https://sites.google.com/view/sdfcdf-tvcbfs-qp.

Abstract:
Trajectory optimizers for legged robots typically assume a single end effector on each leg, often a foot or wheel, without switching to another. Robots employing point-modeled end effectors, compared to those with wheeled end effectors, often benefit in adaptability and maneuverability but at the cost of higher energy expenditure and lower speed. While current hardware supports switching between these two end-effector types, existing research has largely focused on maintaining stability during switching, with little attention to determining when each type is most effective. To our knowledge, this paper introduces the first framework that simultaneously optimizes both trajectories and end-effector contact dynamics through mixed-integer optimization. We validate our approach by solving and executing trajectories with a whole-body controller in Gazebo across a variety of terrains, including ramps and stepping stones. The results show that our framework not only handles diverse terrains but also exploits contact dynamics to reduce cost of transport and increase speed compared to foot-only locomotion.

Abstract:
Collaborative transportation of cable-suspended payloads by teams of Unmanned Aerial Vehicles (UAVs) has the potential to enhance payload capacity, adapt to different payload shapes, and provide built-in compliance, making it attractive for applications ranging from disaster relief to precision logistics. However, multi-UAV coordination under disturbances, nonlinear payload dynamics, and slack-taut cable modes remains a challenging control problem. To our knowledge, no prior work has addressed these cable mode transitions in the multi-UAV context, instead relying on simplifying rigid-link assumptions. We propose CrazyMARL, a decentralized Reinforcement Learning (RL) framework for multi-UAV cable-suspended payload transport. Simulation results demonstrate that the learned policies can outperform classical decentralized controllers in terms of disturbance rejection and tracking precision, achieving an 80% recovery rate from harsh conditions compared to 44% for the baseline method. We also achieve successful zero-shot sim-to-real transfer and demonstrate that our policies are highly robust under harsh conditions, including wind, random external disturbances, and transitions between slack and taut cable dynamics. This work paves the way for autonomous, resilient UAV teams capable of executing complex payload missions in unstructured environments. Code and videos can be found on the website: https://imrclab.github.io/CrazyMARL.

Abstract:
Local navigation in cluttered environments often suffers from dense obstacles and frequent local minima. Conventional local planners rely on heuristics and are prone to failure, while deep reinforcement learning (DRL)-based approaches provide adaptability but are constrained by limited onboard sensing. These limitations lead to navigation failures because the robot cannot perceive structures outside its field of view. In this paper, we propose DreamFlow, a DRL-based local navigation framework that extends the robot's perceptual horizon through conditional flow matching (CFM). The proposed CFM-based prediction module learns a probabilistic mapping between the local height map latent representation and a broader spatial representation conditioned on navigation context. This enables the navigation policy to predict unobserved environmental features and proactively avoid potential local minima. Experimental results demonstrate that DreamFlow outperforms existing methods in terms of latent prediction accuracy and navigation performance in simulation. The proposed method was further validated in cluttered real-world environments with a quadrupedal robot. The project page is available at https://dreamflow-icra.github.io.

Abstract:
We introduce MERGE, a system for situational grounding of actors, objects, and events in dynamic human-robot group interactions. Effective collaboration in such settings requires consistent situational awareness, built on persistent representations of people and objects and an episodic abstraction of events. MERGE achieves this by uniquely identifying physical instances of actors (humans or robots) and objects and structuring them into actor-action-object relations, ensuring temporal consistency across interactions. Central to MERGE is the integration of Vision-Language Models (VLMs) guided with a perception pipeline: a lightweight streaming module continuously processes visual input to detect changes and selectively invokes the VLM only when necessary. This decoupled design preserves the reasoning power and zero-shot generalization of VLMs while improving efficiency, avoiding both the high monetary cost and the latency of frame-by-frame captioning that leads to fragmented and delayed outputs. To address the absence of suitable benchmarks for multi-actor collaboration, we introduce the GROUND dataset, which offers fine-grained situational annotations of multi-person and human-robot interactions. On this dataset, our approach improves the average grounding score by a factor of 2 compared to the performance of VLM-only baselines--including GPT-4o, GPT-5 and Gemini 2.5 Flash--while also reducing run-time by a factor of 4. The code and data are available at www.github.com/HRI-EU/merge.

Abstract:
Functional Electrical Stimulation (FES) is a critical therapy for motor rehabilitation, yet the rapid onset of muscle fatigue severely limits its efficacy. This paper presents the design, implementation, and validation of a comprehensive, intelligent closed-loop FES system designed to provide effective force assistance by actively sensing FES-induced fatigue. The system integrates a pressure-based Mechanomyography (P_MMG) sensor for real-time feedback of muscle force capacity, a Kalman filter for robust signal estimation, and a fuzzy-logic-based Proportional-Integral-Derivative (PID) controller to modulate FES dynamically. The developed system was first validated in a comprehensive simulation and then tested with four healthy participants. The results demonstrate that the closed-loop fuzzy PID controller yielded a functionally meaningful improvement in performance over an open-loop-controlled protocol. The system substantially extended the duration of effective FES and, critically, delayed the onset of functional failure (indicated by a force drop >50%), with performance improvements showing a strong trend toward statistical significance (Wilcoxon signed-rank test, p = 0.0625). This work delivers a practical and effective solution for managing fatigue during FES therapy, holding the potential to significantly enhance rehabilitation outcomes.

Abstract:
Autonomous docking between Unmanned Aerial Vehicles (UAVs) and ground robots is essential for heterogeneous systems, yet most existing approaches target wheeled platforms whose limited mobility constrains exploration in complex terrains. Quadruped robots offer superior adaptability but undergo frequent posture variations, making it difficult to provide a stable landing surface for UAVs. To address these challenges, we propose an autonomous UAV-quadruped docking framework for GPS-denied environments. On the quadruped side, a Hybrid Internal Model with Horizontal Alignment (HIM-HA), learned via deep reinforcement learning, actively stabilizes the torso to provide a level platform. On the UAV side, a three-phase strategy is adopted, consisting of long-range acquisition with a median-filtered YOLOv8 detector, close-range tracking with a constraint-aware controller that integrates a Nonsingular Fast Terminal Sliding Mode Controller (NFTSMC) and a logarithmic Barrier Function (BF) to guarantee finite-time error convergence under field-of-view (FOV) constraints, and terminal descent guided by a Safety Period (SP) mechanism that jointly verifies tracking accuracy and platform stability. The proposed framework is validated in both simulation and real-world scenarios, successfully achieving docking on outdoor staircases higher than 17 cm and rough slopes steeper than 30 degrees. Supplementary material and videos are available at: https://uav-quadruped-docking.github.io.

Abstract:
Autonomous robots are increasingly being used in the field of scientific exploration and data acquisition. In particular, the use of robotic systems for mapping and sampling of species is becoming widespread in both aerial and underwater domains. However, the problem of choosing where to sample is challenging when the phenomena of interest are discrete and sparsely distributed in space or time, such as when mapping a particular benthic species. In this paper we present a hierarchical online learning framework for reasoning about species distribution in real time, in order to inform sampling decisions. Drawing inspiration from the Species Distribution Modelling community, a hierarchical probabilistic model is developed using the Integrated Nested Laplace Approximation framework that enables online inference about expected target hotspots using predicted substrate distributions. Model parameters are learned online to build a prediction over the discrete targets, and the model is integrated into an anytime online planner to enable adaptive path planning. The hierarchical learning approach is demonstrated on simulated synthetic environments and shown to consistently outperform baseline methods such as Gaussian Process regression and boustrophedon coverage approaches when robot resources are constrained.

Abstract:
This paper presents QuadSoft, a novel fully actuated quadrotor equipped with continuous-curvature, tendon-driven soft robotic arms. The design combines a semi-rigid central frame with flexible arms, enabling controlled structural reconfiguration during flight without altering the propeller layout. Unlike existing soft aerial platforms that rely on discrete bending joints, QuadSoft utilizes a continuum deformation approach to modulate arm curvature, actively adjusting its thrust vector and aerodynamic characteristics. We characterize the geometric mapping between servomotor input and the resulting constant curvature, validating it experimentally. Outdoor flight tests demonstrate stable take-off, hover, directional maneuvers, and landing, confirming that controlled arm bending can generate horizontal displacement while preserving altitude. Measurements of pitch, roll, and curvature angles show that the platform follows intended actuation patterns with minimal attitude deviations. These results demonstrate that QuadSoft preserves the baseline stability of rigid quadrotors while enabling morphology-driven maneuverability, all under the standard PX4 autopilot without retuning. Beyond a proof of concept, this work establishes a distinctive outdoor validation of a tendon-driven continuum morphing quadrotor, opening a new research avenue toward adaptive aerial systems that combine the safety and versatility of soft robotics with the performance of conventional UAVs.

Abstract:
Robotic hand exoskeletons hold immense potential for enhancing human hand functionality, addressing the hand's strength limitations and fatigue during physically demanding tasks. However, most existing hand exoskeletons are motorized and are weak in generating high supporting force for gripping augmentation. We present a non-motorized hand exoskeleton based on magnetorheological (MR) actuators to provide high gripping support and elevate grip endurance. Meanwhile, it harnesses human energy for actuation and energy storage, enhancing grip strength without external power. The MR actuator demonstrates a peak holding force of 1046 N with merely 5 W power input, boasting a force-to-power ratio one order of magnitude higher than conventional approaches, and a 97.7% energy reduction for the same holding force compared to other approaches. Participants wearing the hand exoskeletons experience a 41.8% enhancement in grip strength without external power and reduced hand muscle fatigue during prolonged physical labor. In rescue scenarios such as post-earthquake rescue, debris clearance, and casualty evacuation, our exoskeleton effectively supports gripping and improves working efficiency.

Abstract:
This study introduces the soft tactile electromagnetic (STEM) actuator, a compact and wearable haptic device designed to deliver multimodal tactile feedback in virtual environments. The actuator employs soft materials as both an energy-storing and encasing structure, enabling out-of-plane deformations in response to arbitrary input signals while ensuring high wearability. Magnetic reinforcements, including a soft magnetic cap and a ferromagnetic pole piece, minimize magnetic flux leakage, effectively amplifying output force along with protrusion to enable precise and varied haptic feedback. The actuator generates multimodal tactile stimuli, including force, impulse, and vibration, surpassing conventional vibrotactile devices in delivering more varied and dynamic feedback. Experimental evaluation of the actuator's mechanical performance demonstrates its ability to produce both low- and high-frequency tactile feedback. A user study evaluating perception thresholds and signal recognition accuracy found that participants identified eight distinct tactile signals with an average accuracy of 91%, confirming the actuator's capacity to deliver distinguishable multimodal feedback. These findings underscore the feasibility of the STEM actuator for immersive haptic interactions and highlight its potential applications in virtual reality.

Abstract:
Unsupervised domain adaptation (UDA) frameworks have shown good generalization capabilities for 3D point cloud semantic segmentation models on clean data. However, existing works overlook adversarial robustness when the source domain itself is compromised. To comprehensively explore the robustness of UDA frameworks, we first design a stealthy adversarial point cloud generation attack that can significantly contaminate datasets with only minor perturbations to the point cloud surface. Based on that, we propose a novel dataset, AdvSynLiDAR, comprising synthesized contaminated LiDAR point clouds. With the generated corrupted data, we further develop the Adversarial Adaptation Framework (AAF) as the countermeasure. Specifically, by extending the key point sensitive loss towards the Robust Long-Tailed loss and utilizing a decoder branch, our approach enables the model to focus on long-tailed classes during the pre-training phase and leverages high-confidence decoded point cloud information to restore point cloud structures during the adaptation phase. We evaluated our AAF method on the AdvSynLiDAR dataset, where the results demonstrate that it can mitigate performance degradation under source adversarial perturbations for UDA in 3D point cloud segmentation.

Abstract:
Elastomer-based soft manipulators, featuring fibre-reinforced chambers, represent a prevalent design paradigm in the field of soft robotics. These robots incorporate multiple reinforced actuation chambers, enabling robust elongation and omni-directional bending motions. However, the inherent compliance of materials and the pressurised chambers inevitably introduce significant nonlinearity to these soft robots. Moreover, design of such robots often relies on a trial-and-error approach. Consequently, a comprehensive robot prototyping framework is of paramount importance. To achieve this, we present a static modelling, design and evaluation framework for soft robots with densely reinforced chambers (i.e., the angle between the reinforcement fibre and the axial direction of soft robots is 90°). We first propose a static analytical modelling framework to achieve both the forward kinematics and tip force generation modelling of soft robots. This modelling framework accommodates the effects of pressurised chambers and (non)linear material behaviours. Furthermore, our design and evaluation framework incorporates an open-accessible simulation toolbox with a user-friendly grap

Abstract:
Robots in uncertain real-world environments must perform both goal-directed and exploratory actions. However, most deep learning-based control methods neglect exploration and struggle under uncertainty. To address this, we adopt deep active inference, a framework that accounts for human goal-directed and exploratory actions. Yet, conventional deep active inference approaches face challenges due to limited environmental representation capacity and high computational cost in action selection. We propose a novel deep active inference framework that consists of a world model, an action model, and an abstract world model. The world model encodes environmental dynamics into hidden state representations at slow and fast timescales. The action model compresses action sequences into abstract actions using vector quantization, and the abstract world model predicts future slow states conditioned on the abstract action, enabling low-cost action selection. We evaluate the framework on object-manipulation tasks with a real-world robot. Results show that it achieves high success rates across diverse manipulation tasks and switches between goal-directed and exploratory actions in uncertain settings, while making action selection computationally tractable. These findings highlight the importance of modeling multiple timescale dynamics and abstracting actions and state transitions.
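
As a rough illustration of the vector quantization step mentioned above, the sketch below maps a continuous latent vector to its nearest codebook entry. The codebook values, the 2-D latent size, and the function name `quantize` are assumptions made for illustration only; the abstract does not specify how the action model's codebook is built or trained.

```python
import math


def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`.

    Nearness is plain Euclidean distance; a learned model may use a
    different metric or a trained encoder in front of this lookup.
    """
    best_index = min(
        range(len(codebook)),
        key=lambda i: math.dist(vector, codebook[i]),
    )
    return best_index, codebook[best_index]


# Hypothetical 2-D codes standing in for learned abstract actions.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# A continuous latent (e.g., an encoded action sequence) snaps to one code.
index, code = quantize((0.9, 0.2), codebook)
```

Because every action sequence collapses to one of a finite number of codes, candidate actions can in principle be enumerated rather than searched in a continuous space, which is consistent with the low-cost action selection the abstract describes.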

Abstract:
Autonomous ground mobile robots rely on their configuration characteristics to prevent tip-overs and collisions, ensuring safe navigation in complex environments. However, complex configurations with specially designed links and joints produce a higher dimensional workspace and bring significant challenges for path planning, especially in large-scale rough terrains. To address this, we propose a real-time multilevel terrain-aware path planning framework that integrates different levels of terrain awareness into the global and local layers. An implicit map representation is introduced at the global layer to enable efficient terrain analysis and path planning, while an iterative geometric evaluation is designed at the local layer to estimate configuration stability and improve path smoothness. By sharing the global layer information with the local layer, the framework enhances path planning efficiency and adaptability in complex environments. Its modular design supports diverse robot configurations and pathfinding algorithms, enabling effective autonomous navigation in large-scale 3-D terrains with online or offline maps. Simulations and real-world experiments demonstrated that our approach outperforms the state of the art across diverse environments, including uneven terrains, multilayered structures, and complex debris fields. The results highlighted that our approach provides faster and safer path planning, more accurate and robust configuration-stability estimation, and higher success rates in traversing complex 3-D environments.

Abstract:
Data gloves offer excellent portability and a strong ability to handle occluded movements, making them more advantageous than other methods for capturing complex hand motions in unstructured environments. However, the majority of existing hand-motion-capture gloves do not preserve visual features of the hand, which critically hinders their applicability for automatic pose annotation in RGB images. Here, we propose a data glove based on filament-sliding linear potentiometers (FLiPo), which can maintain finger appearance and ensure high accuracy as well as robustness, paving the way for automatic annotation. In FLiPo, fine filaments (diameter 0.1 mm) are deployed on finger skin to transmit joint arc length variations as well as preserve the hand's visual features, while linear potentiometers used to capture filament length changes are positioned on the arm. In addition, a quantitative occlusion scoring metric is proposed to evaluate the degree of finger occlusion caused by the device. Further, we experimentally analyze the nonlinearities induced by biaxial joint coupling and skin tissue artifact (STA)-related hysteresis, and employ a fully connected neural network to map arc length to joint angles, achieving a joint-angle MAE of 2.15°. Meanwhile, tests under challenging environmental conditions, including heat, moisture, and magnetic interference, are conducted to evaluate its stability. Finally, the system's capability for real-time pose capture with high accuracy, robustness, and low occlusion was demonstrated

Abstract:
Despite extensive investigations into the multiuser haptic-enabled robotic system (M-Hers), achieving scalable control design in the presence of nonpassive human operators remains a key challenge. This is primarily due to the increasing complexity of stability conditions and interaction coupling as the number of operators grows. In this study, we address this challenge in two steps. First, we introduce the individual interaction environment (IIE) to isolate the passivity violations, which facilitates the independent control design for each human-robot subsystem, thereby enhancing the scalability with respect to the number of subsystems. Second, within the IIE framework, we identify passivity-violating components caused by partners' active behaviors and propose a novel augmented tank-based controller (ATBC) to guarantee passive IIE while maintaining high rendering accuracy. Specifically, the ATBC employs an energy-related power regulation strategy to enhance interaction safety and a time-varying control gain to mitigate the negative effects of power regulation on rendering fidelity. We validated the proposed method through collaborative haptic tasks on a customized M-Hers composed of three robots in four different scenarios. Comparative studies demonstrate that our approach effectively ensures IIE passivity in the presence of active human behaviors, while ensuring high reproducibility and achieving a favorable balance between passivity and rendering accuracy.

Abstract:
We introduce the soft curvature and spectroscopy (SCANS) system: a versatile, electronics-free, fluidically actuated soft manipulator capable of assessing the spectral properties of objects either in hand or through pre-touch caging. This platform offers a wider spectral sensing capability than previous soft robotic counterparts. We perform a material analysis to explore optimal soft substrates for spectral sensing, and evaluate both pre-touch and in-hand performance. Experiments demonstrate explainable, statistical separation across diverse object classes and sizes (metal, wood, plastic, organic, paper, foam), with large spectral angle differences between items. Through linear discriminant analysis, we show that sensitivity in the near-infrared wavelengths is critical to distinguishing visually similar objects. These capabilities advance the potential of optics as a multi-functional sensory modality for soft robots.
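
For readers unfamiliar with the term, a spectral angle between two reflectance spectra is commonly computed as the angle between them when treated as vectors. The sketch below uses that standard definition; whether the authors use exactly this formulation is an assumption, and the sample spectra are made up for illustration.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra sampled at the same wavelengths."""
    if len(a) != len(b):
        raise ValueError("spectra must share the same wavelength samples")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("spectra must be non-zero")
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


# Hypothetical spectra: the second is a brighter copy of the first,
# the third has a different shape.
reference = [0.2, 0.4, 0.6]
brighter = [0.4, 0.8, 1.2]
different = [0.6, 0.4, 0.2]

angle_same_shape = spectral_angle(reference, brighter)
angle_other_shape = spectral_angle(reference, different)
```

The angle depends only on spectral shape and not on overall brightness, which is one reason it is a popular measure for comparing materials under varying illumination.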

Abstract:
Vibrotactile actuators are used in many different haptic devices, e.g., game controllers. These vibrotactile actuators are typically made of rigid materials. In this paper, we use soft pneumatic actuators known as Pneumatic Unit Cells (PUCs) to characterize the perceived intensity of vibrotactile stimuli when presented at the tip of the index finger. This study investigates how three parameters, namely stimulus pressure (4 to 30 kPa), inflation-deflation frequency (20 to 100 Hz), and actuator stiffness (determined by top layer thicknesses of 0.9 mm and 1.2 mm), influence the perceptual intensity of the stimuli. Psychophysical experiments involving 16 participants were conducted using the AEPsych toolbox. These reveal that all three parameters (pressure, frequency, and actuator stiffness) significantly affect perceptual intensity. The findings indicate that both pressure and frequency exhibit a positive main effect and a positive interaction effect on perceived vibrotactile intensity. Additionally, the results show that, for a given frequency, pressure variations produce more perceptually distinct stimuli than frequency variations for a given pressure. Finally, a vibrotactile stimulus presented on a less stiff PUC actuator was perceived as less intense than the same stimulus presented on a stiffer PUC actuator. Overall, this study provides key insights into the combined influence of pressure, frequency, and actuator stiffness on the perceived vibrotactile intensity.

Abstract:
This paper proposes a control method to mitigate initial force overshoot caused by contact surface position estimation errors. The proposed method adds a compensation term based on Adaptive Sliding Mode Control (ASMC) to a conventional admittance control structure. Also, to maintain position tracking performance under model uncertainty, a hierarchical control structure is designed by combining a Time-Delay Control (TDC)-based internal position controller. Simulation results show that the proposed method reduces the maximum overshoot from 76.6% to 32.1% compared to conventional admittance control, and shortens the peak time from 0.254 s to 0.126 s. Furthermore, the settling time is reduced from 3.51 s to 1.467 s at the 2% criterion and from 2.113 s to 0.832 s at the 5% criterion, improving transient response stability and convergence speed.
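
The transient metrics quoted above (maximum overshoot, peak time, and settling time at the 2% and 5% criteria) can be read off a sampled step response as in the sketch below. The second-order test signal and its parameters are illustrative assumptions and are unrelated to the paper's simulation; the function name `step_metrics` is likewise hypothetical.

```python
import math


def step_metrics(times, response, target, band):
    """Return (overshoot_percent, peak_time, settling_time) for a step response.

    `band` is the settling criterion as a fraction of the target (0.02 or 0.05).
    The settling time is the first sample after which the response stays
    inside the band; it is None if the response never settles.
    """
    peak_index = max(range(len(response)), key=lambda i: response[i])
    overshoot = max(0.0, (response[peak_index] - target) / target * 100.0)
    tolerance = band * abs(target)
    last_outside = -1
    for i, value in enumerate(response):
        if abs(value - target) > tolerance:
            last_outside = i
    settled = last_outside + 1
    settling_time = times[settled] if settled < len(times) else None
    return overshoot, times[peak_index], settling_time


# Illustrative underdamped second-order response (assumed zeta and wn).
zeta, wn = 0.5, 10.0
wd = wn * math.sqrt(1.0 - zeta ** 2)
times = [k * 0.001 for k in range(3001)]
response = [
    1.0 - math.exp(-zeta * wn * t)
    * (math.cos(wd * t) + zeta / math.sqrt(1.0 - zeta ** 2) * math.sin(wd * t))
    for t in times
]

overshoot, peak_time, settle_2 = step_metrics(times, response, 1.0, 0.02)
_, _, settle_5 = step_metrics(times, response, 1.0, 0.05)
```

The 2% band is narrower than the 5% band, so the 2% settling time can never be shorter than the 5% one, matching the ordering of the figures reported in the abstract.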

Abstract:
Accurate relative pose estimation is critical for autonomous close-proximity satellite operations, such as on-orbit servicing and debris removal. However, this task remains highly challenging for unknown, non-cooperative targets due to the unavailability of geometric priors, scarce training data, and the extreme variations in lighting and backgrounds inherent to the space environment. Existing approaches for pose estimation can be categorized into: 1. model-based methods, which achieve higher accuracy but assume access to 3D CAD models of target satellites; and 2. model-free methods, which can be used for unknown satellites but typically suffer from reduced accuracy and robustness. To bridge this gap, this paper introduces a sample-efficient, model-free pose estimation framework that achieves high accuracy and robustness for unknown satellites while demonstrating strong generalization under the uncertain conditions inherent to the space environment. The proposed method utilizes a novel appearance-aware 3D reconstruction to generate a satellite model from images, accounting for different lighting conditions during training. This model is then used to generate a large, diverse dataset to train a pose predictor network (stage 1). The predicted pose is refined using the 3D reconstruction by utilizing the appearance information of the target image along with differentiable rendering (stage 2). Evaluated across the SPEED+ and URSO Soyuz datasets, our approach achieves state-of-the-art accuracy and proves highly robust to test-time domain shifts, notably reducing rotation error by 80% on the challenging URSO Soyuz dataset.

Abstract:
Careful robot manipulation in everyday cluttered environments requires an accurate understanding of the 3D scene, in order to grasp and place objects stably and reliably and to avoid colliding with other objects. In general, we must construct such a 3D interpretation of a complex scene based on limited input, such as a single RGB-D image. We describe SceneComplete, a system for constructing a complete, segmented, 3D model of a scene from a single view. SceneComplete is a novel pipeline for composing general-purpose pretrained perception modules (vision-language, segmentation, image-inpainting, image-to-3D, visual-descriptors and pose-estimation) to obtain highly accurate results. We demonstrate its accuracy and effectiveness with respect to ground-truth models in a large benchmark dataset and show that its accurate whole-object reconstruction enables robust grasp proposal generation, including for a dexterous hand. We release the code and additional results on our website: https://scenecomplete.github.io/.

Abstract:
Visual inspection of confined spaces such as aircraft wings is ergonomically challenging for human mechanics. This work presents a novel crane robot that can travel the entire span of the aircraft wing, enabling mechanics to perform inspection from outside of the confined space. However, teleoperation of the crane robot can still be a challenge due to the need to avoid obstacles in the workspace and potential oscillations of the camera payload. The main contribution of this work is to exploit the differential flatness of the crane-robot dynamics for designing reduced-oscillation, collision-free time trajectories of the camera payload for use in teleoperation. Autonomous experiments verify the efficacy of the approach, which removes 89% of undesired oscillations. Furthermore, teleoperation experiments in which 12 participants performed an inspection task demonstrate that the controller eliminated collisions (from 33% to 0%) when the proposed trajectory selection was used, compared to the case without it. Moreover, even discounting the failures due to collisions, the proposed approach improved task efficiency by 18.7% when compared to the case without it.

Abstract:
Robotic manipulation of objects in cluttered dynamic scenes is challenging for two reasons. Object detection and localization are complex due to partial occlusions and high variability in the object classes, and manipulation in tight spaces is difficult due to potential collisions. The present letter focuses on the low-level control of the non-prehensile pushing action aimed at moving planar objects of generic shape along a given path with an assigned time law. Based on the continuous and nonlinear dynamics of the system, we propose a nonlinear model predictive controller (NMPC), which avoids the need for linearization and, thus, the hybrid dynamics arising from it. An extensive comparison with a state-of-the-art linear MPC demonstrates that the NMPC can successfully react to more general disturbances, outperforming the linear one. Experimental results confirm the effectiveness of the method in a task where a robot is required to grasp fruits in a container with other obstructing objects (shown in the attached video).

Abstract:
The inclusion of robots in daily life presents significant ethical, legal, and social implications (ELSI) that stem from their interactions with humans. Social robots are able to operate in environments that are rich in cultural norms, emotions, and social cues, leading to critical questions about privacy, trust, and safety. We explore the ways in which the interdisciplinary field of robot ethics can tackle these challenges using a hybrid methodological approach that incorporates thought experiments and empirical research. Ethical dilemmas can be systematically analyzed using thought experiments, and empirical methods can provide real-world insights to validate and refine these theoretical frameworks. In the paper, the use of living labs as dynamic environments for testing and integrating ethical design principles into robot design is emphasized, ensuring that robots comply with ethical expectations and legal standards.

Abstract:
Nonlinear model predictive locomotion controllers based on the reduced centroidal dynamics are nowadays ubiquitous in legged robots. These schemes, even if they assume an inherent simplification of the robot's dynamics, were shown to endow robots with a step-adjustment capability in reaction to small pushes, and in the case of uncertain parameters, such as unknown payloads, they were shown to provide some practical, albeit limited, robustness. In this work, we provide rigorous certificates of their closed-loop stability by reformulating the online centroidal MPC controller. This is achieved thanks to a systematic procedure inspired by the machinery of adaptive control, together with ideas coming from Control Lyapunov Functions. Our reformulation, in addition, provides robustness for a class of unmeasured constant disturbances. To demonstrate the generality of our approach, we validated our formulation on a new-generation humanoid robot, the 56.7 kg ergoCub, as well as on the commercially available 21 kg quadruped robot Aliengo.

Abstract:
The human palm is a remarkable and highly functional part of the hand that significantly contributes to dexterity, grasp versatility, and overall manipulation capability. The metacarpophalangeal (MCP) joints of the palm facilitate movement of the fingers for flexion, extension, abduction, adduction, and limited circumduction, functions that are challenging to replicate in robotic hands with a simple design. In this paper, we propose a single-actuated folding mechanism as the metacarpals and passive rotating MCP joints to perform abduction, adduction, and circumduction of the human hand. The proposed anthropomorphic hand, called Folding Hand, has a reconfigurable palm and five underactuated tendon-driven fingers. The design of the hand is compact and the price is low, with all six actuators and five sensors on the hand costing less than 180. Additionally, a methodology has been developed to comprehensively analyze the grasping capacity by combining the grasping quality and the grasping workspace. The experimental results show that the folding palm mechanism and the compliant rotating finger base can replicate human hand capabilities and perform precision and in-hand manipulation tasks.

Abstract:
Lower limb exoskeletons assist users by supporting joint movements. Since joint motion patterns vary depending on how the user moves, accurately recognizing the type of movement (locomotion mode) is crucial for controlling the exoskeleton and ensuring user safety. Inspired by how humans use multiple types of sensory information to control movement, we developed a multi-modal locomotion mode recognition (LMR) system that uses both mechanical and visual sensor data to identify locomotion modes. Our approach utilizes two fusion methods: intermediate fusion, which combines the data in the form of features, and late fusion, which integrates the sensor data by averaging the recognition results from each sensor. By fusing these two different modalities, the prediction accuracy improved by an average of 11.7% with the test data. Through comparisons with uni-modal LMR systems that rely on a single type of sensor data for locomotion mode recognition, we found that the improved performance of the multi-modal LMR system is due to the visual information's ability to generalize different gait patterns across users and the mechanical sensor data's consistency within the same classes.
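
The late fusion described above, averaging the recognition results from each sensor, can be sketched as follows. The locomotion mode names and probability values are hypothetical placeholders, and equal weighting of the two modalities is an assumption; the abstract does not give these details.

```python
def late_fusion(prob_mechanical, prob_visual):
    """Average two per-class probability vectors and pick the top class."""
    if len(prob_mechanical) != len(prob_visual):
        raise ValueError("both classifiers must score the same classes")
    fused = [(m + v) / 2.0 for m, v in zip(prob_mechanical, prob_visual)]
    predicted = max(range(len(fused)), key=lambda i: fused[i])
    return fused, predicted


# Hypothetical locomotion modes and classifier outputs.
MODES = ["level walking", "stair ascent", "stair descent"]
mechanical_scores = [0.50, 0.30, 0.20]  # mechanical-sensor classifier
visual_scores = [0.20, 0.70, 0.10]      # vision classifier

fused_scores, mode_index = late_fusion(mechanical_scores, visual_scores)
predicted_mode = MODES[mode_index]
```

In this example the mechanical classifier alone would pick the first mode, but the fused decision follows the more confident visual classifier. Intermediate fusion differs in that features from both sensors are combined before a single classifier makes the decision.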

Abstract:
Mobile robot motion planning heavily relies on grid-based occupancy maps, while existing works require high memory usage and expensive updating overhead. In this work, we propose a memory-lite grid-block data structure and an efficient map updating algorithm for LiDAR-based online exploration-oriented planning. To accelerate query operations and reduce memory usage, we adopt the grid-block-based map as the basic data structure and propose to dynamically read and write blocks around the sensor. For each block, the occupied grids and frontier grids are maintained in two separate lists, serving as key grids for the map update. Instead of updating free grids by ray casting, we propose a key grids expansion algorithm to avoid repetitively querying grids on casted beams. The proposed algorithm not only speeds up the occupancy map update but also detects the frontier grids, which are crucial for exploration tasks, without extra computation. We compare the proposed method with state-of-the-art mapping methods on the KITTI dataset and a self-collected dataset. The proposed method outperforms other methods in terms of memory usage and map update computation. It is also deployed on a UAV for a real-world exploration test. The source code is released at: https://github.com/NKU-MobFly-Robotics/RipNeon.
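
A minimal sketch of a block-partitioned grid map that keeps occupied and frontier cells in separate per-block containers is given below. It is not the released implementation: the class name, the block size, and the use of sets instead of lists are all assumptions, and the key grids expansion algorithm itself is not reproduced.

```python
from collections import defaultdict


class GridBlockMap:
    """Sparse map: cells are grouped into blocks that are created on demand."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        # Each block stores its key cells in two separate containers.
        self.blocks = defaultdict(lambda: {"occupied": set(), "frontier": set()})

    def block_key(self, cell):
        """Integer index of the block that contains `cell`."""
        return tuple(c // self.block_size for c in cell)

    def mark_occupied(self, cell):
        block = self.blocks[self.block_key(cell)]
        block["frontier"].discard(cell)  # an occupied cell is no longer a frontier
        block["occupied"].add(cell)

    def mark_frontier(self, cell):
        block = self.blocks[self.block_key(cell)]
        if cell not in block["occupied"]:
            block["frontier"].add(cell)

    def is_occupied(self, cell):
        # .get() avoids allocating a block just to answer a query.
        block = self.blocks.get(self.block_key(cell))
        return block is not None and cell in block["occupied"]


grid_map = GridBlockMap(block_size=16)
grid_map.mark_frontier((17, 3, 0))
grid_map.mark_occupied((17, 3, 0))
```

Only blocks that have received a key cell consume memory, and a query touches a single block, which reflects the general motivation for block-based maps stated in the abstract.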

Abstract:
Autonomous parking is a critical component for achieving safe and efficient urban autonomous driving. However, unstructured environments and dynamic interactions pose significant challenges to autonomous parking tasks. To address this problem, we propose SEG-Parking, a novel end-to-end offline reinforcement learning (RL) framework to achieve interaction-aware autonomous parking. Notably, a specialized parking dataset is constructed for parking scenarios, which include those without interference from the opposite vehicle (OV) and complex ones involving interactions with the OV. Based on this dataset, a goal-conditioned state encoder is pretrained to map the fused perception information into the latent space. Then, an offline RL policy is optimized with a conservative regularizer that penalizes out-of-distribution actions. Extensive closed-loop experiments are conducted in the high-fidelity CARLA simulator. Comparative results demonstrate the superior performance of our framework with the highest success rate and robust generalization to out-of-distribution parking scenarios. The related dataset and source code are available at https://github.com/Yeulerzzz/SEG-Parking.

Abstract:
Dynamic indoor environments pose significant challenges for autonomous robots, as objects frequently move and scenes continuously change, requiring robust scene representation and adaptive navigation strategies. In this work, we introduce DSSM-SG, a dynamic open-vocabulary 3D scene graph framework enhanced with spatial-semantic memory, to support complex language instruction parsing and goal navigation in dynamic environments. First, we construct a multi-layered scene graph by combining waypoint topology with semantic object information, and propose a viewpoint-based mechanism to model object dynamics and detect scene changes, enabling more precise semantic-geometric representation. Second, we design an efficient incremental graph update strategy that adapts to object-level dynamics and navigation-observed obstacles, thereby maintaining graph consistency and alleviating mismatch during re-navigation. Finally, we introduce a subgraph generation and matching approach driven by large language models, significantly improving the system's ability to interpret and ground ambiguous goal descriptions. Experimental results demonstrate that DSSM-SG achieves superior performance in scene graph accuracy, update efficiency, and language goal navigation success compared to existing baselines in dynamic indoor environments.

Abstract:
In this work, we introduce MO-SeGMan, a Multi-Objective Sequential and Guided Manipulation planner for highly constrained rearrangement problems. MO-SeGMan generates object placement sequences that minimize both replanning per object and robot travel distance while preserving critical dependency structures with a lazy evaluation method. To address highly cluttered, non-monotone scenarios, we propose a Selective Guided Forward Search (SGFS) that efficiently relocates only critical obstacles to feasible relocation points. Furthermore, we adopt a refinement method for adaptive subgoal selection to eliminate unnecessary pick-and-place actions, thereby improving overall solution quality. Extensive evaluations on nine benchmark rearrangement tasks demonstrate that MO-SeGMan generates feasible motion plans in all cases, consistently achieving faster solution times and superior solution quality compared to the baselines. These results highlight the robustness and scalability of the proposed framework for complex rearrangement planning problems.

Abstract:
This paper presents a novel end-to-end trajectory planning framework that integrates LiDAR-based perception with trajectory optimization, enabling safe and efficient navigation in dynamic environments without relying on semantic detection or explicit kinematic modeling. Learning-based dynamic collision avoidance methods often depend on reinforcement learning, which introduces challenges related to training efficiency, model generalization, and deployment safety. To address these limitations, we propose a lightweight map representation for temporally continuous dynamic obstacles, facilitating unsupervised network training with physically simulated data. Additionally, a repulsion-based adjustment method built upon motion primitives allows adaptive trajectory planning in highly crowded scenarios where no feasible trajectory exists, balancing target-reaching objectives with motion safety. Extensive simulations and real-world experiments demonstrate that the proposed framework achieves millisecond-level planning latency while ensuring high safety, trajectory smoothness, and flight efficiency. The demonstration video is available on the project website: https://swift520.github.io/Dynamic-Planner/.

Abstract:
Underwater concrete infrastructure plays a crucial role in energy and water systems. However, it requires regular inspections to ensure structural integrity. Remotely Operated Vehicles (ROVs) offer a safer and more cost-effective alternative to diver-based inspections. The data collected during inspections often require extensive post-mission processing, either manually or through computationally intensive algorithms. This limitation makes real-time damage detection during inspections impossible. In this study, we present a real-time image-level domain alignment pipeline suitable for deployment on resource-constrained hardware. It combines image enhancement with crack detection using a YOLO11n-seg model fine-tuned on a publicly available aerial concrete crack dataset. The model was quantized and deployed on a Jetson Nano, which was connected to an ROV for real-time inference. To reduce the domain gap between the raw underwater images captured by the ROV and the aerial training data, a Contrast Limited Adaptive Histogram Equalization (CLAHE)-based strategy was applied. Field tests were conducted on a submerged concrete embankment in a turbid lake environment. A validation dataset was developed to evaluate performance offline and is publicly available.

Abstract:
Extrinsic calibration between LiDAR and camera is a crucial step in multi-sensor fusion, where targetless approaches have attracted increasing attention for their flexibility and reusability. However, existing methods still suffer from three major limitations: time-consuming data preparation, lack of robustness under sparse single-frame input, and limited generalization across diverse LiDAR architectures. We propose MIND-Calib, a truly single-frame, targetless calibration framework. The method generates depth and intensity images through virtual multi-view projection, and performs image-domain completion and back-projection to densify the point cloud and construct sub-pixel 2D-3D correspondences. High-precision extrinsics are then estimated via dual-channel cross-modal matching that leverages both depth and intensity modalities. Experiments on three representative LiDAR types (MEMS-based, solid-state, and mechanical spinning) as well as on public datasets demonstrate an average accuracy of 2.85 cm (with respect to an average scene depth of 40 meters) in translation and 0.20° in rotation. More importantly, MIND-Calib not only achieves true single-frame calibration without any additional preparation, but also maintains stable accuracy under sparse inputs and exhibits strong generalization and robustness across devices and challenging environments.

Abstract:
This paper presents a magnet-based robotic skin that integrates a multilayer soft lattice with distributed Hall-effect sensor arrays and a tactile super-resolution model. External contact forces are converted to magnetic field changes by embedded permanent magnets, and the lattice spreads these changes across the sensing domain. This gives each sensor a large, overlapping receptive field and enables a large sensing area with minimal blind spots. Lattice parameters are tunable, enabling joint adjustment of mechanical compliance and transduction characteristics. An implicit modeling workflow and selective laser sintering (SLS) 3D printing support rapid fabrication of conformal, high-complexity structures. A convolutional neural network trained on experimental measurements estimates contact location and normal force in real time. Experiments validate localization accuracy and indicate scalability to larger surfaces, suggesting applicability to whole-body robotic skin and safe human-robot interaction.

Abstract:
Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robotic-specific multisensory information, which is crucial for achieving complex and contact-rich control. To this end, we introduce a multisensory language-action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that innovatively repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further enhance MLA's understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. For evaluation, the MLA model outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% in complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations. Project website: sites.google.com/view/robot-mla.

Abstract:
3D reconstruction serves as the foundational layer for numerous robotic perception tasks, including 6D object pose estimation and grasp pose generation. Modern 3D reconstruction methods for objects can produce visually and geometrically impressive meshes from multi-view images, yet standard geometric evaluations do not reflect how reconstruction quality influences downstream tasks such as robotic manipulation performance. This paper addresses this gap by introducing a large-scale, physics-based benchmark that evaluates 6D pose estimators and 3D mesh models based on their functional efficacy in grasping. We analyze the impact of model fidelity by generating grasps on various reconstructed 3D meshes and executing them on the ground-truth model, simulating how grasp poses generated with an imperfect model affect interaction with the real object. This assesses the combined impact of pose error, grasp robustness, and geometric inaccuracies from 3D reconstruction. Our results show that reconstruction artifacts significantly decrease the number of grasp pose candidates but have a negligible effect on grasping performance given an accurately estimated pose. Our results also reveal that the relationship between grasp success and pose error is dominated by spatial error, and even a simple translation error provides insight into the success of the grasping pose of symmetric objects. This work provides insight into how perception systems relate to object manipulation using robots.

Abstract:
Humanoid table tennis (TT) demands rapid perception, proactive whole-body motion, and agile footwork under strict timing, capabilities that remain difficult for end-to-end control policies. We propose a reinforcement learning (RL) framework that maps ball-position observations directly to whole-body joint commands for both arm striking and leg locomotion, strengthened by predictive signals and dense, physics-guided rewards. A lightweight learned predictor, fed with recent ball positions, estimates future ball states and augments the policy's observations for proactive decision-making. During training, a physics-based predictor supplies precise future states to construct dense, informative rewards that lead to effective exploration. The resulting policy attains strong performance across varied serve ranges (hit rate ≥ 96% and success rate ≥ 92%) in simulation. Ablation studies confirm that both the learned predictor and the predictive reward design are critical for end-to-end learning. Deployed zero-shot on a physical Booster T1 humanoid with 23 revolute joints, the policy produces coordinated lateral and forward-backward footwork with accurate, fast returns, suggesting a practical path toward versatile, competitive humanoid TT. We have open-sourced our RL training code at: https://github.com/purdue-tracelab/TTRL-ICRA2026

Abstract:
Lightweight industrial robots are increasingly deployed alongside humans to perform diverse and intelligent industrial tasks. A major concern with these robots is energy efficiency, driven by rising operational costs and environmental impacts. A growing contributor to energy use is the heavy computational workload of their electronic components. Although motion configurations and computational load are often interdependent, current state-of-the-art energy optimization methods tend to address them separately, focusing on individual consumption. In this work, we demonstrate that computational energy is comparable to mechanical energy and show how their dependency affects overall consumption in a Franka Emika Panda robot equipped with a multi-core processing system and two depth cameras. Building on this understanding, we propose a Bayesian approach for the joint optimization of mechanical motion and computational frequency in a robotic arm. Experiments show that the proposed method enables the Franka arm to reduce energy use by 3.7% in pick-and-place tasks and 6.2% in sorting tasks, compared to methods that optimize locomotion and computation separately.

Abstract:
In medical robotics, biological shape deformation resulting from arbitrary tool-tissue interaction commonly occurs and motivates the need in microsurgery to predict the new geometry of tissue structures. However, handling deformation is challenging due to the lack of a general prediction model for varied surgical scenarios, complex tissue properties, and myriad surgical tool geometries. Limited intraoperative sensors to observe micro-level deformations further compound this difficulty. To solve this problem, this paper proposes the first geometric data-driven framework that uses only the robot palpation tooltip movement and a pre-deformed surface, observed with an optical coherence tomography (OCT) sensor, to predict the tissue deformation. A neural network is trained to learn tool-tissue physics and predict the shape from the given robot-tool configurations represented as orientations and displacements. We conducted realistic experiments to verify the models using phantoms of various stiffness and three ex vivo tissue types, with average prediction errors of approximately 0.15 mm and 0.52 mm, respectively. This framework provides a general platform for collecting micro-scale palpation data under OCT and can be generalized to soft-tissue related studies in biomedical engineering and surgical robotics research.

Abstract:
Basketball players catch fast passes, and porters unload goods with apparent ease. These actions demonstrate how humans rely on intelligent regulation strategies to drive muscle activity. Replicating similar dynamic responses and strong impact absorption in robotics, however, remains a major challenge. Classical impedance control requires a trade-off between compliance and stability in high-impact interactions, which limits dynamic performance. To address this issue, this paper proposes a Stroke-based Variable Damping Model (SVDM), which adjusts the damping coefficient adaptively according to the position error relative to the contact point. In addition, a Force Attenuation (FA) strategy is applied to the external forces injected into SVDM, resulting in the SVDM with Force Attenuation (FA-SVDM). Based on human biomechanical principles, we fabricated a 4-DOF robotic manipulator using 3D printing technology. Using FA-SVDM, the manipulator successfully captured a 1 kg rigid sphere falling freely from 0.8 m. Under identical conditions, it exhibits superior performance compared to various fixed-damping configurations. We further developed a 6-DOF robotic manipulator equipped with a dexterous hand in the widely used MuJoCo engine, employing quadratic programming (QP) for pre-contact trajectory tracking and FA-SVDM for post-contact energy dissipation, ultimately achieving human-like compliant capture of high-momentum flying objects using a single arm with a half-prehensile strategy.

Abstract:
In this paper, we present a hardware-control co-design approach that enables efficient and versatile roller skating on quadrupedal robots equipped with passive wheels. Passive-wheel skating reduces leg inertia and improves energy efficiency, particularly at high speeds. However, the absence of direct wheel actuation tightly couples mechanical design and control. To unlock the full potential of this modality, we formulate a bilevel optimization framework: an upper-level Bayesian Optimization searches the mechanical design space, while a lower-level Reinforcement Learning trains a motor control policy for each candidate design. The resulting design-policy pairs not only outperform human-engineered baselines, but also exhibit versatile behaviors such as hockey stop (rapid braking by turning sideways to maximize friction) and self-aligning motion (automatic reorientation to improve energy efficiency in the direction of travel), offering the first system-level study of dynamic skating motion on quadrupedal robots.

Abstract:
Tracking the 6DoF pose of previously unseen objects from monocular RGB videos is crucial for robotic manipulation, yet remains challenging due to depth ambiguity and limited object-centric visual context. Existing trackers often rely on accurate depth sensors, which constrains deployment in low-cost settings, while substituting monocular pseudo-depth frequently introduces geometric errors that reduce tracking robustness. To this end, we propose MGS-Track, an object-centric online tracking and reconstruction framework that combines learning-based geometric priors with differentiable 3D Gaussian Splatting (3DGS). Specifically, we first introduce a mask-augmented DUSt3R network (DUSt3R-M) to establish pairwise correspondences and predict point maps, which serve as geometric priors for initializing and guiding an online 3DGS representation. We then jointly optimize Gaussian parameters and 6DoF object poses in a coarse-to-fine manner, enabling robust tracking and high-fidelity reconstruction. To control model growth and maintain efficiency over time, we further introduce adaptive Gaussian management with capacity-aware selection and mask-consistent pruning. Experiments on YCBInEOAT and HO3D show that MGS-Track consistently outperforms competitive monocular baselines on both pose tracking and object reconstruction in challenging object-centric scenarios.

Abstract:
In informed search-based path planning, heuristic functions that incorporate problem knowledge are essential for guiding the search and improving efficiency. The accuracy and computational cost of these heuristics are therefore critical to performance. However, accuracy and computational efficiency are often contradictory, making it difficult to select an appropriate heuristic for a given problem. In this paper, we present CoIT (Cooperative Informed Tree), an almost-asymptotically optimal asymmetric bi-directional planning algorithm designed to address these challenges. CoIT introduces a multi-resolution and multi-heuristic queue cooperation mechanism between forward and reverse searches: the forward search interacts with the reverse search to provide cooperative information exchange, which enhances both local and global edge screening. This cooperation improves the accuracy of the reverse search, while multi-resolution exploration enables lazy edge validation in the forward search, thereby reducing planning time. We validate CoIT on high-dimensional benchmark problems as well as simulated and real surgical robot planning tasks. Experimental results demonstrate that CoIT achieves higher accuracy and significantly lower planning time compared with state-of-the-art planners.

Abstract:
Modular continuum soft arms represent an emerging class of robotic systems characterized by flexible, highly deformable structures. Designing shape controllers for these arms poses significant challenges due to their modeling complexity and hyper-redundant nature. Our goal is to develop a scalable control framework for modular arms, where each module is self-contained. Starting from distributed control theory, we assign a collaborative controller to each soft module. Through collaboration among modules, the framework enables the system to achieve the desired tip position and shape. Each controller relies on the minimal model, such as the Constant Curvature, of its self-contained module and the local transformation shared by adjacent modules. We present three kinematic control strategies - Consensus, Bipartite Consensus, and Formation Control - for a modular continuum soft arm, that progressively relax constraints to achieve more complex, adaptable shapes. In addition, we develop a decentralized curvature-based dynamic controller to manage dynamic coupling among modules. The validation is carried out through numerical analysis and dynamic simulations of soft arms with varying numbers of modules.

Abstract:
We propose a fully data-driven, Koopman-based framework for statistically robust control of discrete-time nonlinear systems with linear embeddings. Establishing a connection between the Koopman operator and contraction theory, it offers distribution-free probabilistic bounds on the state tracking error under Koopman modeling uncertainty. Conformal prediction is employed here to rigorously derive a bound on the state-dependent modeling uncertainty throughout the trajectory, ensuring safety and robustness without assuming a specific error prediction structure or distribution. Unlike prior approaches that merely combine conformal prediction with Koopman-based control in an open-loop setting, our method establishes a closed-loop control architecture with formal guarantees that explicitly account for both forward and inverse modeling errors. Also, by expressing the tracking error bound in terms of the control parameters and the modeling errors, our framework offers a quantitative means to formally enhance the performance of arbitrary Koopman-based control. We validate our method both in numerical simulations with the Dubins car and in real-world experiments with a highly nonlinear flapping-wing drone. The results demonstrate that our method indeed provides formal safety guarantees while maintaining accurate tracking performance under Koopman modeling uncertainty.

Abstract:
This letter presents a comprehensive comparative study of Incremental Nonlinear Dynamic Inversion (INDI) and standard Nonlinear Dynamic Inversion (NDI) for smooth trajectory tracking on Samara Seed-Inspired Single-Actuator Monocopters (SAM). While prior work on SAMs has largely focused on hover stabilization, smooth robust control for aggressive translational motion remains a largely uncharted frontier. Leveraging the precession-prone dynamics inherent to the SAM, we analyze the tracking performance of INDI across varying flight speeds, trajectories, and wing morphologies (long, short, ultralight). Our experimental results demonstrate that INDI on the long wing consistently achieves lower angular acceleration tracking errors, reducing mean and RMS errors by up to 13.8% and 13.0%, respectively, while also improving motor efficiency with up to 8.4% less PWM usage compared to NDI. Additionally, INDI produces tighter and more stable body yaw rates (± 0.1 Hz) and delivers up to 65% improvement in position tracking over traditional purely attitude control (ATT). Finally, even under severe actuation constraints with an ultralight wing operating at reduced thrust, INDI maintains robust performance, validating its resistance to precession and its robust control of highly under-actuated SAMs.

Abstract:
Wheeled Inverted Pendulum (WIP) systems offer agile mobility but are challenging to control due to their unstable and underactuated dynamics. To address these limitations, we develop a Wheeled Inverted Pendulum with a Fan (WIPF), which incorporates a fan-generated bidirectional thrust force as an additional control input. This makes the system fully actuated and enhances stability; however, the limited bandwidth of the fan thrust introduces control challenges. In this letter, we propose a Frequency-Shaped Model Predictive Control (FSMPC) design framework that accounts for actuator dynamics in the optimization process, and is expandable to other systems with different actuator dynamics. The proposed FSMPC can provide improved stability by penalizing high-frequency input using the frequency response of the fan. The nonlinear solver enables control input updates at rates exceeding 1 kHz, meeting real-time control requirements. The performance of FSMPC with the proposed design framework is compared through simulations and experiments against a Linear Quadratic Regulator (LQR), a standard Model Predictive Controller (MPC), and a Frequency-Shaped LQR (FSLQR) that does not consider fan dynamics or the input constraint. The results demonstrate that FSMPC achieves improved stability and robustness compared to other controllers.

Abstract:
Controlling a team of robots in a coordinated manner is challenging, as a centralized approach (where all computation is done on a central machine) has poor scalability and a globally-referenced external localization system may not always be available. In this work, we consider the problem of range-aided decentralized localization and formation control. In such a setting, each robot estimates its relative pose by combining data only from onboard odometry sensors and distance measurements to other robots in the team. Additionally, each robot calculates the control inputs necessary to collaboratively navigate an environment to accomplish a specific task, for example, moving in a desired formation while monitoring an area. We present a block coordinate descent approach to localization that does not require strict coordination between the robots. We present a novel formulation for formation control as inference on factor graphs that takes into account the state estimation uncertainty and can be solved efficiently. Our approach to range-aided localization and formation-based navigation is completely decentralized, does not require specialized trajectories to maintain formation, and achieves decimeter-level positioning and formation control accuracy. We demonstrate our approach through multiple real experiments involving formation flights in diverse indoor and outdoor environments.

Abstract:
The advent of end-to-end autonomy stacks, often lacking interpretable intermediate modules, has placed an increased burden on ensuring that the final output, i.e., the motion plan, is safe in order to validate the safety of the entire stack. This requires a safety monitor that is both complete (able to detect all unsafe plans) and sound (does not flag safe plans). In this work, we propose a principled safety monitor that leverages modern multi-modal trajectory predictors to approximate forward reachable sets (FRS) of surrounding agents. By formulating a convex program, we efficiently extract these data-driven FRSs directly from the predicted state distributions, conditioned on scene context such as lane topology and agent history. To ensure completeness, we leverage conformal prediction to calibrate the FRS and guarantee coverage of ground-truth trajectories with high probability. To preserve soundness in out-of-distribution (OOD) scenarios or under predictor failure, we introduce a Bayesian filter that dynamically adjusts the FRS conservativeness based on the predictor's observed performance. We then assess the safety of the ego vehicle's motion plan by checking for intersections with these calibrated FRSs, ensuring the plan remains collision-free under plausible future behaviors of others. Extensive experiments on the nuScenes dataset show our approach significantly improves soundness while maintaining completeness, offering a practical and reliable safety monitor for learned autonomy stacks.
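
As an illustration of the calibration idea mentioned in this abstract, the following is a minimal sketch of standard split conformal prediction, not the authors' FRS construction. It assumes a hypothetical nonconformity score (for example, the distance between a predicted and an observed agent position on held-out data) and returns the margin by which a predicted set would be inflated to reach roughly 1 - alpha coverage. The function name and the score definition are assumptions made for this example.

```python
import math


def conformal_radius(calibration_scores, alpha):
    """Split-conformal quantile of nonconformity scores.

    calibration_scores: scores from held-out data (hypothetical here, e.g.
        prediction-to-ground-truth distances in metres).
    alpha: target miscoverage rate, so coverage is at least 1 - alpha under
        exchangeability of calibration and test scores.
    """
    n = len(calibration_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: only the trivial
        # (unbounded) set carries the guarantee.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


if __name__ == "__main__":
    scores = [5.0, 1.0, 4.0, 2.0, 7.0, 3.0, 6.0]
    print(conformal_radius(scores, alpha=0.25))
```

With few calibration samples and a small alpha the function returns infinity, which reflects the fact that the coverage guarantee cannot be met by any finite margin in that regime.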

Abstract:
Robotic manipulators are increasingly used in diverse applications, ranging from industrial automation to human-centered tasks such as grocery picking and packaging, where they are often required to perform sequences of tasks while maintaining motion optimality and collision-free operation over long horizons. This type of problem is known as the Robotic Task Sequencing Problem. Most existing works address this problem by reducing it to the Traveling Salesman Problem (TSP) within a purely Euclidean framework, neglecting the robot's inherently non-Euclidean toroidal $\mathcal{C}$-space topology. This simplification limits the selection of feasible configurations and may lead to failures or to suboptimal or detoured motions. In this paper, we propose a robotic task sequencing problem solver that incorporates the robot's natural $\mathcal{T}^n$ topology and joint limits.
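
To make the contrast between Euclidean and toroidal configuration-space distances concrete, here is a small sketch that is not taken from the paper. It assumes every joint is revolute and wraps at 2π, and it deliberately ignores joint limits, which the proposed solver does account for. Both function names are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def euclidean_distance(q_a, q_b):
    """Straight-line distance that treats joint angles as points in R^n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q_a, q_b)))


def torus_distance(q_a, q_b):
    """Shortest distance on the flat n-torus (each joint wraps at 2*pi)."""
    total = 0.0
    for a, b in zip(q_a, q_b):
        d = abs(a - b) % TWO_PI
        d = min(d, TWO_PI - d)  # take the shorter way around the circle
        total += d * d
    return math.sqrt(total)


if __name__ == "__main__":
    q_start = [3.0, 0.0]
    q_goal = [-3.0, 0.0]
    print(euclidean_distance(q_start, q_goal))  # long way through zero
    print(torus_distance(q_start, q_goal))      # short way across +/- pi
```

For the two configurations in the usage example, the Euclidean metric reports a 6 rad move while the toroidal metric reports about 0.28 rad, which is the kind of gap that can change the visiting order chosen by a TSP-style sequencer.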

Abstract:
Automating large-scale manufacturing in domains like timber construction requires multi-robot systems to manage tightly coupled spatiotemporal constraints, such as collision avoidance and process-driven deadlines. This paper introduces LASER (Level-based Asynchronous Scheduling and Execution Regime), a complete framework for scheduling and executing complex assembly tasks, demonstrated on a screw-press gluing application for timber slab manufacturing. Our central contribution is to integrate a barrier-based mechanism into a constraint programming (CP) scheduling formulation that partitions tasks into spatiotemporally disjoint sets, which we define as levels. This structure enables robots to execute tasks in parallel and asynchronously within a level, synchronizing only at level barriers, which guarantees collision-free operation by construction and provides robustness to timing uncertainties. To solve this formulation for large problems, we propose two specialized algorithms: an iterative temporal-relaxation approach for heterogeneous task sequences and a bi-level decomposition for homogeneous tasks that balances workload. We validate the LASER framework by fabricating a full-scale 2.4 m × 6 m timber slab with a two-robot system mounted on parallel linear tracks, successfully coordinating 108 subroutines and 352 screws under tight adhesive time windows. Computational studies show our method scales steadily with size compared to a monolithic approach.

Abstract:
Integrating multiple locomotion modes on a single platform has become an active focus in pursuit of versatile and efficient movement. This paper introduces a novel monocycle robot with four legs, named Ringbot Quad, combining the wheeled and legged mechanisms. The Ringbot Quad is developed as a unique monocycle mechanism that replaces the traditional monocycle drivetrain with four individually actuated driving modules, each topped with an articulated leg. The four legs can be used for balance and steering in driving mode and for quadruped walking that fully supports the body with the legs. By switching between two distinct locomotion modes, it can navigate various terrains and overcome obstacles typically challenging for either mechanism alone. In this work, we present a compact-scale Ringbot Quad prototype as a proof of concept for the proposed mechanism, demonstrating the feasibility of a new type of mobile robot.

Abstract:
Multi-robot systems often need to navigate through obstacle-cluttered environments while performing complex tasks. Ensuring collision-free trajectories among the robots and with the obstacles is essential for overall safety, along with additional requirements such as dynamic feasibility, relative formation, connectivity maintenance and temporal tasks. Existing work mostly focuses on the design of analytical controllers that encapsulate all these constraints, which often suffer from undesired local minima due to conflicting non-convex objectives. This work proposes a novel motion planning scheme for multi-robot systems under various safety and high-level tasks, specified as signal temporal logic (STL) formulas over collective states such as collision avoidance, relative formation and connectivity maintenance. A gradient-free method called collaborative optimal transport (CLOT) is proposed that optimizes batches of system-wide smooth trajectories over highly nonlinear costs handled through the zero-order Sinkhorn-Knopp step. Via parallel computation on GPUs, it is shown to significantly improve the scalability from a few robots to over 100 robots, with an average planning time of a few seconds. Lastly, its applicability is extensively demonstrated both in simulation and hardware, over complex environments and high-level temporal tasks.
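
For readers unfamiliar with the Sinkhorn-Knopp iteration named in this abstract, the sketch below shows the classical matrix-scaling form for entropy-regularized optimal transport with uniform marginals. It is not the CLOT algorithm: how the paper builds its cost matrix from trajectory batches and uses the result in a zero-order update is not described here, so the cost matrix, `epsilon`, and the function name are assumptions for illustration only.

```python
import math


def sinkhorn_plan(cost, epsilon=0.1, iterations=200):
    """Classical Sinkhorn-Knopp scaling for entropic optimal transport.

    cost: n x m list of lists of non-negative costs (hypothetical input).
    epsilon: entropic regularization strength (assumed value).
    Returns an n x m transport plan whose rows sum to 1/n and whose
    columns sum to 1/m after convergence.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform source marginal
    b = [1.0 / m] * m  # uniform target marginal
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternate rescaling of rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    plan = sinkhorn_plan([[0.0, 1.0], [1.0, 0.0]])
    for row in plan:
        print(row)
```

The iteration uses only evaluations of the cost, never its gradient, which is consistent with the gradient-free framing in the abstract; the per-entry updates are also independent, so they map naturally onto parallel hardware.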

Abstract:
Multi-robot mapping with neural implicit representations enables the compact reconstruction of complex environments. However, it demands robustness against communication challenges like packet loss and limited bandwidth. While prior works have introduced various mechanisms to mitigate communication disruptions, performance degradation still occurs under extremely low communication success rates. This paper presents UDON, a real-time multi-agent neural implicit mapping framework that introduces a novel uncertainty-weighted distributed optimization to achieve high-quality mapping under severe communication deterioration. The uncertainty weighting prioritizes more reliable portions of the map, while the distributed optimization isolates and penalizes mapping disagreement between individual pairs of communicating agents. We conduct extensive experiments on standard benchmark datasets and real-world robot hardware. We demonstrate that UDON significantly outperforms existing baselines, maintaining high-fidelity reconstructions and consistent scene representations even under extreme communication degradation (as low as 1% success rate). The code can be found at https://iconlab.negarmehr.com/UDON/

Abstract:
Human activity recognition (HAR) is fundamental in human-robot collaboration (HRC), enabling robots to respond and dynamically adapt to human intentions. This paper introduces a HAR system combining a modular data glove equipped with Inertial Measurement Units and a vision-based tactile sensor to capture hand activities in contact with a robot. We tested our activity recognition approach under different conditions, including offline classification of segmented sequences, real-time classification under static conditions, and a realistic HRC scenario. The experimental results show a high accuracy for all the tasks, suggesting that multiple collaborative settings could benefit from this multi-modal approach.

Abstract:
Legged robots hold great promise for agile and flexible mobility across diverse and unstructured terrains, inspired by the remarkable adaptability of bipeds and quadrupeds in nature. However, achieving robust autonomous locomotion in cluttered and complex environments remains a significant challenge. In this work, we present a hierarchical control framework for quadrupedal robots that enables safe and autonomous traversal of cluttered terrains. Central to our approach is a novel multi-layer elevation map representation, which is generalized enough to capture a wide range of terrains. To further improve policy generalization and maneuverability, we incorporate terrain augmentation, knowledge distillation, and carefully designed reward functions. Extensive simulation experiments demonstrate that each component contributes to improved policy generalization, and that our terrain representation is more efficient and informative than existing alternatives. By training a terrain compressor in simulation, we successfully deploy our system on a low-cost quadrupedal robot in real-world environments, showcasing the practicality and robustness of our approach.

Abstract:
Vision-based tactile sensors are highly promising for enabling robots to perform dexterous, contact-rich manipulation tasks by providing high-resolution tactile data. Recent studies have attempted to implement shape reconstruction and force estimation capabilities for sensors with omnidirectional sensing surfaces and a compact form factor. However, achieving a small diameter comparable to that of a human fingertip remains challenging, and integrating the multiple functionalities within the fingertip form factor poses significant challenges. In this study, we present UVDtact, a vision-based tactile sensor with a fingertip-like form factor that incorporates a switchable translucent elastomer. The proposed switchable translucent elastomer, which integrates ultraviolet (UV) ink and a translucent elastomer, decouples tactile images for shape reconstruction and force estimation. The independent tactile images ensure that shape reconstruction remains unaffected by UV markers, making them visible when needed, thereby enabling effective force estimation. For shape reconstruction, we leverage the darkening effect of the translucent elastomer in response to tactile stimuli and introduce a calibration method that utilizes this effect in an all-around curved sensor configuration. Furthermore, we validate that embedding UV markers enhances tactile features, improving force estimation performance while preserving the quality of tactile images used for shape reconstruction. By integrating various tactile sensing capabilities into a compact, fingertip-like design, UVDtact contributes to developing robotic systems with human-like dexterity.

Abstract:
Extrinsic calibration for LiDAR and camera using sparse point clouds can significantly reduce cost and improve efficiency. However, most target-based methods are designed for dense point clouds and are less effective in sparse scenarios, while targetless methods primarily rely on environmental features. To address this limitation, a LiDAR-camera extrinsic calibration method for sparse point clouds is proposed in this paper. First, the proposed method extracts the complete checkerboard image via line-segment direction clustering and midpoint-to-normal projection. Second, a constructed theoretical checkerboard boundary point cloud is aligned to the scanned boundary point cloud using a proposed dimension-reduced, global-search and local-refinement (DGL) method. Third, coarse calibration is derived from the centroids of the checkerboard in images and aligned point clouds, followed by refinement through joint optimization of reprojection error and normal consistency error. Finally, experiments on the simulated dataset yield translation and rotation errors below 0.015 m and 0.3°, respectively. On a self-collected dataset, the method achieves an mIoU of 90.9% between the checkerboard region reprojected from point clouds and its image counterpart, outperforming state-of-the-art methods under sparse point cloud conditions.

Abstract:
The Samara Autorotating Wing (SAW) is a bioinspired autorotating glider capable of both controlled autorotation and diving modes. This work presents control and deployment strategies that enable precision landing of the platform from low altitude. The proposed control approach leverages cyclic control and a dive maneuver to improve landing accuracy. The deployment strategy is developed by updating parameters in a simulated model to reflect real-world performance under varying wind conditions, and then using the model to predict feasible release regions for specified wind direction, speed, and altitude. A total of 56 deployments were conducted from 60 m altitude in both low-wind (< 5 m/s) and high-wind (> 5 m/s) conditions representative of the local climate. The platform achieved landings within 10 m of the target in 89% of low-wind trials and 57% of high-wind trials. These results highlight the potential of the SAW platform for applications requiring high-precision remote sensor deployment.

Abstract:
Given that Visual SLAM relies on appearance cues for localization and scene understanding, texture-less or visually degraded environments (e.g., plain walls or low lighting) lead to poor pose estimation and track loss. However, robots are typically equipped with sensors that provide some form of dead reckoning odometry with reasonable short-term performance but unreliable long-term performance. The Good Weights (GW) algorithm described here provides a framework to adaptively integrate dead reckoning (DR) with passive visual SLAM for continuous and accurate frame-level pose estimation. Importantly, it describes how all modules in a comprehensive SLAM system must be modified to incorporate DR into its design. Adaptive weighting increases DR influence when visual tracking is unreliable and reduces it when visual feature information is strong, maintaining pose tracking without overreliance on DR. Good Weights yields a practical solution for mobile navigation that improves visual SLAM performance and robustness. Experiments on collected datasets and in real-world deployment demonstrate the benefits of Good Weights.

Abstract:
Flexible actuators have garnered extensive attention due to their flexibility and versatility. However, they still exhibit significant limitations in load capacity and structural stiffness. We have developed a multifunctional rigid-flexible coupled actuator with large deformation and high load capacity. We first investigated the structural design and material selection of the actuator. When establishing the mechanical model, we found that conventional methods could not solve it and that geometric nonlinearity could not be neglected. Therefore, we proposed a rigid-flexible coupled multibody dynamics modeling method suitable for large deformations and conducted static experiments to obtain a nonlinear torque-rotation angle curve. Finally, we compared the simulation results with the dynamic experimental results, demonstrating the effectiveness and accuracy of the proposed method.

Abstract:
Zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions and navigate without task-specific training. Prior works have demonstrated the potential of open-source large language models (LLMs) in zero-shot VLN-CE, yet two major limitations remain: (1) difficulty in accurately following instructions, and (2) susceptibility to loops in spatially confined or semantically similar regions. In this work, we introduce ReThinkNav, a framework designed to further advance open-source LLMs in zero-shot VLN-CE. ReThinkNav integrates contextual reasoning for enhanced instruction comprehension and progress estimation, enabling the LLM to accurately infer both the appropriate action and its rationale. In addition, a Loop Detection and Recovery (LDR) module detects loops and adjusts decisions accordingly. Experiments on the R2R-CE benchmark demonstrate excellent zero-shot performance, while real-world validation on the Unitree G1 humanoid robot confirms its practical applicability. The code is available at https://github.com/damonds27/ReThinkNav.

Abstract:
A common sensing problem is to use a set of stationary tracking locations to monitor a collection of moving devices: Given n objects that need to be tracked, each following its own trajectory, and m stationary traffic control stations, each with a sensing region of adjustable range; how should we adjust the individual sensor ranges in order to optimize energy consumption? We provide both negative theoretical and positive practical results for this important and natural challenge. On the theoretical side, we show that even if all objects move at constant speed along straight lines, no polynomial-time algorithm can guarantee optimal coverage for a given starting solution. On the practical side, we present an algorithm based on geometric insights that is able to find optimal solutions for the min max variant of the problem, which aims at minimizing peak power consumption. Runtimes for instances with 500 moving objects and 25 stations are in the order of seconds for scenarios that take minutes to play out in the real world, demonstrating real-time capability of our methods.

Abstract:
This paper proposes an innovative approach to full-scale autonomous highway inspection in complex environments using a quadruped robot, with the aim of enhancing the adaptability and coverage of inspection tasks. Considering adaptive locomotion control as the foundation of autonomous inspection, a multi-level locomotion learning framework based on reinforcement learning is developed, comprising primitive, skill, and inspection levels. The primitive-level control policy, built upon a Vector Quantized Variational Autoencoder, is trained through imitation learning from existing open-source robot locomotion models, thereby achieving discrete embedding and reusability of foundational locomotion knowledge. At the skill level, to support the learning of diverse inspection skills, a parametric modular scenario modeling method for the highway environment is proposed. Each skill-level control network is trained in its corresponding modular scenario while reusing the primitive-level control network. The inspection-level control network is established through multi-skill distillation from the trained skill control networks. Combined with a coverage path generator, automatic inspection can be completed. In a simulated complex highway environment, the inspection robot demonstrates diverse inspection skills, successfully completing inspection of a 14,400 m² area in 0.4 h at a speed of 2.37 m/s. Coverage and hazard detection rates both reach 100%. Compared to existing forms of highway inspection, the proposed framework with a quadruped robot enables efficient, stable, and full-scale autonomous inspection in complex highway environments, providing general deployment capability for intelligent inspection systems.

Abstract:
Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark spanning diverse embodiments in both simulation and the real world with consideration of domain randomization. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLAs' practicality and offer several key findings. Informed by these findings, we introduce LightVLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it integrates a compact VLM backbone with multi-view perception, proprioceptive tokenization, and action chunking. To eliminate reliance on costly pre-training, LightVLA adopts a two-stage training paradigm including post-training and fine-tuning. Furthermore, LightVLA extends the action space to unify navigation and manipulation. Experiments across embodiments demonstrate the generalization and versatility of LightVLA, while real-world mobile manipulation experiments establish it as the first end-to-end VLA model for mobile manipulation. We will open-source all datasets, code, and checkpoints upon acceptance to foster reproducibility and future research.

Abstract:
Humans can achieve diverse in-hand manipulations, such as object pinching and tool use, which often involve simultaneous contact between the object and multiple fingers. This is still an open issue for robotic hands because such dexterous manipulation requires distinguishing between tactile sensations generated by their self-contact and those arising from external contact. Without this distinction, contacts and collisions can break the object or the robot. Indeed, most approaches ignore self-contact altogether, by constraining motion to avoid it or by discarding self-tactile information during contact. While this reduces complexity, it also limits generalization to real-world scenarios where self-contact is inevitable. Humans overcome this challenge through self-touch perception, using predictive mechanisms that anticipate the tactile consequences of their own motion. Under this principle, called sensory attenuation, the nervous system differentiates predictable self-touch signals, allowing novel object stimuli to stand out as relevant. Deriving from this, we introduce TaSA, a two-phased deep predictive learning framework. In the first phase, TaSA explicitly learns self-touch dynamics, modeling how a robot's own actions generate tactile feedback. In the second phase, this learned model is incorporated into the motion learning phase, to emphasize object contact signals during manipulation. We evaluate TaSA on a set of insertion tasks, which demand fine tactile discrimination: inserting a pencil lead into a mechanical pencil, inserting coins into a slot, and fixing a paper clip onto a sheet of paper, with various orientations, positions, and sizes. Across all tasks, policies trained with TaSA achieve significantly higher success rates than baseline methods, demonstrating that structured tactile perception with self-touch based on sensory attenuation is critical for dexterous robotic manipulation.

Abstract:
Soft continuum robots, owing to their inherently compliant trunks and shape manipulability, have been widely deployed in complex scenarios requiring safe human-robot interaction. However, their nonlinear deformations and hyperredundant degrees of freedom pose substantial challenges for full-body shape sensing and closed-loop control of the end effector. A low-cost yet accurate feedback solution is thus highly desirable. To address this, we present a hydraulic-driven reciprocating magnet strategy, integrated with magnetic localization, to enable both shape sensing and tip pose estimation of soft continuum robots, thereby facilitating precise closed-loop control. The proposed approach time-multiplexes a single magnet under different operational phases to fulfill two functions: full-body shape reconstruction and tip pose tracking. We validate the effectiveness of the reciprocating magnet system on a pneumatic manipulator prototype with two active degrees of freedom. Experimental results show that the magnet can travel through the guide channel at a maximum speed of 6.5 cm/s, achieving average errors of less than 2 mm in position (1.1% of the robot's length) and 3° in orientation for shape sensing and tip pose estimation. Using this sensing strategy, we demonstrate a simple closed-loop control on the soft continuum robot. Owing to its simplicity, low cost, and high precision, the proposed method holds promise as a practical alternative for state feedback in soft continuum robots.

Abstract:
In this work, we propose an accurate and real-time optical flow and disparity estimation model by fusing pairwise input images in the proposed non-causal selective state space for dense perception tasks. We propose a non-causal Mamba block-based model that is fast and efficient and aptly manages the constraints present in real-time applications. Our proposed model reduces inference times while maintaining high accuracy and low GPU usage for optical flow and disparity map generation. The results and analysis justify that our proposed model can be used for unified real-time and accurate 3D dense perception estimation tasks. The code, along with the models, can be found at https://github.com/vimstereo/DensePerceptNCSSD

Abstract:
Automated parking is a challenging operational domain for advanced driver assistance systems, requiring robust scene understanding and interaction reasoning. The key challenge is twofold: (i) predict multiple plausible ego intentions according to context and (ii) for each intention, predict the joint responses of surrounding agents, enabling effective what-if decision-making. However, existing methods often fall short, typically treating these interdependent problems in isolation. We propose ParkDiffusion++, which jointly learns a multi-modal ego intention predictor and an ego-conditioned multi-agent joint trajectory predictor for automated parking. Our approach makes four key contributions. First, we introduce an ego intention tokenizer that predicts a small set of discrete endpoint intentions from agent histories and vectorized map polylines. Second, we perform ego intention-conditioned joint prediction, yielding socially consistent predictions of the surrounding agents for each possible ego intention. Third, we employ a lightweight safety-guided denoiser with different constraints to refine joint scenes during training, thus improving accuracy and safety. Fourth, we propose counterfactual knowledge distillation, where an EMA teacher refined by a frozen safety-guided denoiser provides pseudo-targets that capture how agents react to alternative ego intentions. Extensive evaluations demonstrate that ParkDiffusion++ achieves state-of-the-art performance on the Dragon Lake Parking (DLP) dataset and the Intersections Drone (inD) dataset. Importantly, qualitative what-if visualizations show that other agents react appropriately to different ego intentions.

Abstract:
Robots operating in changing environments must predict obstacle changes and/or plan quickly enough to react to them. Predictive approaches require a strong prior about the position and motion of obstacles. Reactive approaches require no assumptions about their environment but must replan quickly and find high-quality paths to navigate effectively. Reactive approaches often reuse information between queries to reduce planning cost. These techniques are conceptually sound, but updating dense planning graphs when information changes can be computationally prohibitive. It can also require significant effort to detect the changes in some applications. This paper revisits the long-held assumption that reactive replanning requires updating existing plans. It shows that the incremental planning problem can alternatively be solved more efficiently as a series of independent problems using fast almost-surely asymptotically optimal (ASAO) planning algorithms. These ASAO algorithms quickly find an initial solution and converge towards an optimal solution, which allows them to find consistent global plans in the presence of changing obstacles without requiring explicit plan reuse. This is demonstrated with simulated experiments where Effort Informed Trees (EIT) finds shorter median solution paths than the tested reactive planning algorithms, and is further validated on a real-world planning problem on a robot arm.

Abstract:
Accurate perception of object hardness is essential for safe and dexterous contact-rich robotic manipulation. Here, we present TactEx, an explainable multimodal robotic interaction framework that unifies vision, touch, and language for human-like hardness estimation and interactive guidance. While task-agnostic, we demonstrate and evaluate TactEx's capabilities on fruit-ripeness assessment as a representative use case requiring tactile perception and contextual understanding. Our system fuses GelSight-Mini tactile streams with RGB observations and language prompts: a ResNet50 + LSTM estimates hardness from tactile sequential data, and a cross-modal alignment module integrates visual cues with LLM guidance. The resulting interface is explainable and multimodal, enabling users to identify fruit ripeness levels with statistically significant separability (p < 0.01 for all fruit pairs). For touch placement, we compare YOLO with Grounded-SAM (GSAM) and find GSAM to be more robust for fine-grained segmentation and contact-site selection. A lightweight LLM parses user instructions and produces grounded natural-language explanations linked to the tactile outputs. In end-to-end evaluations, TactEx attains 90% task success on simple user queries and generalises to novel tasks without large-scale tuning. These results highlight the promise of combining pretrained visual and tactile models with language grounding to advance explainable, human-like touch perception and decision-making in robotics.

Abstract:
The physicality of exercise makes the role of athletic trainers unique. Their physical presence allows them to guide a student through a motion, demonstrate an exercise, and give intuitive feedback. Robot quadrupeds are also embodied agents with robust agility and athleticism. In our work, we investigate whether a robot quadruped can serve as an effective and enjoyable personal trainer device. We focus on a case study of interval training for runners: a repetitive, long-horizon task where precision and consistency are important. To meet this challenge, we propose Snoopie, an autonomous robot quadruped pacer capable of running interval training exercises tailored to challenge a user's personal abilities. We conduct a set of user experiments that compare the robot trainer to a wearable trainer device (the Apple Watch) to investigate the benefits of a physical embodiment in exercise-based interactions. With the quadruped trainer, participants showed 60.6% better adherence to a pace schedule and were 45.9% more consistent in their running speeds. Subjective results also showed that participants strongly preferred training with the robot over wearable devices across many qualitative axes, including its ease of use (+56.7%), enjoyability of the interaction (+60.6%), and helpfulness (+39.1%). Additional videos and visualizations can be found on our website: https://sites.google.com/view/snoopie

Abstract:
We present multipanda_ros2, a novel open-source ROS2 architecture for multi-robot control of Franka Robotics robots. Leveraging ros2_control, this framework provides native ROS 2 interfaces for controlling any number of robots from a single process. Our core contributions address key challenges in real-time torque control, including interaction control and robot-environment modeling. A central focus of this work is sustaining a 1 kHz control frequency, a necessity for real-time control and a minimum frequency required by safety standards. Moreover, we introduce a controllet-feature design pattern that enables controller-switching delays of ≤ 2 ms, facilitating reproducible benchmarking and complex multi-robot interaction scenarios. To bridge the simulation-to-reality (sim2real) gap, we integrate a high-fidelity MuJoCo simulation with quantitative metrics for both kinematic accuracy and dynamic consistency (torques, forces, and control errors). Furthermore, we demonstrate that real-world inertial parameter identification can significantly improve force and torque accuracy, providing a methodology for iterative physics refinement. Our work extends approaches from soft robotics to rigid dual-arm, contact-rich tasks, showcasing a promising method to reduce the sim2real gap and providing a robust, reproducible platform for advanced robotics research.

Abstract:
The ability to accurately assess and anticipate risks in safety-critical scenarios is crucial for autonomous driving systems. While existing research has made progress in collision prediction, accurately quantifying risk levels from monocular vision inputs remains challenging due to the complex dynamics of multi-agent interactions and the inherent uncertainty in real-world environments. To address these challenges, we present NSF-HRPT, a novel framework that combines learning-based perception with structured reasoning for quantitative risk assessment. Our approach features a Neural Semantic Field (NSF) that learns to model scene semantics, trajectory predictions, and probabilistic Time-to-Collision (TTC) distributions from simulation data. During inference, the pre-trained NSF serves as a prior for our Hierarchical Risk Perception Tree (HRPT), which enables efficient parallel computation and spatial reasoning about multi-agent risks. Additionally, we introduce a Sim2Real enhancement strategy that improves real-world applicability without retraining by incorporating priors from foundation models. Extensive evaluations demonstrate that our framework achieves state-of-the-art performance on synthetic benchmarks and delivers competitive, near-state-of-the-art results on real-world datasets for both TTC estimation accuracy and risk localization precision. The proposed method provides an effective solution for real-time risk awareness from monocular camera inputs.

Abstract:
We present a novel flapping‑wing mechanism capable of generating steering torque through a two‑degree‑of‑freedom (2‑DoF) coordinated actuation. Whereas most existing flapping‑flight systems produce steering torque by incorporating additional mechanisms that asynchronously alter passive deformation limits, our approach enables transient aerodynamic force modulation synchronized with each flapping stroke. The concept draws inspiration from biological flyers such as dragonflies and hawkmoths, which utilize multiple synchronous muscles per wing and perform stroke‑synchronized, multi‑DoF wing kinematics, strategies thought to contribute to their precise attitude and position control even at low flapping frequencies. To emulate this capability, we developed a mechanism employing parallel direct‑drive actuators within a coupled multi‑DoF architecture. By introducing a cross‑coupling force between the actuators, we eliminate path dependency in the wing‑twist motion, thereby enabling stable and independent control of both stroke angle and angle of attack. Using a single‑wing 2‑DoF testbed, we successfully demonstrate a lift force exceeding 10 gf and a yaw‑steering torque range of 1.5 mNm. This work advances the development of biologically inspired, stroke‑synchronized steering mechanisms for next‑generation flapping‑wing micro aerial vehicles.

Abstract:
Fast-start maneuvers, exemplified by the C-start in fish, represent a highly agile and very attractive locomotor strategy that requires precise multi-joint coordination under conditions of unsteady fluid dynamics, and has evolved through extensive predator-prey interactions in natural environments. Replicating such maneuvers in robotic fish is challenging due to strong fluid-structure nonlinearities, instantaneous dynamics, and complex vortex interactions. Prior approaches were limited by their dependence on specialized materials, lack of active controllability, incompatibility with mechanical structures, and inability to generate sufficient forward propulsion. Here, we propose a deep reinforcement learning method for multi-joint robotic fish that embeds key biological features of C-start maneuvers (burst acceleration, rapid directional adjustment, and two-stage bend-and-stretch motion) into the reward and observation design. By training in a physically consistent, high-performance Computational Fluid Dynamics (CFD) solver, the agent autonomously discovers effective launch strategies without requiring explicit models or real fish data. The resulting policies not only reproduce C-start-like motions and achieve fully controllable directional fast-starts, but also significantly expand the maneuvering potential of robotic fish, enabling higher velocities, greater displacement, and more agile motion than state-of-the-art methods. This biologically inspired and generalizable method demonstrates the promise of integrating biological principles into reinforcement learning to unlock advanced, high-acceleration capabilities in multi-joint aquatic robots.

Abstract:
Deep learning models are increasingly central to autonomous vehicle (AV) pipelines, yet their integration has traditionally followed a monolithic design where perception, planning, and control execute on a single onboard computer. This design overlooks the emerging paradigm of cooperative autonomy, where vehicles interact with roadside units (RSUs), edge servers, and cloud-hosted intelligence through vehicle-to-everything (V2X) connectivity. Cooperative perception and control improve safety and efficiency, but also introduce systems-level challenges: network latency, compute heterogeneity, and multi-tenant contention, all critically affect real-time decision-making. These challenges are further amplified by the increasing reliance on large foundation models, whose scale necessitates cloud deployment. We present CADET (Cooperative Autonomy through Distributed Experimentation Toolkit), a modular platform for systematic and reproducible evaluation of distributed cooperative autonomy systems under realistic deployment conditions. CADET decouples the AV stack into composable modules that can be flexibly deployed across vehicles, infrastructure, and edge/cloud tiers. The framework integrates state-of-the-art models, incorporates trace-driven network and workload emulation, and provides synchronized model-, system-, and task-level instrumentation. Through V2V and V2I experiments, we show that distributed deployment choices fundamentally shape safety, with V2V intent packets outperforming cloud-based perception and RSU-assisted perception sustaining safety until overloaded by concurrent requests. Although designed for AV pipelines, CADET also supports dataset-driven experimentation, enabling systems and ML researchers to benchmark distributed inference workloads independently of full vehicle simulation. CADET is open source, with code and demo available at https://nesl.github.io/cadet-web.

Abstract:
Multi-modal 3D semantic occupancy prediction remains challenged by two fundamental issues: (i) geometric-semantic misalignment introduced by fixed-neighborhood fusion under heterogeneous sensing distributions, and (ii) feature degradation with prediction inconsistency in dynamic scenes caused by sparse supervision. We propose TACOcc, a framework coupling a target-adaptive, bidirectional symmetric fusion module with sequential volume rendering supervision. The fusion module predicts a query-wise neighborhood size via a differentiable Gumbel-Softmax strategy, expanding the receptive field for large objects to enrich context while contracting it for small objects to suppress noise, thereby achieving precise cross-modal alignment. To stabilize predictions under sparse labels and motion, we introduce temporally enhanced Gaussian rendering that aggregates multi-frame dependencies, initializes dual-source geometric anchors, and transfers multi-view photometric constraints from images to 3D occupancy features. A velocity-adaptive temporal bandwidth further mitigates flicker in fast-motion cases. Experiments on nuScenes and SemanticKITTI demonstrate strong performance, including 28.9% mIoU on nuScenes, particularly improving small-object categories and long-range regions. These results highlight that scale-aware bidirectional fusion and temporally grounded volumetric supervision form an effective recipe for robust multi-modal occupancy perception.
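
The Gumbel-Softmax relaxation mentioned in the abstract above can be illustrated with a minimal NumPy sketch. This is not the TACOcc implementation: the candidate neighborhood sizes, the logits, and the temperature `tau` are hypothetical values chosen for illustration, and the function only shows how a discrete choice can be replaced by a differentiable soft selection.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed one-hot sample over the last axis of `logits`."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise via the inverse-CDF trick; the lower bound avoids log(0).
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical candidate neighborhood sizes and per-query logits.
sizes = np.array([1.0, 3.0, 5.0, 7.0])
logits = np.array([0.1, 2.0, 0.3, -1.0])
weights = gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0))
soft_size = float(weights @ sizes)  # soft, differentiable neighborhood size
```

Lower values of `tau` push `weights` towards a one-hot vector, so the soft size approaches a single discrete choice while remaining differentiable with respect to the logits.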

Abstract:
Accurate online inertial parameter estimation is essential for adaptive robotic control, enabling real-time adjustment to payload changes, environmental interactions, and system wear. Traditional methods often struggle to track abrupt parameter shifts or incur high computational costs, limiting their effectiveness in dynamic environments and for computationally constrained robotic systems. We introduce TAG-K, a lightweight extension of the Kaczmarz method that combines greedy randomized row selection for rapid convergence with tail averaging for robustness under noise and inconsistency. This design enables fast, stable parameter adaptation while retaining the low per-iteration complexity inherent to the Kaczmarz framework. We evaluate TAG-K in synthetic benchmarks and quadrotor tracking tasks against RLS, KF, and other Kaczmarz variants. TAG-K achieves 1.5× to 1.9× faster solve times on laptop-class CPUs and 4.8× to 20.7× faster solve times on embedded microcontrollers. More importantly, these speedups are paired with improved robustness to measurement noise and a 25% reduction in estimation error, leading to nearly 2× better end-to-end tracking performance. Website, documentation, and code available at: https://a2r-lab.org/TAG-K/.
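
To make the two ingredients named in the abstract above concrete, the following is a simplified sketch of a Kaczmarz solver for `A x = b` with a greedy randomized row choice and tail averaging. It is not the authors' TAG-K code: the selection rule (sample among rows whose normalized squared residual is at least `theta` times the largest one) and the parameters `theta` and `tail_frac` are assumptions made for illustration, and all rows of `A` are assumed to be nonzero.

```python
import numpy as np

def kaczmarz_tail_avg(A, b, iters=2000, tail_frac=0.5, theta=0.5, seed=0):
    """Greedy randomized Kaczmarz with tail averaging (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    row_norms2 = np.einsum("ij,ij->i", A, A)  # squared row norms, assumed > 0
    x = np.zeros(n)
    tail_start = int(iters * (1.0 - tail_frac))
    tail_sum = np.zeros(n)
    tail_count = 0
    for k in range(iters):
        r = b - A @ x                       # current residual
        score = r**2 / row_norms2           # normalized squared residual per row
        top = score.max()
        if top > 0.0:
            # Greedy step: keep only rows with a large residual ...
            candidates = np.flatnonzero(score >= theta * top)
            # ... randomized step: sample among them, weighted by their score.
            p = score[candidates] / score[candidates].sum()
            i = rng.choice(candidates, p=p)
            # Kaczmarz projection onto the hyperplane of row i.
            x = x + (r[i] / row_norms2[i]) * A[i]
        if k >= tail_start:                 # tail averaging over late iterates
            tail_sum += x
            tail_count += 1
    return tail_sum / tail_count

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 4))
x_true = np.array([1.0, -2.0, 0.5, 3.0])
x_hat = kaczmarz_tail_avg(A, A @ x_true)
```

Each iteration touches one row for the update, which is where the low per-iteration cost of Kaczmarz-type methods comes from; this sketch recomputes the full residual for clarity rather than efficiency.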

Abstract:
Recent methods in ergodic coverage planning have shown promise as tools that can adapt to a wide range of geometric coverage problems with general constraints, but are highly sensitive to the numerical scaling of the problem space. The underlying challenge is that the optimization formulation becomes brittle and numerically unstable with changing scales, especially under potentially nonlinear constraints that impose dynamic restrictions, due to the kernel-based formulation. This paper proposes to address this problem via the development of a scale-agnostic and adaptive ergodic coverage optimization method based on the maximum mean discrepancy metric (MMD). Our approach allows the optimizer to solve for the scale of differential constraints while annealing the hyperparameters to best suit the problem domain and ensure physical consistency. We also derive a variation of the ergodic metric in the log space, providing additional numerical conditioning without loss of performance. We compare our approach with existing coverage planning methods and demonstrate the utility of our approach on a wide range of coverage problems.
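
For readers unfamiliar with the maximum mean discrepancy, here is a minimal sketch of the standard (biased) squared-MMD estimate between two sample sets under a Gaussian kernel. It is a textbook formula rather than the paper's coverage objective; the `bandwidth` parameter and the sample data are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth):
    """Pairwise Gaussian kernel matrix between rows of X and rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth**2))

def mmd2(X, Y, bandwidth=1.0):
    """Biased estimate of the squared MMD between sample sets X and Y."""
    return (gaussian_kernel(X, X, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean()
            - 2.0 * gaussian_kernel(X, Y, bandwidth).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
same = mmd2(X, X)           # identical sets
shifted = mmd2(X, X + 3.0)  # same shape, translated
```

Because the kernel depends on distances relative to `bandwidth`, rescaling the workspace without rescaling the bandwidth changes the value of the metric, which is one way to see the scale sensitivity the abstract refers to.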

Abstract:
This paper investigates the resilient multi-robot efficient search problem (R-MuRES), which aims at coordinating multiple robots for the minimal time detection of a 'non-adversarial' moving target. R-MuRES faces challenges like robot malfunctions and withdrawals during task execution, leading to a variable number of searchers and new research hurdles. We propose resilient value function factorization (R-FAC) to construct a central value function resiliently, minimizing mean squared temporal difference (TD) errors across team compositions. R-FAC ensures that individual global maximum (IGM) principles are met, allowing functioning robots to contribute positively. We introduce variational value decomposition network (V2DN) as an instantiation of the R-FAC paradigm, proving superior to brute-force summation in multi-robot search tasks. V2DN is compared with state-of-the-art MuRES solutions and the vanilla VDN, showcasing superior resiliency when robots leave the team. Validation of V2DN is performed in a real multi-robot system in a self-constructed indoor environment, demonstrating its effectiveness and contributing valuable insights to the robotics community.

Abstract:
This letter proposes a new image-based visual servoing controller for positioning a camera with respect to a cylindrical object. Traditional image-based approaches often rely on estimating planar parameters from the cylinder's projected edges, making them sensitive to noise and modeling errors. In this work, we introduce a novel controller that uses pure image features that are directly tied to the cylinder's 3D pose, which depends solely on the cylinder radius. Crucially, this controller offers formal global stability irrespective of the radius estimate. Simulations and real experiments with a robotic arm confirm the controller's improved convergence and robustness under practical conditions.

Abstract:
This work presents a method for automatically detecting and recognizing test tube types in a rack. It leverages automatic segmentation, clustering, and labeling processes to eliminate the need for explicitly preparing training data. These processes are addressed by using combined global prediction and local cropping, where global prediction estimates the slot occupation states of a rack, and local cropping extracts tube pictures in the local regions of each slot for clustering and labeling. With the help of the proposed method, the robotic tube manipulation system no longer needs tailored data and explicit training in the presence of new tubes, thus achieving flexibility and efficiency. Experimental evaluations conducted with a RealSense D405 camera and the UFactory xArm Lite6 robot manipulator confirm the method's effectiveness in accurately identifying novel test tube types under real-world conditions.

Abstract:
The Hospital at Home initiative transforms medical service automation through modern technologies. This paper revisits remote physiotherapy, allowing convalescents to record exercises using mobile devices from arbitrary angles. To address this, we propose a physiotherapy video matching method that accurately aligns movements from unconstrained viewpoints. The task is formulated as an optimization problem and solved using a modular pipeline. We introduce the Angle-of-Limb-based Posture Structure (ALPS) and the Camera-Angle-Free (CAFE) transformation to counter camera-angle differences. We also develop the Three-phase ALPS Matching Algorithm (TALMA) for matching movements between mentor and convalescent videos. Real-world experiments show our method outperforms existing solutions in both precision and practicality, with a time deviation of less than 0.07 seconds from expert annotations. The prototype and datasets are publicly available at: https://github.com/NCKU-CIoTlab/TALMA-on-ALPS/.

Abstract:
The growing demand for neurorehabilitation is driving the development of innovative, home-based robotic solutions, offering a promising approach to alleviate the strain on healthcare systems burdened by limited resources and workforce shortages. Despite significant technological advancements in rehabilitation robotics, adoption remains limited due to unresolved safety, legal, and ethical concerns. This study provides a comprehensive analysis of these three aspects from the perspective of experienced neurorehabilitation clinicians, offering valuable insights into the challenges surrounding home-based rehabilitation robots. Using a qualitative approach, we identified eight key themes derived from clinicians' feedback. These themes underscore critical areas, including the need for robust safety measures, regulatory clarity on liability and data privacy, and the ethical imperative of ensuring equitable access to technology for diverse user populations. Our findings highlight the need for a multifaceted approach to overcome these challenges, including user-centred design, rigorous testing, comprehensive user training, and necessary updates to regulatory frameworks to ensure the safe, effective, and equitable deployment of these technologies.

Abstract:
Autonomous navigation for legged robots in complex and dynamic environments relies on robust simultaneous localization and mapping (SLAM) systems to accurately map surroundings and localize the robot, ensuring safe and efficient operation. While prior sensor fusion-based SLAM approaches have integrated various sensor modalities to improve their robustness, these algorithms are still susceptible to estimation drift in challenging environments due to their reliance on unsuitable fusion strategies. Therefore, we propose a robust LiDAR-visual-inertial-kinematic odometry system that integrates information from multiple sensors, such as a camera, LiDAR, inertial measurement unit (IMU), and joint encoders, for visual and LiDAR-based odometry estimation. Our system employs a fusion-based pose estimation approach that runs optimization-based visual-inertial-kinematic odometry (VIKO) and filter-based LiDAR-inertial-kinematic odometry (LIKO) based on measurement availability. In VIKO, we utilize the foot-preintegration technique and robust LiDAR-visual depth consistency using superpixel clusters in a sliding window optimization. In LIKO, we incorporate foot kinematics and employ a point-to-plane residual in an error-state iterative Kalman filter (ESIKF). Compared with other sensor fusion-based SLAM algorithms, our approach shows robust performance across public and long-term datasets.

Abstract:
This paper presents an effective and reliable pose tracking solution, termed ERPoT, for mobile robots operating in large-scale outdoor and challenging indoor environments, underpinned by an innovative prior polygon map. In particular, to overcome the challenge that arises as the map size grows with the expansion of the environment, a novel form of prior map composed of multiple polygons is proposed. Benefiting from the use of polygons to concisely and accurately depict environmental occupancy, the prior polygon map achieves long-term reliable pose tracking while ensuring a compact form. More importantly, pose tracking is carried out under pure LiDAR mode, and the dense 3D point cloud is transformed into a sparse 2D scan through ground removal and obstacle selection. On this basis, a novel cost function for pose estimation through point-polygon matching is introduced, encompassing two distinct constraint forms: point-to-vertex and point-to-edge. In this study, our primary focus lies on two crucial aspects: lightweight and compact prior map construction, as well as effective and reliable robot pose tracking. Both aspects serve as the foundational pillars for future navigation across diverse mobile platforms equipped with different LiDAR sensors in varied environments. Comparative experiments based on the publicly available datasets and our self-recorded datasets are conducted, and evaluation results show the superior performance of ERPoT on reliability, prior map size, pose estimation error, and runtime over the other six approaches. The corresponding code can be accessed at https://github.com/ghm0819/ERPoT, and the supplementary video is at https://youtu.be/6XdcXyUrLKw.
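
As an illustration of the two constraint forms named above, the following minimal Python sketch computes the distance from a 2D scan point to a polygon boundary and reports whether the closest feature is an edge or a vertex. It is only one plausible reading of point-polygon matching, not the ERPoT cost function itself; the function name `point_polygon_residual` and the tuple-based polygon format are assumptions made for this example.

```python
import math

def point_polygon_residual(p, polygon):
    """Return (distance, kind) from 2D point p to the closest polygon boundary feature.

    polygon is a list of (x, y) vertices in order; the last vertex connects to the first.
    kind is "point-to-edge" when the perpendicular foot lies strictly inside an edge,
    and "point-to-vertex" otherwise. (Hypothetical helper, not taken from ERPoT.)
    """
    best = (math.inf, None)
    n = len(polygon)
    for i in range(n):
        a, b = polygon[i], polygon[(i + 1) % n]
        abx, aby = b[0] - a[0], b[1] - a[1]
        apx, apy = p[0] - a[0], p[1] - a[1]
        denom = abx * abx + aby * aby
        # Normalised position of the projection of p onto the line through a and b.
        t = (apx * abx + apy * aby) / denom if denom > 0 else 0.0
        if 0.0 < t < 1.0:
            # Projection falls inside the segment: perpendicular (point-to-edge) distance.
            d = abs(abx * apy - aby * apx) / math.sqrt(denom)
            kind = "point-to-edge"
        else:
            # Projection falls outside the segment: distance to the nearer end vertex.
            v = a if t <= 0.0 else b
            d = math.hypot(p[0] - v[0], p[1] - v[1])
            kind = "point-to-vertex"
        if d < best[0]:
            best = (d, kind)
    return best

square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
edge_case = point_polygon_residual((1.0, -1.0), square)
vertex_case = point_polygon_residual((3.0, 3.0), square)
```

In a pose estimator, residuals of this kind would be summed over all scan points and minimised over the robot pose; that outer optimisation is omitted here.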

Abstract:
This paper presents an indirect adaptive predictor-preview control architecture for continuous-time systems with unknown time-varying input delays and unknown (slowly varying) parameters. An adaptive super-twisting algorithm (STA) estimates the unknown delay online using a monotone ramp probe, and an indirect recursive least-squares (RLS) module tracks slow parameter variations; both feed a frozen-parameter predictor and a preview feedforward based on r(t + h(t)). Nominal exponential tracking is shown under exact prediction, and a practical input-to-state stability (ISS) bound is derived that accounts for delay/parameter estimation errors, disturbances, and numerical approximation. On the DC motor speed servo benchmark, the controller reduces steady-state RMSE/peak error to 0.046/0.074 (S1) and 0.062/0.099 (S4), below all compared baselines.

Abstract:
Rectangular nozzles are attractive for large-format additive manufacturing (LFAM) due to their improved deposition efficiency. However, single-inlet feeding of high-aspect-ratio nozzles inherently induces lateral pressure gradients, causing center-heavy flow and eliminating localized control during dynamic trajectories. We introduce a distributed multi-inlet extrusion testbed featuring three independently actuated inlets. Functioning as a programmable fluid manifold, this architecture actively manages the internal flow field. In-line laser profilometry is integrated as a continuous state estimator to quantify cross-sectional bead geometry. Experiments confirm this distributed architecture regularizes flow, achieving nominal steady-state extrusion with 33% less input flow and a 78% reduction in required plunger velocity per actuator compared to a single-inlet baseline. Furthermore, differential actuation enables high-resolution lateral steering and improves deposition under simulated outlet constraints with high allocation efficiency. This work establishes the hardware and state-estimation foundation for dynamically reconfigurable nozzle outlets by mapping inputs to spatial outputs.

Abstract:
Bimanual manipulation is imperative yet challenging for robots to execute complex tasks, requiring coordinated collaboration between two arms. However, existing methods for bimanual manipulation often rely on costly data collection and training, struggling to generalize to unseen objects in novel categories efficiently. In this paper, we present Bi-Adapt, a novel framework designed for efficient generalization for bimanual manipulation via semantic correspondence. Bi-Adapt achieves cross-category affordance mapping by leveraging the strong capability of vision foundation models. Fine-tuning with restricted data on novel categories, Bi-Adapt exhibits notable generalization to out-of-category objects in a zero-shot manner. Extensive experiments conducted in both simulation and real-world environments validate the effectiveness of our approach and demonstrate its high efficiency, achieving a high success rate on different benchmark tasks across novel categories with limited data. Project website: https://biadapt-project.github.io/

Abstract:
This paper presents the design, modeling, and experimental validation of a novel leg-wheel mechanism featuring an integrated, passive angle-bisecting foot. The core of the design is a two-stage planetary gear system. This system mechanically ensures a consistent foot-ground contact angle, addressing a key limitation in transformable robots with symmetrical leg-wheels. To leverage this innovation, we developed a comprehensive kinematic model. Furthermore, we designed a hierarchical motion planning framework that utilizes the pure rolling motion enabled by the mechanism. The effectiveness of the proposed design was validated through hardware experiments on a 23 kg prototype. The results demonstrated improved energy efficiency based on the Cost of Transport (C.O.T.) metric, achieving up to a 16.2% reduction in C.O.T. alongside a 28.6% reduction in pitch oscillation compared to a baseline design. This study provides a valuable guideline for developing adaptive gait controllers that can optimize for energy efficiency in real time.

Abstract:
Tracking moving anatomical targets with robotic ultrasound is particularly challenging when the target motion is both fast and large in scale, as the end-to-end latency of existing systems prevents the perception-control loop from closing fast enough. In this paper, we argue that overcoming this limitation calls for the joint design of perception and control, rather than optimizing each in isolation. We present a tightly-coupled framework with two main components: (1) a Decoupled Dual-Stream Perception Network that estimates 3D translational state from 2D ultrasound images at high frequency, and (2) a Single-Step Flow Policy that outputs an entire action sequence in one forward pass, removing the need for iterative rollouts used in conventional policies. Together, the two modules enable closed-loop control at over 60 Hz. In phantom experiments with complex 3D trajectories, the system achieves a mean tracking error below 6.5 mm and re-acquires the target after resultant displacements exceeding 170 mm. It tracks targets moving at speeds up to 102 mm/s with a terminal error under 1.7 mm. In-vivo trials on a human volunteer further confirm that the approach transfers to realistic clinical conditions. To our knowledge, this is the first RUSS framework to unify high-bandwidth dynamic tracking with large-scale repositioning within a single architecture, offering a concrete step toward autonomous ultrasound operation in the presence of patient motion.

Abstract:
Robotic perception often requires solving large nonlinear least-squares (NLS) problems. While sparsity has been well-exploited to scale solvers, a complementary and underexploited structure is separability, where some variables (e.g., visual landmarks) enter the residuals linearly and, for any estimate of the remaining variables (e.g., poses), have a closed-form least-squares solution that can be substituted back to reduce the problem size and improve conditioning. Variable projection (VarPro) methods are a family of techniques to exploit this structure; however, they have seen limited use in robotic perception, in part because gauge symmetries (e.g., cost invariance to global shifts and rotations), which are common in perception problems, induce specific computational challenges in standard VarPro approaches. We present a VarPro scheme designed for problems with gauge symmetries that jointly exploits separability and sparsity. Our method can be applied as a one-time preprocessing step to construct a matrix-free Schur complement operator. This operator allows for efficiently evaluating costs, gradients, and Hessian-vector products of the reduced problem and readily integrates with standard iterative NLS solvers. We provide precise conditions under which our method applies, and describe extensions when these conditions are only partially met. Across synthetic and real benchmarks in SLAM, SNL, and SfM, our approach achieves up to 2×-35× faster runtimes than state-of-the-art methods while maintaining accuracy. We release an open-source C++ implementation and all datasets.
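
The separability idea can be seen on a toy problem. The Python sketch below fits y = c * exp(-a * t), where the amplitude c enters linearly and is eliminated in closed form for every candidate decay rate a, leaving a one-dimensional reduced problem. It illustrates generic variable projection only; it does not handle gauge symmetries and does not use the paper's Schur complement operator. The model, the function names, and the golden-section search are assumptions chosen for this example.

```python
import math

def solve_linear(a, ts, ys):
    """Closed-form least-squares amplitude c for a fixed decay rate a."""
    phi = [math.exp(-a * t) for t in ts]
    return sum(p * y for p, y in zip(phi, ys)) / sum(p * p for p in phi)

def reduced_cost(a, ts, ys):
    """Cost of the reduced problem: c has been substituted by its optimum."""
    c = solve_linear(a, ts, ys)
    return sum((y - c * math.exp(-a * t)) ** 2 for t, y in zip(ts, ys))

def fit_varpro(ts, ys, lo=0.0, hi=5.0, iters=80):
    """Minimise the reduced cost over a by golden-section search, then recover c."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    x1, x2 = hi - g * (hi - lo), lo + g * (hi - lo)
    f1, f2 = reduced_cost(x1, ts, ys), reduced_cost(x2, ts, ys)
    for _ in range(iters):
        if f1 < f2:
            # Minimum lies in [lo, x2]; reuse x1 as the new upper probe.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - g * (hi - lo)
            f1 = reduced_cost(x1, ts, ys)
        else:
            # Minimum lies in [x1, hi]; reuse x2 as the new lower probe.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + g * (hi - lo)
            f2 = reduced_cost(x2, ts, ys)
    a = 0.5 * (lo + hi)
    return a, solve_linear(a, ts, ys)

# Noise-free synthetic data with a = 0.7 and c = 3.0.
ts = [0.1 * k for k in range(41)]
ys = [3.0 * math.exp(-0.7 * t) for t in ts]
a_hat, c_hat = fit_varpro(ts, ys)
```

The search runs over a alone, so the optimiser never sees the linear variable; in perception problems the same substitution removes the landmarks and leaves a problem over poses.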

Abstract:
Deep reinforcement learning has achieved super-human racing performance in high-fidelity simulators like Gran Turismo 7 (GT7). It typically utilizes global features that require instrumentation external to a car, such as precise localization of agents and opponents, limiting real-world applicability. To address this limitation, we introduce a vision-based autonomous racing agent that relies solely on ego-centric camera views and onboard sensor data, eliminating the need for precise localization during inference. This agent employs an asymmetric actor-critic framework: the actor uses a recurrent neural network with the sensor data local to the car to retain track layouts and opponent positions, while the critic accesses the global features during training. Evaluated in GT7, our agent consistently outperforms model predictive control drivers. To our knowledge, this work presents the first vision-based autonomous racing agent to demonstrate champion-level performance in competitive racing scenarios.

Abstract:
Motion generation in dynamic environments is crucial for human-machine interaction with redundant manipulators. In this context, we propose the Enhancing Safety and Manipulability (ESM) scheme, which integrates geometry-based dynamic obstacle avoidance, manipulability optimization, trajectory tracking, and joint limit avoidance into a unified scheme operating at the joint-angle level. The introduction of a flexible collision library enables the scheme to locate critical points based on geometry, while the incorporated obstacle speed allows the scheme to effectively avoid dynamic obstacles. In the ESM, manipulability is naturally set as the non-convex goal. To solve the ESM, the Accelerated Multi-agent recurrent Neural Network (AMNN) is proposed, which uses a meta-heuristic approach to construct activation functions, endowing the neural network with non-convex control capabilities. Subsequently, a GPU-based parallel computing method is implemented, significantly reducing computing time. Detailed simulations, experiments, and comparisons demonstrate the framework's effectiveness and superiority.

Abstract:
Monocular 3D object detection has received growing recognition in contemporary research due to its reduced hardware complexity and lower deployment cost compared to multi-sensor-based approaches. Prior research has primarily addressed ideal environmental settings, neglecting the influence of diverse weather scenarios, including rain, snow, and fog, that significantly hinder detection reliability. To enhance robustness under inclement weather conditions, we introduce MonoEM, a monocular 3D object detection framework that leverages object-level image representations and equirectangular maps. Starting from 2D detection results, MonoEM derives equirectangular maps through an equirectangular object-level reconstruction. Furthermore, MonoEM suppresses inclement weather noise in object-level images through image restoration. Subsequently, MonoEM fuses the reconstructed equirectangular map with the restored image and performs 3D bounding box prediction using a visual-range fusion detector. The integration of 2D-3D box alignment loss between 2D and 3D bounding boxes improves the geometric alignment and 3D object detection accuracy. Experimental results across various inclement weather conditions validate the notable accuracy and robustness of MonoEM compared to existing monocular 3D baselines. The source code is provided at https://anonymous.4open.science/r/MonoEM-00AC.

Abstract:
Humans naturally express emotions with subtle variations, and exaggerated expressions often appear as heightened intensity in facial, bodily, or vocal cues. This paper introduces a method for exaggerating robotic emotional expressions by dynamically adjusting intensity within an emotion dynamics model. By systematically manipulating the damping ratio, we generated five distinct intensity levels for each emotion, thereby producing emotional expressions that exhibited different degrees of overshoot. A user study revealed that liveliness ratings for surprise increased linearly with intensity, suggesting that exaggerated, high-energy dynamics are particularly effective for conveying surprise. In contrast, other emotions exhibited optimal points at intermediate levels, indicating that excessive exaggeration can reduce perceived naturalness. These findings highlight the need for emotion-specific and user-specific calibration of expression intensity, supporting more nuanced and engaging human-robot interactions.
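
The link between damping ratio and overshoot can be made concrete with the textbook step response of an underdamped second-order system, whose peak overshoot is exp(-pi * zeta / sqrt(1 - zeta^2)). The sketch below assumes the emotion dynamics model behaves like such a system, which the abstract does not state; the five damping-ratio values are hypothetical placeholders, not the levels used in the study.

```python
import math

def peak_overshoot(zeta):
    """Peak overshoot (as a fraction of the step size) of a second-order step response."""
    if zeta <= 0.0:
        raise ValueError("damping ratio must be positive")
    if zeta >= 1.0:
        return 0.0  # critically damped or overdamped: no overshoot
    return math.exp(-math.pi * zeta / math.sqrt(1.0 - zeta ** 2))

# Hypothetical intensity levels: a lower damping ratio gives a more exaggerated motion.
levels = [1.0, 0.8, 0.6, 0.4, 0.2]
overshoots = [peak_overshoot(z) for z in levels]
```

Under this assumption, each step down in damping ratio yields a strictly larger overshoot, which is one way five intensity levels could be distinguished.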

Abstract:
Robust grasping in cluttered environments remains an open challenge in robotics. While benchmark datasets have significantly advanced deep learning methods, they mainly focus on simplistic scenes with light occlusion and insufficient diversity, limiting their applicability to practical scenarios. We present GraspClutter6D, a large-scale real-world grasping dataset featuring: (1) 1,000 highly cluttered scenes with dense arrangements (14.1 objects/scene, 62.6% occlusion), (2) comprehensive coverage across 200 objects in 75 environment configurations (bins, shelves, and tables) captured using four RGB-D cameras from multiple viewpoints, and (3) rich annotations including 736K 6D object poses and 9.3B feasible robotic grasps for 52K RGB-D images. We benchmark state-of-the-art segmentation, object pose estimation, and grasping detection methods to provide key insights into challenges in cluttered environments. Additionally, we validate the dataset's effectiveness as a training resource, demonstrating that grasping networks trained on GraspClutter6D significantly outperform those trained on existing datasets in both simulation and real-world experiments. The dataset, toolkit, and annotation tools are publicly available on our project website: https://sites.google.com/view/graspclutter6d.

Abstract:
The Remote Center of Motion (RCM) mechanism is widely used in surgical robots for Minimally Invasive Surgery (MIS). For endoscopic spine surgery, the surgical robot requires an RCM mechanism with sufficient stiffness and a compact end effector to manipulate hard tissues while avoiding interference with other instruments. This paper presents a modified belt-driven RCM mechanism designed to meet these specific requirements of endoscopic spine surgery. Gearboxes were incorporated into the belt-driven RCM mechanism to reduce belt tension and the resulting elastic deformation, while the RCM constraint is maintained through a specific relationship between gearbox reduction ratios. The prismatic joint for instrument insertion is relocated to the base to reduce the size of the end effector. A prototype surgical robot with the presented RCM mechanism achieved an RCM point accuracy of 0.56 mm, repeatability of 0.019 mm, and stiffness of 11.676, 12.435, and 5.341 N/mm in the X, Y, and Z directions, respectively. The feasibility of the robot was validated through simulated BESS on a spine phantom.

Abstract:
Tools extend the manipulation abilities of robots, much like they do for humans. Despite human expertise in tool manipulation, teaching robots these skills faces challenges. The complexity arises from the interplay of two simultaneous points of contact: one between the robot and the tool, and another between the tool and the environment. Tactile and proximity sensors play a crucial role in identifying these complex contacts. However, learning tool manipulation with a small amount of real-world data using these sensors remains challenging due to the large sim-to-real gap and sensor noise. To address this, we propose a few-shot tool-use skill transfer framework using multimodal sensing. The framework involves pre-training the base policy to capture contact states common in tool-use skills in simulation and fine-tuning it with human demonstrations collected in the real-world target domain to bridge the domain gap. We validate that this framework enables teaching surface-following tasks using tools with diverse physical and geometric properties with a small number of demonstrations on the Franka Emika robot arm. Our analysis suggests that the robot acquires new tool-use skills by transferring the ability to recognise tool-environment contact relationships from pre-trained to fine-tuned policies. Additionally, integrating proximity and tactile sensors enhances the identification of contact states and environmental geometry.

Abstract:
The recently developed approach to motion planning in graphs of convex sets (GCS) provides an efficient framework for computing shortest-distance collision-free paths using convex optimization. This new motion planner is notably more computationally efficient than popular sampling-based motion planners, but it does not support nonconvex cost functions. This article develops a novel motion planning algorithm, graph of convex sets with general costs (GCSGC), to solve this problem. A given nonconvex cost function is accurately approximated by a multiple-layer ReLU neural network and the configuration space is decomposed into a set of linear-cost regions using the hidden layers of the neural network. These linear-cost regions are intersected with a set of collision-free regions, and the resulting collision-free linear-cost regions are intersected to form the vertices and edges of the motion planner's underlying graph structure. The edge costs have a closed-form solution within each collision-free linear-cost region, but it is nonconvex, so the McCormick relaxation is applied to convexify the edge costs. Finally, a graph preprocessing technique is developed to compute a representative graph structure that acts as a heuristic for the edge costs of the underlying GCS and then simplify the underlying graph structure by removing cycles and high-cost paths, which can significantly improve the efficiency of the planner and quality of the produced trajectories. The proposed motion planner is first validated in a 2-D configuration space with comparisons between different sized neural networks with and without preprocessing, comparisons between optimal trajectories from GCSGC with shortest-distance trajectories, and comparisons between GCSGC and GCS-Sequential linear programming (SLP). The GCSGC planner is further validated in a complex 7-D configuration space by comparing to state-of-the-art multiquery (PRM, GCS-SLP) and single-query (TrajOpt, BIT, AIT, RRT) planners.
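
For readers unfamiliar with the McCormick relaxation mentioned above, the sketch below evaluates the standard McCormick envelope of a single bilinear term w = x * y over a box. It shows the generic relaxation only; which terms of the GCSGC edge cost are relaxed, and over which bounds, is not stated in the abstract, so the function name and the box are assumptions for this example.

```python
def mccormick_bounds(x, y, xl, xu, yl, yu):
    """Lower and upper McCormick envelope values for w = x * y on [xl, xu] x [yl, yu].

    The four linear inequalities are the standard convex under-estimators and
    concave over-estimators of the bilinear term.
    """
    lower = max(xl * y + x * yl - xl * yl,
                xu * y + x * yu - xu * yu)
    upper = min(xu * y + x * yl - xu * yl,
                xl * y + x * yu - xl * yu)
    return lower, upper

# Check on a grid that the true product always lies inside the envelope.
xl, xu, yl, yu = -1.0, 2.0, 0.5, 3.0
grid = [(xl + i * (xu - xl) / 10.0, yl + j * (yu - yl) / 10.0)
        for i in range(11) for j in range(11)]
envelope_holds = all(
    lo - 1e-12 <= x * y <= hi + 1e-12
    for (x, y) in grid
    for (lo, hi) in [mccormick_bounds(x, y, xl, xu, yl, yu)]
)
```

Replacing the equality w = x * y by these four linear inequalities turns a nonconvex constraint into a convex one, at the price of a gap that shrinks as the box gets tighter.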

Abstract:
This work presents a robust person tracking and re-identification system designed for Human-Robot Interaction applications. The approach introduces the One-Class Body-Part (OCBP) Transformer, trained online to model interactions among body-part features and construct a robust target representation. To improve data association and reduce identity swaps during the tracking phase, the SORT tracker is extended with depth information in order to provide correct samples for the Online Continual Learning (OCL) setting. The transformer is further enhanced through the use of pseudo-negative samples, which accelerate convergence during the online learning phase. Ablation studies compare the performance of the memory management system using different sample insertion configurations and highlight the benefit of using pseudo-negative samples. The proposed method is evaluated on a public dataset, where it outperforms state-of-the-art approaches in challenging scenarios, and is validated in a real-world person-following experiment with a robotic platform in an environment with multiple distractors, occlusions, out-of-view situations and illumination changes. Despite these complexities, the robot consistently re-identified and followed the target individual. Runtime analysis demonstrates that the system operates reliably on embedded computing platforms with NVIDIA GPUs, making it both robust and resource-efficient for real-world deployment.

Abstract:
Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT: High-speed UAV Navigation and Tracking, a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail. Video: https://youtu.be/YsSflqPPHhs

Abstract:
Robotic reproduction of oil paintings using soft brushes and pigments requires force-sensitive control of deformable tools, prediction of brushstroke effects, and multi-step stroke planning, often without human step-by-step demonstrations or faithful simulators. Given only a sequence of target oil painting images, can a robot infer and execute the stroke trajectories, forces, and colors needed to reproduce it? We present IMPASTO, a robotic oil-painting system that integrates learned pixel dynamics models with model-based planning. The dynamics models predict canvas updates from image observations and parameterized stroke actions; a receding-horizon model predictive control optimizer then plans trajectories and forces, while a force-sensitive controller executes strokes on a 7-DoF robot arm. IMPASTO integrates low-level force control, learned dynamics models, and high-level closed-loop planning, learns solely from robot self-play, and approximates human artists' single-stroke datasets and multi-stroke artworks, outperforming baselines in reproduction accuracy. Project website and appendix: https://impasto-robopainting.github.io/.

Abstract:
Humanoid whole-body locomotion control is a critical approach for humanoid robots to leverage their inherent advantages. Learning-based control methods derived from retargeted human motion data provide an effective means of addressing this issue. However, because most current human datasets lack measured force data, and learning-based robot control is largely position-based, achieving appropriate compliance during interaction with real environments remains challenging. This paper presents Compliant Task Pipeline (CoTaP): a pipeline that leverages compliance information in the learning-based structure of humanoid robots. A two-stage dual-agent reinforcement learning framework combined with model-based compliance control for humanoid robots is proposed. In the training process, first a base policy with a position-based controller is trained; then in the distillation, the upper-body policy is combined with model-based compliance control, and the lower-body agent is guided by the base policy. In the upper-body control, adjustable task-space compliance can be specified and integrated with other controllers through compliance modulation on the symmetric positive definite (SPD) manifold, ensuring system stability. We validated the feasibility of the proposed strategy in simulation and experiment, primarily comparing the responses to external disturbances under different compliance settings.

Abstract:
Conversation can benefit from small rhythmic gestures that track prosody, reinforce structure, and help to keep attention. However, many robots used in human-robot interaction still rely on fixed templates or clip libraries that scale poorly to open-domain interactions; moreover, embedded platforms impose tight limits on motion range, speeds, and timing. Consequently, gesture generation methods must be lightweight, stable, and easy to integrate. To address this need, this work presents a lightweight gesture-generation model that generates beat gestures in real time based on the transcription of the robot's speech. First, a Conditional Variational Autoencoder (CVAE) conditioned on sentence-level BERT embeddings is trained on 2D pose-text pairs to produce upper-body pose sequences. Next, a geometry-based retargeting algorithm deterministically maps those poses to the robot's joints while enforcing kinematic limits. Finally, the joint sequence is converted into a pseudo-state machine and triggered in lockstep with the utterance. The results show that the system achieves smooth, text-conditioned beat gestures with solid fidelity and temporal diversity, and demonstrates real-time performance when integrated on a social robot.

Abstract:
We introduce and solve the novel task of controlled separation of small objects with two fingers of a multi-purpose robotic hand: after grasping into a box of small objects, the task is to drop objects until a desired number remains between the fingers. The objects are small both relative to the width of the fingers and in absolute terms: in our case, pellets with a diameter of only 6 mm are handled. We show that the task can be performed using touch alone (no vision) with a spatially-resolved tactile skin on a fingertip. The separation policy is trained in simulation via reinforcement learning using a straightforward sparse reward, which checks whether the desired number of objects is reached. In simulation experiments, we provide an exhaustive analysis of the benefits of using spatially-resolved tactile feedback: while an ideal (high-resolution) tactile sensor allows solving the task almost perfectly, a sensor with lower spatial resolution (here 4x4 taxels) still leads to an improvement of up to 20% compared to using only the fingers' joint sensors. For this analysis, we further train an estimator alongside the policy that predicts the ground truth contact positions. Finally, we demonstrate the successful sim-to-real transfer for the DLR-Hand II equipped with a tactile skin.

Abstract:
Visual policy design is crucial for aerial navigation. However, state-of-the-art visual policies often overfit to a single track and their performance degrades when track geometry changes. We develop FalconGym 2.0, a photorealistic simulation framework built on Gaussian Splatting (GSplat) with an Edit API that programmatically generates diverse static and dynamic tracks in milliseconds. Leveraging FalconGym 2.0's editability, we propose a Performance-Guided Refinement (PGR) algorithm, which concentrates visual-policy training on challenging tracks while iteratively improving performance. Across two case studies (fixed-wing UAVs and quadrotors) with distinct dynamics and environments, we show that a single visual policy trained with PGR in FalconGym 2.0 outperforms state-of-the-art baselines in generalization and robustness: it generalizes to three unseen tracks with 100% success without per-track retraining and maintains higher success rates under gate-pose perturbations. Finally, we demonstrate zero-shot sim-to-real transfer of the PGR-trained visual policy to quadrotor hardware, achieving a 98.6% success rate (69/70 gates) over 30 trials across two three-gate tracks and one moving-gate track.

Abstract:
Vision-Language-Action (VLA) models rely on current observations, including images, language instructions, and robot states, to predict actions and complete tasks. While accurate visual perception is crucial for precise action prediction and execution, recent work has attempted to further improve performance by introducing explicit reasoning during inference. However, such approaches face significant limitations. They often depend on data-intensive resources such as Chain-of-Thought (CoT) style annotations to decompose tasks into step-by-step reasoning, and in many cases require additional visual grounding annotations (e.g., bounding boxes or masks) to highlight relevant image regions. Moreover, they involve time-consuming dataset construction, labeling, and retraining, which ultimately results in longer inference sequences and reduced efficiency. To address these challenges, we propose ATA, a novel training-free framework that introduces implicit reasoning into VLA inference through complementary attention-guided and action-guided strategies. Unlike CoT or explicit visual-grounding methods, ATA formulates reasoning implicitly by integrating attention maps with an action-based region of interest (RoI), thereby adaptively refining visual inputs without requiring extra training or annotations. ATA is a plug-and-play implicit reasoning approach for VLA models, lightweight yet effective. Extensive experiments show that it consistently improves task success and robustness while preserving, and even enhancing, inference efficiency.

Abstract:
Operating drones in urban environments often means they need to land on rooftops, which can have different geometries and surface irregularities. Accurately detecting roof inclination using conventional sensing methods, such as vision-based or acoustic techniques, can be unreliable, as measurement quality is strongly influenced by external factors including weather conditions and surface materials. To overcome these challenges, we propose a novel unmanned aerial manipulator morphology featuring a dual-arm aerial manipulator with an omnidirectional 3D workspace and extended reach. Building on this design, we develop a proprioceptive contact detection and contact localization strategy based on a momentum-based torque observer. This enables the UAM to infer the inclination of slanted surfaces blindly through physical interaction prior to touchdown. We validate the approach in flight experiments, demonstrating robust landings on surfaces with inclinations of up to 30.5° and achieving an average surface inclination estimation error of 2.87° over 9 experiments at different incline angles.
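
A momentum-based observer estimates external torque without force sensing by comparing the measured generalized momentum with the momentum predicted from the commanded torque. The sketch below is a deliberately reduced single-joint version with constant inertia and no gravity or Coriolis terms; it is not the observer used on the aerial manipulator, and the function name and all parameter values are assumptions for this example.

```python
def simulate_momentum_observer(tau_ext=2.0, inertia=1.5, gain=50.0, dt=0.001, steps=2000):
    """Simulate a 1-DoF joint under a constant external torque and return the residual history.

    The residual r obeys r = gain * (p - p0 - integral of (tau_cmd + r) dt), which makes
    it a first-order low-pass estimate of the external torque.
    """
    velocity = 0.0
    tau_cmd = 0.0                 # commanded torque (zero here: the joint is passive)
    p0 = inertia * velocity       # initial generalized momentum
    integral = 0.0
    residual = 0.0
    history = []
    for _ in range(steps):
        # True dynamics: commanded plus external torque accelerate the joint.
        velocity += (tau_cmd + tau_ext) / inertia * dt
        # Observer: integrate the modelled torque plus the current residual.
        integral += (tau_cmd + residual) * dt
        residual = gain * (inertia * velocity - p0 - integral)
        history.append(residual)
    return history

residuals = simulate_momentum_observer()
contact_detected = residuals[-1] > 0.5   # hypothetical detection threshold
```

Contact is typically declared when the residual exceeds a threshold, and the pattern of joint residuals can then be used to localise where on the arm the contact occurred.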

Abstract:
Laboratory Automation (LA) has the potential to accelerate solid-state materials discovery by enabling continuous robotic operation without human intervention. While robotic systems have been developed for tasks such as powder grinding and X-ray diffraction (XRD) analysis, fully automating powder handling at the milligram scale remains a significant challenge due to the complex flow dynamics of powders and the diversity of laboratory tasks. To address this challenge, this study proposes the SCU-Hand-SV (Soft Conical Universal Robotic Hand with Single-sheet Valve), which preserves the softness and conical sheet designs in prior work while incorporating a controllable valve at the cone apex to enable precise, incremental dispensing of milligram-scale powder quantities. The SCU-Hand-SV is integrated with an external balance through a feedback control system based on a model of powder flow and online parameter identification. Experimental evaluations with glass beads, monosodium glutamate, and titanium dioxide demonstrated that 80% of the trials achieved an error within ±2 mg, and the maximum error observed was approximately 20 mg across a target range of 20 mg to 3 g. In addition, by incorporating flow prediction models commonly used for hoppers and performing online parameter identification, the system is able to adapt to variations in powder dynamics. Compared to direct PID control, the proposed model-based control significantly improved both accuracy and convergence speed. These results highlight the potential of the proposed system to enable efficient and flexible powder weighing, with scalability toward larger quantities and applicability to a broad range of laboratory automation tasks.

Abstract:
During the execution of Multi-Agent Path Finding (MAPF) plans in real-life applications, the MAPF assumption that the fleet's movement is perfectly synchronized does not apply. Since some of the agents may become delayed due to internal or external factors, it is often necessary to use a robust execution method to avoid collisions caused by desynchronization. Robust execution methods, such as the Action Dependency Graph (ADG), synchronize the execution of risky actions, but often at the expense of increased plan execution cost, because they may require some agents to wait for the delayed agents. In such cases, the execution's cost can be reduced while still preserving safety by finding a new plan, either by rescheduling (reordering the agents at crossroads) or by the more general replanning, which is capable of finding new paths. However, these operations may be costly, and the new plan may not even lead to lower execution cost than the original plan: for example, the two plans may be identical, as some losses may not be recoverable at all. Therefore, we use a fully connected feed-forward neural network to estimate the benefit that can be achieved by a single replanning in scenarios with delayed agents, given the immediate state of the execution. The input to the neural network is a set of newly designed ADG-based features describing the execution's state and the impact of potential delays, and the output is an estimated benefit achievable by replanning. We train and test the network on a new labeled dataset containing 12,000 experiments and show that our proposed method is capable of significantly reducing the impact of recoverable delays.

Abstract:
We present HetroD, a dataset and benchmark for developing autonomous driving systems in heterogeneous environments. HetroD targets the critical challenge of navigating real-world heterogeneous traffic dominated by vulnerable road users (VRUs), including pedestrians, cyclists, motorcyclists, and vehicles. These mixed agent types exhibit complex behaviors such as hook turns, lane splitting, and informal right-of-way negotiation. Such behaviors pose significant challenges for autonomous vehicles but remain underrepresented in existing datasets focused on structured, lane-disciplined traffic. To bridge the gap, we collect a large-scale drone-based dataset to provide holistic observations of traffic scenes with centimeter-accurate annotations, HD maps, and traffic signal states. We further develop a modular toolkit for extracting per-agent scenarios to support downstream task development. In total, the dataset comprises over 65.4k high-fidelity agent trajectories, 70% of which are from VRUs. HetroD supports modeling of VRU behaviors in dense, heterogeneous traffic and provides standardized benchmarks for forecasting, planning, and simulation tasks. Evaluation results reveal that state-of-the-art prediction and planning models struggle with the challenges presented by our dataset: they fail to predict lateral VRU movements, cannot handle unstructured maneuvers, and exhibit limited performance in dense and multi-agent scenarios, highlighting the need for more robust approaches to heterogeneous traffic. See our project page for more examples: https://hetroddata.github.io/HetroD/

Abstract:
Autonomous exploration of obstacle-rich spaces requires strategies that ensure efficiency while guaranteeing safety against collisions with obstacles. This paper investigates a novel platform-agnostic reinforcement learning framework that integrates a graph neural network-based policy for next-waypoint selection with a safety filter ensuring safe mobility. Specifically, the neural network is trained using reinforcement learning through the Proximal Policy Optimization (PPO) algorithm to maximize exploration efficiency while minimizing safety filter interventions. Accordingly, when the policy proposes an infeasible action, the safety filter overrides it with the closest feasible alternative, ensuring consistent system behavior. In addition, this paper introduces a reward function shaped by a potential field that accounts for both the agent's proximity to unexplored regions and the expected information gain from reaching them. The proposed framework combines the adaptability of reinforcement learning-based exploration policies with the reliability provided by explicit safety mechanisms. This feature plays a key role in enabling the deployment of learning-based policies on robotic platforms operating in real-world environments. Extensive evaluations in both simulations and experiments performed in a lab environment demonstrate that the approach achieves efficient and safe exploration in cluttered spaces.
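
The override rule described above (replace an infeasible action with the closest feasible alternative) can be sketched as follows. This is a minimal illustration, assuming 2-D waypoints, a finite candidate set, and a user-supplied feasibility check; the function names and the circular obstacle are hypothetical and not taken from the paper.

```python
import math


def safety_filter(proposed, candidates, is_feasible):
    """Return (waypoint, intervened).

    If the proposed waypoint is feasible it passes through unchanged;
    otherwise it is replaced by the feasible candidate nearest to it.
    """
    if is_feasible(proposed):
        return proposed, False
    feasible = [c for c in candidates if is_feasible(c)]
    if not feasible:
        raise ValueError("no feasible waypoint available")
    closest = min(feasible, key=lambda c: math.dist(c, proposed))
    return closest, True


# Hypothetical environment: a disc obstacle of radius 1 at the origin.
def outside_obstacle(point):
    return math.dist(point, (0.0, 0.0)) > 1.0


candidates = [(2.0, 0.0), (0.0, 3.0), (-1.5, 0.0)]
filtered, intervened = safety_filter((0.5, 0.0), candidates, outside_obstacle)
passed, untouched = safety_filter((0.0, 3.0), candidates, outside_obstacle)
```

The `intervened` flag is the kind of signal a training loop could penalize in order to minimize safety filter interventions, as the abstract describes.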

Abstract:
Soft robots present unique challenges for accurate modeling and control due to their virtually infinite degrees of freedom and highly nonlinear deformations. High-fidelity continuum models offer accuracy but are often computationally prohibitive for real-time control, while purely learning-based policies are efficient yet frequently lack robustness and require extensive data collection. In this paper, we propose a hybrid control framework that trains Reinforcement Learning (RL) policies using the physics-based Geometric Variable-Strain (GVS) formulation. This integration enables a Proximal Policy Optimization (PPO) agent to learn on a compact, physically exact state parameterization within the SoRoSim environment, leveraging continuum mechanics for accuracy without the need for real-world data collection. We validate our approach in simulation through two experiments: a basketball throw task and a multi-step pick-and-place scenario. In the throwing task, the agent achieved a 100% success rate within a defined tolerance, with 55% of trials reaching the target with 1 cm precision. In the multi-step scenario, the controller maintained high accuracy with a maximum relative error of approximately 8.5%. These results demonstrate that combining GVS-based modeling with PPO yields robust, data-efficient control policies, providing a scalable solution for controlling soft robotic systems across diverse applications.

Abstract:
Catheter-based interventions are widely used for the diagnosis and treatment of cardiac diseases. Recently, robotic catheters have attracted attention for their ability to improve precision and stability over conventional manual approaches. However, accurate modeling and control of soft robotic catheters remain challenging due to their complex, nonlinear behavior. The Koopman operator enables lifting the original system data into a linear "lifted space", offering a data-driven framework for predictive control; however, manually chosen basis functions in the lifted space often oversimplify system behaviors and degrade control performance. To address this, we propose a neural network-enhanced Koopman operator framework that jointly learns the lifted space representation and Koopman operator in an end-to-end manner. Moreover, motivated by the need to minimize radiation exposure during X-ray fluoroscopy in cardiac ablation, we investigate open-loop control strategies using neural Koopman operators to reliably reach target poses without continuous imaging feedback. The proposed method is validated in two experimental scenarios: interactive position control and a simulated cardiac ablation task using an atrium-like cavity. Our approach achieves average errors of 2.1±0.4 mm in position and 4.9±0.6° in orientation, outperforming not only model-based baselines but also other Koopman variants in targeting accuracy and efficiency. These results highlight the potential of the proposed framework for advancing soft robotic catheter systems and improving catheter-based interventions.

Abstract:
LiDAR scene synthesis is an emerging solution to scarcity in 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds, and leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating limited LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing with solely an unconditional model. On the KITTI-360 benchmark, R3DPA achieves state-of-the-art performance. Code and pretrained models are available at https://github.com/valeoai/R3DPA.

Abstract:
Future trajectory prediction of a tracked pedestrian from an egocentric perspective is a key task in areas such as autonomous driving and robot navigation. The challenge of this task lies in the complex dynamic relative motion between the ego-camera and the tracked pedestrian. To address this challenge, we propose an ego-motion-guided trajectory prediction network based on the Mamba model. First, two Mamba models are used as encoders to extract pedestrian motion and ego-motion features from pedestrian movement and ego-vehicle movement, respectively. Then, an ego-motion-guided Mamba decoder explicitly models the relative motion between the pedestrian and the vehicle by integrating pedestrian motion features as historical context with ego-motion features as guiding cues to produce decoded features. Finally, the future trajectory is generated from the decoded features corresponding to the future timestamps. Extensive experiments demonstrate the effectiveness of the proposed model, which achieves state-of-the-art performance on the PIE and JAAD datasets.

Abstract:
Urban scene reconstruction is critical for autonomous driving, enabling structured 3D representations for data synthesis and closed-loop testing. Supervised approaches rely on costly human annotations and lack scalability, while current self-supervised methods often confuse static and dynamic elements and fail to distinguish individual dynamic objects, limiting fine-grained editing. We propose DIAL-GS, a novel dynamic instance-aware reconstruction method for label-free street scenes with 4D Gaussian Splatting. We first accurately identify dynamic instances by exploiting appearance-position inconsistency between warped rendering and actual observation. Guided by instance-level dynamic perception, we employ instance-aware 4D Gaussians as the unified volumetric representation, realizing dynamic-adaptive and instance-aware reconstruction. Furthermore, we introduce a reciprocal mechanism through which identity and dynamics reinforce each other, enhancing both integrity and consistency. Experiments on urban driving scenarios show that DIAL-GS surpasses existing self-supervised baselines in reconstruction quality and instance-level editing, offering a concise yet powerful solution for urban scene modeling.

Abstract:
Deep learning models in robotics often output point estimates with poorly calibrated confidences, offering no native mechanism to quantify predictive reliability under novel, noisy, or out-of-distribution inputs. Conformal prediction (CP) addresses this gap by providing distribution-free coverage guarantees, yet its reliance on fixed nonconformity scores ignores context and can yield intervals that are overly conservative or unsafe. We address this with Learnable Conformal Prediction (LCP), which replaces fixed scores with a lightweight neural function that leverages geometric, semantic, and model cues. Trained to balance coverage, efficiency, and calibration, LCP preserves CP's finite-sample guarantees while producing intervals that adapt to instance difficulty, achieving context-aware uncertainty without ensembles or repeated inference. On the MRPB benchmark, LCP raises navigation success to 91.5% versus 87.8% for Standard CP, while limiting path inflation to 4.5% compared to 12.2%. For object detection on COCO, BDD100K, and Cityscapes, it reduces mean interval width by 46-54% at 90% coverage, and on classification tasks (CIFAR-100, HAM10000, ImageNet) it shrinks prediction sets by 4.7-9.9%. The method achieves real-time performance on resource-constrained edge hardware (Intel NUC, <30W) while simultaneously providing uncertainty estimates along with the mean prediction.
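
For context, the fixed-score baseline that the abstract calls Standard CP can be sketched as split conformal prediction with absolute-residual scores. This shows only the baseline, not LCP's learned score function, and the helper names and calibration values are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample quantile used by split conformal prediction.

    With n calibration scores, the ceil((n + 1) * (1 - alpha))-th smallest
    score gives marginal coverage of at least 1 - alpha under
    exchangeability. If that rank exceeds n, the interval is unbounded.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def predict_interval(point_prediction, q):
    """Symmetric interval around a point prediction with half-width q."""
    return point_prediction - q, point_prediction + q


# Hypothetical calibration residuals |y - y_hat| from a held-out set.
calibration_scores = [float(i) for i in range(1, 20)]
q = conformal_quantile(calibration_scores, alpha=0.1)
lower, upper = predict_interval(100.0, q)
```

Because `q` is a single constant, every test input receives the same interval width. That is the context-blindness the abstract attributes to fixed nonconformity scores.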

Abstract:
This paper investigates a sample-based solution to the hybrid mode control problem across non-differentiable and algorithmic hybrid modes. Our approach reasons about a set of hybrid control modes as an integer-based optimization problem where we select what mode to apply, when to switch to another mode, and the duration for which we are in a given control mode. A sample-based variation is derived to efficiently search the integer domain for optimal solutions. We find our formulation yields strong performance guarantees that can be applied to a number of robotics-related tasks. In addition, our approach is able to synthesize complex algorithms and policies to compound behaviors and achieve challenging tasks. Last, we demonstrate the effectiveness of our approach in a real-world robotic example that requires reactive switching between long-term planning and high-frequency control.

Abstract:
Vision-Language Models (VLMs) are increasingly used in robotics for natural language understanding and executable plan generation, yet integrating them into real-time control pipelines remains challenging. Many existing systems rely on HTTP/JSON-based inference interfaces that require repeated Base64 serialization, introducing unnecessary overhead and increasing end-to-end latency. At the execution level, waiting for a full plan leads to stalls where no valid actions are available, while naive streaming of partial plans produces stop-and-go behavior due to token arrival gaps. To address these issues, we extend llama-ros with Stream-to-Act, a ROS2-native execution mechanism that begins acting once sufficient tokens arrive while ensuring continuous execution through an optimal start-time policy. Our open-source implementation is evaluated on RTX GPUs and NVIDIA Jetson platforms through end-to-end latency analysis, token generation throughput measurements, and execution timeline visualization. In addition, a Carla-based driving scenario illustrates how the proposed execution policy eliminates stop-and-go behavior and maintains continuous control, even when the total plan generation time remains unchanged.
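
The abstract does not specify the start-time policy, so the following is only an illustrative sketch of the underlying idea. It assumes the arrival time of each plan step and its execution duration are known (in practice they would have to be predicted, for example from token throughput). The function name and the numbers are hypothetical. Step `i` begins at `start` plus the durations of all earlier steps, so execution never stalls if `start` is at least `arrival_i` minus that elapsed time for every step.

```python
def earliest_continuous_start(arrival_times, durations):
    """Earliest start time such that every step has arrived before it is due.

    arrival_times[i]: time at which step i becomes available.
    durations[i]: time needed to execute step i.
    """
    start = 0.0
    elapsed = 0.0  # total duration of the steps before the current one
    for arrival, duration in zip(arrival_times, durations):
        start = max(start, arrival - elapsed)
        elapsed += duration
    return start


# Slow execution: acting can begin as soon as the first step arrives.
slow_start = earliest_continuous_start([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0])

# Fast execution: starting at 1.0 would stall, and waiting for the full plan
# (4.0) is unnecessary; 2.5 is the earliest stall-free start.
fast_start = earliest_continuous_start([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5])
```

In the second case the robot starts before the plan is complete yet never pauses mid-plan, which is the behavior the abstract contrasts with both full-plan waiting and naive streaming.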

Abstract:
Event cameras offer advantages in object detection tasks thanks to properties such as high-speed response, low latency, and robustness to motion blur. However, event cameras inherently lack texture and color information, making open-vocabulary detection particularly challenging. Current event-based detection methods are typically trained on predefined target categories, limiting their ability to generalize in settings where previously unseen objects are common. Vision-language models (VLMs) have enabled open-vocabulary object detection in RGB images. However, the modality gap between images and event streams makes it ineffective to directly transfer CLIP to event data, as CLIP was not designed for event streams. To bridge this gap, we propose an event-image knowledge distillation framework, leveraging CLIP's semantic understanding to achieve open-vocabulary object detection on event data. Instead of training CLIP directly on event streams, we use image frames as teacher model inputs, guiding the event-based student model to learn CLIP's rich visual representations. Through spatial attention-based distillation, the student network learns meaningful visual features directly from raw event inputs, while inheriting CLIP's broad visual knowledge. Furthermore, to prevent information loss due to event data segmentation, we design a hybrid Spiking Neural Network (SNN) and Convolutional Neural Network (CNN) framework. Unlike fixed-group event segmentation methods, which often discard crucial temporal information, our SNN adaptively determines the optimal event segmentation moments, ensuring that key temporal features are extracted. The extracted event features are then processed by CNNs for object detection.

Abstract:
Accurate state estimation in GPS-denied environments is critical for autonomous quadrotor navigation. Conventional visual-inertial odometry (VIO) pipelines rely on dense feature extraction and tracking, which increases computational cost and is prone to drift when low-quality features dominate. Although learning-based detectors improve robustness, most are too computationally heavy for embedded deployment. This paper proposes a lightweight learning-based feature selection framework that prunes unreliable features to enable efficient optical flow navigation. A compact Convolutional Neural Network (CNN) is employed, with its pruning threshold adaptively adjusted to maintain a stable number of reliable features. The CNN augments ORB and Lucas-Kanade optical flow in a multithreaded pipeline with rotational false-velocity compensation and EKF fusion. Experiments on the Quanser QDrone2 demonstrate up to 75-80% reduction in position RMSE and approximately 25-30% reduction in computation time compared to the Fourier-based Phase Correlation (FPC) method, while sustaining real-time performance above 120 Hz without reliance on external localization systems.

Abstract:
The remarkable interaction and reasoning capabilities of Large Language Models (LLMs) make them promising in collaborative Embodied AI tasks, particularly for Object-goal Navigation (ObjNav) tasks that require both decision-making and transparent explanation. However, existing work mainly uses LLMs as proxy target indicators, leaving the role of direct action decision-making to other components. This separation causes non-transparent action decisions and extra adaptation requirements. This observation prompts us to reconsider their role: Can LLMs be transformed into the central "brain" of agents, directly outputting action choices and explaining their reasoning? In pursuit of this inquiry, we decouple perception from action reasoning to focus specifically on the feasibility of deploying LLMs as navigation policies. We introduce the Experience-aware Action Cogitator, which integrates two kinds of experience, i.e., expert-informed experience and trial-and-error experience, into prompts. Inspired by David Hume's philosophical principle that knowledge is acquired through reflective experience, these experiences are designed to answer two critical questions: (i) "What action should be selected as the best option?" and (ii) "What actions have been tried but proven suboptimal?" By analyzing and reflecting on these two types of experience, we show that LLMs can reason about navigation actions in unseen environments effectively without costly fine-tuning. Experiments on the widely adopted iTHOR yield significant improvements in ObjNav performance. These compelling results validate the feasibility of our proposed method. Compared to vanilla LLMs, the proposed method nearly doubles both the Success Rate and the Success weighted by Path Length, reaching peak values of 73.93% and 48.35% in unseen scenes, respectively.

Abstract:
Stereo depth estimation has drawn widespread attention from the robotics and vision community due to its broad applications such as 3D reconstruction. Recently, stereo matching foundation models have made significant progress by being trained on large-scale datasets containing natural images. However, directly applying these large pretrained models to minimally invasive surgery remains challenging due to domain shifts such as specular highlights and low-texture tissue. In this paper, we propose a parameter-efficient adaptation framework to address this gap. Specifically, we introduce Camera-Aware LoRA (CaLoRA) for fine-tuning FoundationStereo, using a camera-aware scaling gate computed from focal length and baseline to address intraoperative intrinsics drift arising from instrument self-heating and other thermal effects. We further develop a geometric consistency constraint and a spectral alignment regularizer that enforce cross-view depth agreement. Extensive experiments on the SCARED and Hamlyn datasets indicate that the proposed method achieves state-of-the-art performance. Notably, CaLoRA is easy to integrate into standard fine-tuning pipelines, requiring no backbone changes and only a small number of trainable parameters.

Abstract:
Social robot teleoperation is a skill that must be acquired through practice with the social robot. Mobile neuroimaging and human-computer interface performance metrics permit the gathering of information from the operators' systemic and behavioral responses associated with their skill acquisition. Profiling the skill levels of social robot operators using this information can help improve training protocols. In this study, thirty-two participants performed real-world social robot teleoperation tasks. Brain function signals from the prefrontal cortex (PFC) were collected using functional near-infrared spectroscopy (fNIRS), along with behavioral data from interactions with the system. Participants were divided into two groups (high and low performance) based on an integrative metric of task efficiency, workload, and presence when operating the social robot. Significant differences were found in the operation time, width, and multiscale entropy of the hemoglobin oxygenation curve of the operator's PFC. Functional connectivity in the PFC also differed between the low- and high-performance groups, both when connectivity networks were compared and in the leaf fraction metrics of the functional networks. These findings contribute to understanding the operator's progress during teleoperation training protocols and to designing the interface to help enhance task performance.

Abstract:
Endoscopic submucosal dissection (ESD) is an effective technique to resect early cancers in the gastrointestinal (GI) tract. Bimanual telerobotic manipulation is an approach to performing ESD intuitively and efficiently, which requires two robotic instruments with flexibility, stiffness, dexterity and accuracy. However, the priority of these properties depends on the specific surgical tasks. In this work, we proposed the first heterogeneous flexible manipulators (HFMs) for bimanual ESD, which can take advantage of different mechanical structures. The grasping instrument employs a serial articulated manipulator (SAM) to perform multidirectional (relatively higher dexterity) and stable (relatively higher stiffness) traction for better submucosal visualization. The electrosurgical instrument utilizes a parallel continuum wrist (PCW) to execute accurate (higher accuracy) tissue dissection. Both HFMs have sufficient flexibility to go through the flexible endoscopic working channels. Based on the HFMs, we established a transendoscopic telerobotic system. The kinematics of the SAM and PCW were built in the endoscope frame using the Denavit-Hartenberg (DH) method and Cosserat rod method,

Abstract:
We introduce DriveAgent, a modular multi-agent autonomous driving framework that leverages large language model (LLM) reasoning combined with multimodal sensor fusion. DriveAgent orchestrates specialized agents operating on camera, Light Detection and Ranging (LiDAR), Inertial Measurement Unit (IMU), and Global Positioning System (GPS) data with LLM-driven analytical processes to deliver temporally aligned perception, causal reasoning, and action recommendations. The framework operates through a modular agent-based pipeline comprising four principal modules: (i) a descriptive analysis agent identifying critical sensor data events based on filtered timestamps, (ii) dedicated vehicle-level analysis conducted by LiDAR and vision agents that collaboratively assess vehicle conditions and movements, (iii) environmental reasoning and causal analysis agents explaining contextual changes and their underlying mechanisms, and (iv) an urgency-aware decision-generation agent prioritizing insights and proposing timely maneuvers. This modular design empowers the LLM to effectively coordinate specialized perception and reasoning agents, delivering cohesive, interpretable insights into complex autonomous driving scenarios. Extensive experiments demonstrate that DriveAgent substantially outperforms baseline methods, achieving a 26.31% improvement in vehicle reasoning and consistent enhancements of up to 2.85% in environmental reasoning. These results highlight the effectiveness of our LLM-driven multi-agent sensor fusion framework in boosting the robustness and reliability of autonomous driving systems. Code is available at https://github.com/Paparare/DriveAgent

Abstract:
Magnetic compression anastomosis (MCA) offers a promising solution for minimally invasive anastomosis surgery. However, current MCA schemes lack safe, real-time localization and guidance for compression magnets, hindering surgeons' ability to control the compression magnets effectively in complex, unstructured endoluminal environments. To address these limitations, this article introduces the MagsL-HUD endoscopic system, a novel solution that enables multimagnetic six-degree-of-freedom (six-DoF) localization and head-up display (HUD) guidance within the endoscopic view (EV). Specifically, the system integrates a developed Endo-MagCap device with an orthogonal magnet configuration, along with a magnetic sensor array, to achieve real-time full-pose localization. An endoscopic camera model is incorporated for HUD visualization, enabling more intuitive interaction and better-informed decisions by surgeons. Finally, the effectiveness of the MagsL-HUD endoscopic system is validated through laboratory experiments and ex vivo animal trials. The system demonstrates six-DoF tracking accuracy with average errors of 0.0070 m and 0.1437 rad, and 0.0071 m and 0.1721 rad in the designed trajectory cases for two compression magnets, respectively. Additionally, ex vivo porcine tests confirm the system's feasibility and applicability, successfully performing a stomach-colon MCA surgery with a final compression gap of approximately 0.00247 m. Further comparative studies demonstrate that the MagsL-HUD method has a compression success rate of 71.4% versus 42.9% for the non-HUD approaches in the designed tests. This work represents a significant step toward the clinical adoption of magnetic-assisted endoscopy for minimally invasive anastomosis surgeries, holding substantial practical significance for improving the safety and efficacy of MCA procedures in complex, unstructured endoluminal environments.

Abstract:
Convex polytopes have compact representations and exhibit convexity, which makes them suitable for abstracting obstacle-free spaces from various environments. Existing generation methods struggle to balance high-quality output and efficiency. Moreover, various tasks impose another crucial requirement, which we refer to as manageability: the convex polytope must accurately contain certain seed point sets, such as a robot or a front-end path. In this paper, we propose Fast Iterative Regional Inflation (FIRI) to generate high-quality convex polytopes while ensuring efficiency and manageability simultaneously. FIRI consists of two iteratively executed submodules: Restrictive Inflation (RsI) and Maximum Volume Inscribed Ellipsoid (MVIE) computation. By explicitly incorporating constraints that include the seed point set, RsI guarantees manageability. Meanwhile, iterative MVIE optimization ensures high-quality results through monotonic volume bound improvement. In terms of efficiency, we design methods tailored to the low-dimensional and multi-constrained nature of both modules, resulting in orders of magnitude improvement compared to generic solvers. Notably, in 2-D MVIE, we pr

Abstract:
Agricultural harvesting grippers have emerged as a pivotal technology in the evolution of smart agriculture, enabling efficient fruit collection. Existing grippers frequently fail to accommodate the diverse morphologies of fruits. Moreover, achieving stable grasping under external disturbances, such as natural wind, remains a significant challenge. To mitigate these limitations, we present a novel under-actuated gripper for fruit harvesting, together with harvesting strategies aimed at optimizing both the harvest success rate and fruit integrity. First, a strategy is developed to improve the success rate of harvesting operations under disturbance. The harvesting strategy involves stabilizing the fruit by enclosing the stem, followed by grasping and severing the stem to detach the fruit. Second, the auxiliary locking and grasping coupling mechanism allows a single actuator to drive all components of the gripper, reducing cost and control complexity. Third, the force distribution constraint unit allocates the actuator's power between grasping and shearing actions, enabling regulation of the grasping force to protect the fruit. Finally, the performance of the gripper is validated through a series of rigorous experiments on fruits with diverse sizes, textures, and surface characteristics, demonstrating its superior efficacy in preserving fruit integrity during real-world harvesting scenarios.

Abstract:
Timely detection of incipient slip is critical for delicate robotic grasping and dexterous manipulation. However, existing learning-based methods suffer from detection latency and high computational demands. In this paper, we present an ultra-fast lightweight incipient slip detection framework based on hyperdimensional (HD) computing, using the PapillArray optical tactile sensor. Our approach introduces a novel graphical-spatial-temporal HD encoding scheme coupled with a context-driven training and inference strategy, achieving a high slip detection accuracy of 91.78% in offline evaluation. The resulting model is exceptionally compact and highly edge-compatible, with a size of only 0.375 kB. Furthermore, hardware acceleration on an FPGA enables inference within 0.42 microseconds, representing an over 10^4× speedup compared to optimized CPU implementations. Online robotic experiments involving grip-force control based on the proposed slip detection method further validate its practical effectiveness. This work offers a practical and scalable solution for real-time slip detection in robotic manipulation tasks.

Abstract:
A common assumption to simplify the problem of controlling a multi-UAV slung load system (MUSLS) is that the flexible cables can be modeled as massless rigid rods. In this work, we propose an alternative Euler-Newton derived dynamical model which uses a series of rigid links to model the flexible cables. The model is specifically designed to allow efficient simulation using Featherstone's articulated body algorithm. We perform real-world validation of this model on gentle, aggressive, and tension-engagement maneuvers and run a parameter sweep to determine the number of links, joint damping, and joint friction to achieve the greatest model fidelity. The model closely matches real-world flight data with mean load translation errors below 132 mm (5.5% of the cable length) and orientation errors below 11.4 degrees. We make the real-world flight data publicly available for the development of future cable models.

Abstract:
Robust autonomous navigation for Autonomous Aerial Vehicles (AAVs) in complex environments is a critical capability. However, modern end-to-end navigation faces a key challenge: the high-frequency control loop needed for agile flight conflicts with low-frequency perception streams, which are limited by sensor update rates and significant computational cost. This mismatch forces conventional synchronous models into undesirably low control rates. To resolve this, we propose an asynchronous reinforcement learning framework that decouples perception and control, enabling a high-frequency policy to act on the latest IMU state for immediate reactivity, while incorporating perception features asynchronously. To manage the resulting data staleness, we introduce a theoretically-grounded Temporal Encoding Module (TEM) that explicitly conditions the policy on perception delays, a strategy complemented by a two-stage curriculum to ensure stable and efficient training. Validated in extensive simulations, our method was successfully deployed in zero-shot sim-to-real transfer on an onboard NUC, where it sustains a 100 Hz control rate and demonstrates robust, agile navigation in cluttered real-world environments. Our source code will be released for community reference.

Abstract:
Enabling robots to navigate safely and efficiently in dynamic, crowded environments requires learning from large-scale demonstrations, which are costly and unsafe to collect on physical platforms. While human videos offer a rich and scalable alternative, transferring these motion patterns to robots is challenged by the embodiment gap across observation and action spaces. This paper presents Human2Nav, a data-efficient framework that learns navigation policies directly from human videos via test-time feasibility-guided flow matching. Human2Nav employs a bird's-eye-view representation to align visual observations and trains a conditional flow matching model to capture nuanced human navigation patterns. Crucially, we introduce a training-free feasibility guidance mechanism that during inference steers generated trajectories to satisfy heterogeneous robot-specific kinematic and dynamic constraints without retraining. Extensive experiments in simulation and on real-world heterogeneous robotic platforms demonstrate that Human2Nav achieves superior data efficiency and navigation performance compared to model-based and learning-based baselines, while ensuring safe and executable trajectories across diverse crowd scenarios.

Abstract:
Humanoid robots promise transformative capabilities for industrial and service applications. While recent advances in Reinforcement Learning (RL) yield impressive results in locomotion, manipulation, and navigation, the proposed methods typically require enormous simulation samples to account for real-world variability. This work proposes a novel one-stage training framework, Learn to Teach (L2T), which unifies teacher and student policy learning. Our approach recycles simulator samples and synchronizes the learning trajectories through shared dynamics, significantly reducing sample complexities and training time while achieving state-of-the-art performance. Furthermore, we validate the RL variant (L2T-RL) through extensive simulations and hardware tests on the Digit robot, demonstrating zero-shot sim-to-real transfer and robust performance over 12+ diverse terrains without depth estimation modules. Experimental videos are available online at https://lidar-learn-to-teach.github.io.

Authors: Christian Geckeler, Steffen Kirchgeorg, Georg Miguel Strunck, Frederik Bendix Thostrup, Florencia Sangermano, Andrea Desiderato, Martina Lüthi, Meret Jucker, Mailyn Adriana Gonzalez Herrera, Nicolás D. Franco-Sierra, Paola Pulido-Santacruz, Jia Jin Marc Chang, Yin Cheong Aden Ip, Elvira Mächler, Asger Svenning, Guillaume Mougeot, Toke Thomas Høye, Fabian Fopp, Loic Pellissier, David Dao, Kristy Deiner, Claus Melvad, Salua Hamaza, Stefano Mintchev

Abstract:
Tropical rainforests are among the most biodiverse ecosystems on Earth and also among the most threatened by anthropogenic pressures such as deforestation and climate change. Understanding human impact and the efficacy of conservation and preservation efforts requires scalable and comprehensive biodiversity monitoring solutions. As a winning finalist of the XPRIZE Rainforest Competition, ETH BiodivX collected biodiversity data from 100 ha of rainforest in the Amazon, in 24 h. A suite of complementary data types was captured, from remote sensing maps and close-up images to surface and water environmental DNA (eDNA), along with canopy rafts that collect specimens, close-up images, and bioacoustic recordings. Optimized workflows allow for a full RGB and digital surface model (DSM) after only one and a half hours. The captured DSM was then used to collect surface eDNA fully autonomously, at distances up to 1.4 km from the base station. Preprocessed multispectral satellite remote sensing provides indicators of water locations, which were then sampled for water eDNA. The canopy rafts can act as communication nodes or data collection stations, providing long-term bioacoustic recordings, insect images, and specimens. By utilizing a commercial drone platform with modular payloads for diverse tasks, the solutions are robust and easy to use. These field-proven systems mark a major step toward scalable biodiversity monitoring, including in the world's most remote and biodiverse regions.

Abstract:
Camera localization within LiDAR maps has gained significant attention due to its potential for accurate positioning with low-cost and lightweight sensors compared to LiDAR-based systems. However, existing methods often prioritize localization accuracy, sometimes compromising efficiency, which can limit their suitability for real-time applications. To address these issues, we propose I2D-LocX, a lightweight monocular camera localization framework with three branches, establishing pixel-level and feature-level constraints to enhance localization performance without increasing model complexity. Specifically, the main branch generates a flow map to represent pixel-point displacements. One auxiliary branch shares the same input as the main branch and employs an additional decoder to evaluate the confidence of the flow map. The other auxiliary branch leverages a zero-flow generated from the displacement-free input to guide feature matching, thereby enhancing localization robustness. Notably, both auxiliary branches share parameters with the main branch and are omitted during inference, ensuring computational efficiency. Extensive experiments on benchmark datasets, including KITTI-Odometry, Argoverse, Waymo, and nuScenes, show that I2D-LocX can achieve centimeter-level localization accuracy with about 37 millisecond inference time, greatly improving the localization performance for real-world applications.

Abstract:
We propose a graph SLAM algorithm for sparse range sensing that incorporates a soft Manhattan world utilizing landmark-landmark constraints. Sparse range sensing is necessary for tiny robots that do not have the luxury of using heavy and expensive sensors. Existing SLAM methods dealing with sparse range sensing lack accuracy and accumulate drift error over time due to limited access to data points. Algorithms that cover this flaw using structural regularities, such as the Manhattan world (MW), have shortcomings when mapping real-world environments that do not coincide with the rules. We propose SoMaSLAM, a 2D graph SLAM designed for tiny drones with sparse range sensing. Our approach effectively maps sparse range data without enforcing strict structural regularities and maintains an adaptive graph. We implement the MW assumption as soft constraints, which we refer to as a soft Manhattan world. We propose novel soft landmark-landmark constraints to incorporate the soft MW into graph SLAM. Through extensive evaluation, we demonstrate that our proposed SoMaSLAM method improves localization accuracy on diverse datasets and is flexible enough to be used in the real world. We plan to release our code and dataset on our project page https://SoMaSLAM.github.io/.

Abstract:
The Dynamic Targeting (DT) mission concept is an approach to satellite observation in which a lookahead sensor gathers information about the upcoming environment and uses this information to intelligently plan observations. Previous work has shown that DT has the potential to increase the science return across several applications. However, DT mission concepts must address challenges such as the limited spatial extent of onboard lookahead data and instrument mobility, data throughput, and onboard computation constraints. In this work, we show how the performance of DT systems can be improved by using supplementary data streamed from geostationary satellites that provide lookahead information up to 35 minutes ahead of time rather than the 1 minute latency from an onboard lookahead sensor. While there is a greater volume of geostationary data, the search space for observation planning explodes exponentially with the size of the horizon. To address this, we introduce a hierarchical planning approach in which the geostationary data is used to plan a long-term observation blueprint in polynomial time, then the onboard lookahead data is leveraged to refine that plan over short-term horizons. We compare the performance of our approach to that of traditional DT planners relying on onboard lookahead data across four different problem instances: three cloud avoidance variations and a storm hunting scenario. We show that our hierarchical planner outperforms the traditional DT planners by up to 41% and examine the features of the scenarios that affect the performance of our approach. We demonstrate that incorporating geostationary satellite data is most effective for dynamic problem instances in which the targets of interest are sparsely distributed throughout the overflight.

Abstract:
This paper presents RoboMatch, a novel unified teleoperation platform for mobile manipulation with an auto-matching network architecture, designed to tackle long-horizon tasks in dynamic environments. Our system enhances teleoperation performance, data collection efficiency, task accuracy, and operational stability. The core of RoboMatch is a cockpit-style control interface that enables synchronous operation of the mobile base and dual arms, significantly improving control precision and data collection. Moreover, we introduce the Proprioceptive-Visual Enhanced Diffusion Policy (PVE-DP), which leverages Discrete Wavelet Transform (DWT) for multi-scale visual feature extraction and integrates high-precision IMUs at the end-effector to enrich proprioceptive feedback, substantially boosting fine manipulation performance. Furthermore, we propose an Auto-Matching Network (AMN) architecture that decomposes long-horizon tasks into logical sequences and dynamically assigns lightweight pre-trained models for distributed inference. Experimental results demonstrate that our approach improves data collection efficiency by over 20%, increases task success rates by 20-30% with PVE-DP, and enhances long-horizon inference performance by approximately 40% with AMN, offering a robust solution for complex manipulation tasks. Project website: https://robomatch.github.io

Abstract:
Robotic surgical systems rely heavily on high-quality visual feedback for precise teleoperation; yet, surgical smoke from energy-based devices significantly degrades endoscopic video feeds, compromising the human-robot interface and surgical outcomes. This paper presents RGA-Net (Reciprocal Gating and Attention-fusion Network), a novel deep learning framework specifically designed for smoke removal in robotic surgery workflows. Our approach addresses the unique challenges of surgical smoke (including dense, non-homogeneous distribution and complex light scattering) through a hierarchical encoder-decoder architecture featuring two key innovations: (1) a Dual-Stream Hybrid Attention (DHA) module that combines shifted window attention with frequency-domain processing to capture both local surgical details and global illumination changes, and (2) an Axis-Decomposed Attention (ADA) module that efficiently processes multi-scale features through factorized attention mechanisms. These components are connected via reciprocal cross-gating blocks that enable bidirectional feature modulation between encoder and decoder pathways. Extensive experiments on the DesmokeData and LSD3K surgical datasets demonstrate that RGA-Net achieves superior performance in restoring visual clarity suitable for robotic surgery integration. Our method enhances the surgeon-robot interface by providing consistently clear visualization, laying a technical foundation for alleviating surgeons' cognitive burden, optimizing operation workflows, and reducing iatrogenic injury risks in minimally invasive procedures. These practical benefits could be further validated through future clinical trials involving surgeon usability assessments. The proposed framework represents a significant step toward more reliable and safer robotic surgical systems through computational vision enhancement.

Abstract:
Downstream fine-tuning of vision-language-action (VLA) models enhances robotics, yet exposes the pipeline to backdoor risks. Attackers can pretrain VLAs on poisoned data to implant backdoors that remain stealthy but can trigger harmful behavior during inference. However, existing defenses either lack mechanistic insight into multimodal backdoors or impose prohibitive computational costs via full-model retraining. To this end, we uncover a deep-layer attention grabbing mechanism: backdoors redirect late-stage attention and form compact embedding clusters near the clean manifold. Leveraging this insight, we introduce Bera, a test-time backdoor erasure framework that detects tokens with anomalous attention via latent-space localization, masks suspicious regions using deep-layer cues, and reconstructs a trigger-free image to break the trigger-unsafe-action mapping while restoring correct behavior. Unlike prior defenses, Bera requires neither retraining of VLAs nor any changes to the training pipeline. Extensive experiments across multiple embodied platforms and tasks show that Bera effectively maintains nominal performance, significantly reduces attack success rates, and consistently restores benign behavior from backdoored outputs, thereby offering a robust and practical defense mechanism for securing robotic systems.

Abstract:
This work presents DemoBot, a learning framework that enables a dual-arm, multi-finger robotic system to acquire complex manipulation skills from a single unannotated RGB-D video demonstration. The method extracts structured motion trajectories of both hands and objects from raw video data. These trajectories serve as motion priors for a novel reinforcement learning (RL) pipeline that learns to refine them through contact-rich interactions, thereby eliminating the need to learn from scratch. To address the challenge of learning long-horizon manipulation skills, we introduce: (1) Temporal-segment based RL to enforce temporal alignment of the current state with demonstrations; (2) Success-Gated Reset strategy to balance the refinement of readily acquired skills and the exploration of subsequent task stages; and (3) Event-Driven Reward curriculum with adaptive thresholding to guide the RL learning of high-precision manipulation. The novel video processing and RL framework successfully achieved long-horizon synchronous and asynchronous bimanual assembly tasks, offering a scalable approach for direct skill acquisition from human videos.

Abstract:
The Dubins Moving Target Traveling Salesman Problem with Obstacles (Dubins MT-TSP-O) seeks an obstacle-free trajectory for an agent with a fixed speed and minimum turning radius that intercepts several moving targets. To tackle this NP-hard problem, we introduce the Lazy Iterated Random Generalized TSP (Lazy IRG) algorithm. Each iteration of Lazy IRG samples a set of possible interception points in space-time along the trajectories of the targets. Lazy IRG then manages the high computational cost of motion planning by alternating between two steps: first, it optimistically selects a sequence of interception points by solving a Generalized TSP (GTSP) assuming an obstacle-free world; second, it searches for obstacle-free trajectories between consecutive points in the sequence using an obstacle-aware RRT-Connect planner. If a trajectory is not found, Lazy IRG solves the GTSP again; otherwise, Lazy IRG enters its next iteration and samples new interception points. By deferring expensive collision-checking, our method efficiently focuses computational effort on the most promising solutions. Numerical results show that Lazy IRG finds significantly lower-cost solutions within a 1-minute time budget compared to the existing IRG-PGLNS algorithm.

Abstract:
Robust robot planning in dynamic, human-centric environments remains challenging due to multimodal uncertainty, the need for real-time adaptation, and safety requirements. Optimization-based planners enable explicit constraint handling but can be sensitive to initialization and struggle in dynamic settings. Learning-based planners capture multimodal solution spaces more naturally, but often lack reliable constraint satisfaction. In this paper, we introduce a unified generation-refinement framework that combines reward-guided conditional flow matching (CFM) with model predictive path integral (MPPI) control. Our key idea is a bidirectional information exchange between generation and optimization: reward-guided CFM produces diverse, informed trajectory priors for MPPI refinement, while the optimized MPPI trajectory warm-starts the next CFM generation step. Using autonomous social navigation as a motivating application, we demonstrate that the proposed approach improves the trade-off between safety, task performance, and computation time, while adapting to dynamic environments in real-time. The source code is publicly available at https://cfm-mppi.github.io.
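
The MPPI refinement step mentioned above can be illustrated with a minimal sketch. It assumes a toy 1D single-integrator, a hand-written quadratic cost, and plain Gaussian perturbations standing in for the CFM-generated priors; the function names and all parameter values are hypothetical and are not taken from the paper.

```python
import math
import random

def rollout_cost(x0, controls, goal, dt=0.1):
    """Cost of a 1D single-integrator rollout: squared goal error plus control effort."""
    x, cost = x0, 0.0
    for u in controls:
        x += u * dt
        cost += (x - goal) ** 2 + 0.01 * u ** 2
    return cost

def mppi_step(x0, nominal, goal, rng, n_samples=64, sigma=1.0, lam=1.0):
    """One MPPI update: perturb the nominal controls, score rollouts, average with softmax weights."""
    noises, costs = [], []
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, sigma) for _ in nominal]
        noises.append(eps)
        costs.append(rollout_cost(x0, [u + e for u, e in zip(nominal, eps)], goal))
    c_min = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(weights)
    # Shift each control by the weighted average of its sampled perturbations.
    return [u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
            for t, u in enumerate(nominal)]

rng = random.Random(0)
controls = [0.0] * 10
for _ in range(20):
    controls = mppi_step(0.0, controls, 1.0, rng)
print(rollout_cost(0.0, controls, 1.0))
```

In the bidirectional scheme described in the abstract, the refined control sequence would then warm-start the next generation step; in this sketch that role is played simply by feeding `controls` back into `mppi_step`.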

Abstract:
Semantic understanding of 3D scenes is essential for robots to operate effectively and safely in complex environments. Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis often rely on dense multi-view inputs and require scene-specific optimization, limiting their practicality and scalability in real-world applications. To address these challenges, we propose SemGS, a feed-forward framework for reconstructing generalizable semantic fields from sparse image inputs. SemGS uses a dual-branch architecture to extract color and semantic features, where the two branches share shallow CNN layers, allowing semantic reasoning to leverage textural and structural cues in color appearance. We also incorporate a camera-aware attention mechanism into the feature extractor to explicitly model geometric relationships between camera viewpoints. The extracted features are decoded into dual-Gaussians that share geometric consistency while preserving branch-specific attributes, and further rasterized to synthesize semantic maps under novel viewpoints. Additionally, we introduce a regional smoothness loss to enhance semantic coherence. Experiments show that SemGS achieves state-of-the-art performance on benchmark datasets, while providing rapid inference and strong generalization capabilities across diverse synthetic and real-world scenarios.

Abstract:
Underwater and in-air environments exhibit distinct imaging characteristics, which should be carefully considered and effectively exploited for accurate depth estimation. In this work, we analyze the effectiveness of wavelength-dependent attenuation for underwater depth estimation and show that it is helpful but insufficient to perform depth estimation independently. Therefore, we propose a fast underwater monocular depth estimation network that incorporates underwater light absorption difference (ULAD) as supplementary information. Compared with methods that rely solely on RGB input, the proposed approach provides more accurate depth predictions. In our network, RGB and ULAD features are extracted by MobileNetV4 and fused using FusionMamba, followed by decoding and refinement with a micro Vision Transformer. The network is trained on the USOD10K dataset and evaluated on both its test set and the FLSea dataset. Experimental results demonstrate that our method achieves more accurate depth estimation and higher efficiency compared with other lightweight networks. Furthermore, compared with existing state-of-the-art fast underwater depth estimation methods, our network reduces the number of parameters by 10% and improves inference speed by 43%.

Abstract:
This paper introduces a learning-based control framework for a soft robotic actuator system designed to modulate intracranial pressure (ICP) waveforms, which is essential for studying cerebrospinal fluid dynamics and pathological processes underlying neurological disorders. A two-layer framework is proposed to safely achieve a desired ICP waveform modulation. First, a model predictive controller (MPC) with a disturbance observer is used for offset-free tracking of the system's motor position reference trajectory under safety constraints. Second, to address the unknown nonlinear dependence of ICP on the motor position, we employ a Bayesian optimization (BO) algorithm for online learning of a motor position reference trajectory that yields the desired ICP modulation. The framework is experimentally validated using a test bench with a brain phantom that replicates realistic ICP dynamics in vitro. Compared to a previously employed proportional-integral-derivative controller, the MPC reduces mean and maximum motor position reference tracking errors by 83% and 73%, respectively. In less than 20 iterations, the BO algorithm learns a motor position reference trajectory that yields an ICP waveform with the desired mean and amplitude.

Abstract:
Robot-assisted minimally invasive neurosurgeries have shown great promise in enabling lower invasiveness and faster patient recovery times. However, performing such surgeries remains challenging, mainly due to the use of rigid surgical tools and limited accessibility to deep-seated brain structures. Employing robotically steerable tools could address these challenges, as these devices, being relatively more dexterous, can gain access to different regions in the brain. The autonomous control of these tools could further enable manipulation with higher precision and lower procedural time, facilitating less fatigue for surgeons. In this paper, we present a control strategy for the precise manipulation of a robotically steerable endoscopic cannula (RSEC). The proposed control architecture uses a combination of inverse kinematics, endoscopic imaging, and electromagnetic tracking feedback to perform task-space control of the RSEC in real-time. A joint angle estimation algorithm is proposed to estimate the bending angles of the RSEC using an endoscopic camera. The tip-position RMSE value of the RSEC when bending the proximal and distal joints, obtained using the proposed control strategy, was 0.7 mm. The results indicate that the proposed method can be used to achieve position control of the RSEC with sub-mm accuracy.

Abstract:
Occupational exoskeletons are emerging as a promising solution for industrial applications, providing support to reduce fatigue and the risk of musculoskeletal disorders. One of the main challenges limiting their widespread adoption is that most existing devices cannot deliver real-time, adaptable, and context-aware assistance. This paper presents the first fully vision-driven control strategy for a bimanual upper-limb soft exoskeleton, enabling adaptive assistance during industrial tool manipulation. The approach integrates three modules: tool recognition and segmentation, hand tracking with gesture recognition, and a fusion layer that ensures reliable understanding of the manipulation context. This allows modulation of lifting assistance in real time according to the weight of the grasped object. Experiments with human participants demonstrated that the proposed approach reduces biceps activation by more than 50% compared to the no-support condition, while operating in real time on embedded hardware. The method is robust to hand-object occlusions, camera repositioning, and dynamic environments, demonstrating its practicality for industrial deployment. Overall, this work establishes vision-based control as a scalable solution for ergonomic, adaptive exoskeletons that enhance safety and productivity in demanding workplaces.

Abstract:
Soft robotics draws inspiration from biological appendages like seahorse tails, octopus arms, and elephant trunks, which demonstrate remarkable flexibility and diverse functionalities. While advances in soft robotics have enabled delicate manipulation, safe human interaction, and medical applications, existing systems still fall short of nature in their ability to change length. This paper presents a novel rolling-disk-based continuum robot design that replicates natural logarithmic spiral geometry while overcoming limitations of previous bio-inspired systems, including fixed length, complex elastic interconnects, and absent central lumens. The proposed 3D-printable architecture enables cost-effective rapid prototyping with adjustable parameters, achieving both logarithmic spiral and constant curvature bending, consistent with established kinematic models. A central hollow passage supports tool integration for minimally invasive procedures, while contraction capability enables dynamic length adjustment. Comprehensive mathematical analysis, CAD development, and SOFA simulations validate the design conceptualisation. Experimental demonstrations confirm bending, contraction, and grasping capabilities across diverse object geometries, establishing a foundation for scalable, adaptable tendon-driven continuum robots that bridge biological inspiration with practical engineering implementation.

Abstract:
In social contexts, correctly interpreting communicative signals is essential for understanding an interlocutor's reactions. Emotional cues provide important context, enhancing mutual understanding and enabling more natural, adaptive interactions. For non-humanoid robots, which typically have more limited interaction capabilities, it could be necessary to combine multiple non-verbal signals and to consider the context in which they are used. In this work, we present the design of non-verbal behaviors that enable a non-humanoid robot to communicate its intended emotional state effectively. Our strategy focuses on the use of LEDs as non-verbal cues along with facial expressions for conveying emotions. We conducted a user study to evaluate the effect of these channels with and without context. Our results show that the robot conveys emotions more transparently when context is included, and that blinking LEDs can be an effective channel for communicating emotion. Results suggest that blinking alone is a minimal but functional cue, with richer models performing better without context. When short contextual sentences and more spaced blink frequencies are added, the blinking-only condition performs on par with, or even better than, the full multimodal model for some emotions and, in particular, with respect to participants' perceived arousal. This indicates that carefully designed, simple visual cues can be an effective affect channel for non-humanoid robots.

Abstract:
Thanks to their compliance and adaptability, soft actuators are promising devices for medical applications and the exploration of unstructured environments. However, their nonlinear behaviour, including strong hysteresis effects, presents challenges for accurate position control. This work investigates two data-driven feedforward control strategies for controlling the position of a Hyper-Elastic Ballooning Membrane Actuator (HBMA): a baseline single polynomial fit model and a hysteresis-aware Modified Prandtl-Ishlinskii (MPI) model. Comparative experiments demonstrate that hysteresis-aware control substantially improves accuracy. Specifically, incorporating hysteresis improved overall accuracy by 71% when the HBMA was inflated up to 20.5 mm. During partial inflation-deflation cycles, the MPI controller achieved a mean error of 0.685 mm, corresponding to 9.8% of the 7 mm displacement range. These results highlight the limitations of using feedforward control alone in soft robotic actuation while emphasising the benefits of hysteresis-aware modelling. The findings contribute to the ongoing effort to develop effective control strategies for soft robotic systems.
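
To make the hysteresis model concrete, the sketch below implements the classical Prandtl-Ishlinskii model as a weighted sum of play (backlash) operators. It is only an illustration of the underlying idea: the modified variant used in the work, its identified thresholds and weights, and the inversion needed for feedforward control are not reproduced here, and the numbers in the example are made up.

```python
def play_operator(x, prev, r):
    """Classical play (backlash) operator with threshold r and previous output prev."""
    return max(x - r, min(x + r, prev))

class PrandtlIshlinskii:
    """Weighted superposition of play operators (classical form, zero initial state)."""

    def __init__(self, thresholds, weights):
        self.thresholds = list(thresholds)
        self.weights = list(weights)
        self.states = [0.0] * len(self.thresholds)

    def step(self, x):
        """Advance every play operator with input x and return the model output."""
        self.states = [play_operator(x, s, r)
                       for s, r in zip(self.states, self.thresholds)]
        return sum(w * s for w, s in zip(self.weights, self.states))

# Hypothetical parameters: two operators with thresholds 0 and 1.
model = PrandtlIshlinskii(thresholds=[0.0, 1.0], weights=[1.0, 0.5])
rising = [model.step(x) for x in (0.0, 1.0, 2.0, 3.0)]
falling = [model.step(x) for x in (2.0, 1.0)]
print(rising)   # [0.0, 1.0, 2.5, 4.0]
print(falling)  # [3.0, 2.0]
```

The same input value of 2.0 yields 2.5 on the rising branch and 3.0 on the falling branch, which is the history dependence that a single polynomial fit cannot represent.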

Abstract:
We present a method for formal safety verification of learning-based generative motion planners. Generative motion planners (GMPs) offer advantages over traditional planners, but verifying the safety and dynamic feasibility of their outputs is difficult since neural network verification (NNV) tools scale only to a few hundred neurons, while GMPs often contain millions. To preserve GMP expressiveness while enabling verification, our key insight is to imitate the GMP by stabilizing references sampled from the GMP with a small neural tracking controller and then applying NNV to the closed-loop dynamics. This yields reachable sets that rigorously certify closed-loop safety, while the controller enforces dynamic feasibility. Building on this, we construct a library of verified GMP references and deploy them online in a way that imitates the original GMP distribution whenever it is safe to do so, improving safety without retraining. We evaluate across diverse planners, including diffusion, flow matching, and vision-language models, improving safety in simulation (on ground robots and quadcopters) and on hardware (differential-drive robot).

Abstract:
Executing banked turns at elevated speeds poses significant dynamic challenges for 2-DOF pendulum-driven spherical robots. A steady-state torque balance reveals that centripetal loading at high speeds limits feasible roll angles and demands increasingly aggressive pendulum actuation. We derive a closed-form expression for the required pendulum angle and integrate this into a bank-aware Command Augmentation System (CAS) and control law that automatically alters infeasible commands. Experimental tests on Texas A&M RAD Lab's RoboBall II platform demonstrate that the CAS-equipped bank controller enables stable bank maneuvers at speeds up to 6 rad/s (1.83 m/s), where previous controllers fail, by dynamically limiting roll commands based on velocity and internal pressure.

Abstract:
In haptics, guaranteeing stability is essential to ensure safe interaction with remote or virtual environments. One of the most relevant methods in the state of the art is the Time Domain Passivity Approach (TDPA). However, its high conservatism leads to a significant degradation of transparency. Moreover, the stabilizing action may conflict with the device's physical limitations. State-of-the-art solutions have attempted to address these actuator limits, but they still fail to account simultaneously for the power limits of each actuator while maximizing transparency. This work proposes a new damping limitation method based on prioritized dissipation actions. It prioritizes an optimal dissipation direction that minimizes actuator load, while any excess dissipation is allocated to the orthogonal hyperplane. The solution provides a closed-form formulation and is robust in multi-DoF scenarios, even in the presence of actuator and motion anisotropies. The method is experimentally validated using a parallel haptic interface interacting with a virtual environment and tested under different operating conditions.
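
For context, the classical single-DoF TDPA that this line of work starts from can be sketched as a passivity observer plus a passivity controller. The sketch assumes a sign convention in which a positive force-velocity product means energy is dissipated at the port; the function name and the numbers are illustrative. It does not implement the prioritized, actuator-limited damping allocation proposed in the abstract, and it applies unbounded damping, which is exactly what can clash with actuator limits.

```python
def tdpa_step(energy, force, velocity, dt):
    """One step of a classical time-domain passivity observer and controller.

    Returns (updated_energy, output_force, damping).
    """
    # Passivity observer: integrate the power flowing through the port.
    energy += force * velocity * dt
    damping = 0.0
    if energy < 0.0 and velocity != 0.0:
        # Passivity controller: choose damping that dissipates exactly the deficit,
        # since damping * velocity**2 * dt == -energy.
        damping = -energy / (dt * velocity ** 2)
        energy = 0.0
    return energy, force + damping * velocity, damping

# Hypothetical sample: the port generated energy, so damping is injected.
print(tdpa_step(0.0, -2.0, 1.0, 0.001))
```

Because the damping term grows as the velocity approaches zero, the commanded force can become very large, which motivates methods that bound or redistribute the dissipation.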

Abstract:
Dynamic interactive object search in large-scale human environments presents substantial challenges for existing methods. Current scene representations like 3D Scene Graphs (3DSG) only provide coarse-grained spatial segmentation and cannot identify functional areas such as storage or leisure areas. Without functional area understanding, existing methods are constrained to exhaustive sequential exploration at large scales, resulting in inefficient search behaviors, particularly in open-layout environments with numerous interactive objects such as drawers and cabinets. Moreover, these methods lack adaptability to environmental dynamics such as object relocations. To address these limitations, this paper proposes CMAR-search, a novel framework built upon Commonsense and Memory Augmented Reasoning (CMAR). Our approach leverages commonsense about area functionalities and aggregates environmental memory to construct a Functional 3D Scene Graph (F3DSG), which organizes the environment into functional areas with their associated containers. Through this structured representation, CMAR enables hierarchical action planning at both macro-area and micro-container levels, empowering the system to efficiently identify and inspect semantically relevant areas for effective object search. Notably, CMAR continuously integrates real-time perception, accumulated memory, and commonsense to dynamically relocalize objects in changing environments. Extensive experiments in simulation and real-world settings demonstrate that CMAR-search significantly surpasses state-of-the-art baselines in both success rate and search efficiency for object search in dynamic interactive environments.

Abstract:
Tactile perception is indispensable for robots to perform various manipulations dexterously, especially in contact-rich scenarios. However, despite the development of deep learning techniques, it suffers from training data scarcity and a time-consuming learning process in practical applications, since collecting a large amount of tactile data is costly and sometimes even impossible. Hence, we propose an automatic feature optimization-enabled prototypical network to realize meta learning, i.e., the AFOP-ML framework. As a "learning to learn" network, it not only adapts rapidly to new unseen classes from few shots, but also learns how to determine the optimal feature space automatically. Based on the four-channel signals acquired from a tactile finger, both shapes and materials are recognized. On a 36-category benchmark, it outperforms several existing approaches by attaining an accuracy of 96.08% in the 5-way-1-shot scenario, where only 1 example is available for training. Accuracy remains at 88.7% in the extreme 36-way-1-shot case. The generalization ability is further validated through three groups of experiments involving unseen shapes, materials, and force/speed perturbations. This work additionally provides insights for the interpretation of recognition tasks and the improved design of tactile sensors.
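
As a rough illustration of the prototypical-network idea mentioned in this abstract, the sketch below classifies a query by its nearest class prototype (the mean of the support embeddings of each class). It is a generic few-shot sketch, not the AFOP-ML implementation: the embeddings are hypothetical 2-D vectors standing in for the output of a learned feature extractor, and the function names are illustrative assumptions.

```python
from collections import defaultdict


def class_prototypes(support):
    """Compute one prototype per class as the mean of its support embeddings.

    support: list of (embedding, label) pairs; embeddings are equal-length lists.
    """
    grouped = defaultdict(list)
    for embedding, label in support:
        grouped[label].append(embedding)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Hypothetical 3-way-1-shot episode: one support embedding per class.
support_set = [([0.0, 0.0], "sphere"), ([5.0, 5.0], "cube"), ([0.0, 5.0], "cone")]
prototypes = class_prototypes(support_set)
predictions = [classify(q, prototypes) for q in ([0.2, -0.1], [4.8, 5.3], [0.5, 4.0])]
```

In an N-way-1-shot episode each prototype is simply the single support embedding of its class, which is why the quality of the feature space matters so much in that regime.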

Abstract:
Safe autonomous navigation requires reliable estimation of environmental traversability. Traditional methods have relied on semantic or geometry-based approaches with human-defined thresholds, but these methods often yield unreliable predictions due to the inherent subjectivity of human supervision. While self-supervised approaches enable robots to learn from their own experience, they still face a fundamental challenge: the positive-only learning problem. To address these limitations, recent studies have employed Positive-Unlabeled (PU) learning, where the core challenge is identifying positive samples without explicit negative supervision. In this work, we propose GSAT, which addresses these limitations by constructing a positive hypersphere in latent space to classify traversable regions through anomaly detection without requiring additional prototypes (e.g., unlabeled or negative). Furthermore, our approach employs joint learning of anomaly classification and traversability prediction to more efficiently utilize robot experience. We comprehensively evaluate the proposed framework through ablation studies, validation on heterogeneous real-world robotic platforms, and autonomous navigation demonstrations in simulation environments. Our method is available at https://sparolab.github.io/research/gsat/.
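
To make the "positive hypersphere" idea concrete, the following sketch fits a sphere around embeddings of positive (traversed) samples and flags anything outside it as anomalous. This is a generic one-class sketch under stated assumptions, not the GSAT method: the center is taken as the mean of the positive embeddings, the radius as a distance quantile, and all names and the toy 2-D embeddings are hypothetical.

```python
import math


def fit_hypersphere(positive_embeddings, quantile=0.95):
    """Fit a center and radius that enclose most positive embeddings.

    Assumption: center = mean embedding, radius = chosen quantile of the
    distances from the positives to that center.
    """
    count = len(positive_embeddings)
    dim = len(positive_embeddings[0])
    center = [sum(e[i] for e in positive_embeddings) / count for i in range(dim)]
    distances = sorted(math.dist(e, center) for e in positive_embeddings)
    index = min(count - 1, int(quantile * count))
    return center, distances[index]


def is_traversable(embedding, center, radius):
    """Treat embeddings inside the hypersphere as traversable, others as anomalies."""
    return math.dist(embedding, center) <= radius


# Hypothetical latent features from terrain the robot has already driven over.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
center, radius = fit_hypersphere(positives)
inside = is_traversable((0.6, 0.4), center, radius)
outside = is_traversable((3.0, 3.0), center, radius)
```

Only positive samples are needed to fit the sphere, which is the property that makes this family of methods attractive when negative labels are unavailable.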

Abstract:
Continuous stability, as one of the core modules of the autopilot system, is particularly important for its performance. However, as the vehicle speed increases, the system positioning error may be amplified, consequently introducing deviations in the positioning consistency of the system. The inherent high speeds and motion constraints in highway environments introduce new challenges for feature matching, particularly in vision-based vehicle localization, where initialization and scale estimation biases are further expanded. Lane markings, characterized by their simple and uniform structures and high distinctiveness from the surrounding environment, serve as effective features for matching-based localization in autonomous driving. This paper introduces a high-precision and robust vehicle localization method based on lane model constraints. Initially, leveraging lane model parameters from prior maps, we track and model lane line detections across consecutive frames to enhance the completeness of lane representation. The tracking results, combined with prior map data on lane widths, are employed to optimize scale parameters. Subsequently, real-time detected lanes are matched with prior maps through point-map association to constrain the vehicle's heading angle. Finally, map matching results are integrated into existing visual local odometry methods to perform real-time localization optimization, thereby improving localization performance. Experimental evaluations conducted on a self-collected highway dataset demonstrate that the incorporation of lane models significantly enhances system localization accuracy.

Abstract:
In domestic environments, assigning scene semantic labels (scene recognition) to each node of a topological semantic map is an important task. Given the limitations of current scene recognition methods in efficiency and accuracy for service robots, this paper proposes a scene recognition method based on a policy model. Considering the similarity of images captured from adjacent nodes and the low-quality images caused by the uncertain node position and observation direction of the robot, we develop a policy model using a deep Q-learning network (DQN). This model enhances accuracy and efficiency by deciding whether to (1) inherit the scene type from the preceding node without re-recognition or (2) adjust the robot's observation angle to capture a more informative image. A rule-based reward function integrated with a scene score model enables simultaneous learning of similarity assessment and viewpoint adjustment policies. Furthermore, a training strategy based on generated paths is proposed to provide sufficient data for training the policy model. Extensive comparative experiments in simulated environments demonstrate that our method surpasses state-of-the-art approaches in both recognition accuracy and efficiency. Deployment on a mobile robot confirms its practical efficacy, achieving precise and efficient scene recognition across diverse real-world environments.

Abstract:
Enabling robots to interact effectively with the real world requires extensive learning from physical interaction data, making simulation crucial for generating such data safely and cost-effectively. Despite the advantages of simulation, manual environment creation remains a laborious process, motivating the development of automated generation approaches. However, current automatic virtual scene generation approaches fall short in bridging the sim-to-real gap and achieving task readiness, which calls for automatically generated, realistic, and task-ready virtual scenes. In this paper, we propose GAIA, a novel methodology to automatically generate interactive, task-ready simulation environments grounded in real contexts from only a single RGB image and a task instruction. GAIA utilizes a pre-trained Vision-Language Model (VLM) without requiring explicit training, and jointly understands the visual context and the user's instruction. Based on this understanding, it infers and places necessary task-aware objects, including unseen ones, to construct an interactive virtual environment that maintains real-scene fidelity while reflecting task requirements without additional manual setup. Qualitative experiments show that GAIA generates spaces consistent with user instructions, and quantitative results show that policies learned within these GAIA-generated environments successfully transfer to target environments. Source code and supplementary materials are available at our project page https://sites.google.com/view/gaia-project-page.

Abstract:
This letter presents the design and analysis of a compact magnetic field generator for robotic navigation of magnetic endovascular instruments in clinical settings. The system features eight symmetrically arranged permanent magnets in a ring configuration, maximizing magnetic field uniformity and aperture size for X-ray transmission while minimizing spatial footprint and magnet motion for uninterrupted fluoroscopy imaging. The rationale behind the design of the system is explained through analytical considerations of magnetic field distribution. In vitro demonstrations inside perfused biomimetic phantoms confirm the system's capability for 3D steering of flow-driven magnetic microcatheters, opening a path for pre-clinical testing.

Abstract:
Safety is a critical concern in motion planning for autonomous vehicles. Modern autonomous vehicles rely on neural network-based perception, but making control decisions based on these inference results poses significant safety risks due to inherent uncertainties. To address this challenge, we present a distributionally robust optimization (DRO) framework that accounts for both aleatoric and epistemic perception uncertainties using evidential deep learning (EDL). Our approach introduces a novel ambiguity set formulation based on evidential distributions that dynamically adjusts the conservativeness according to perception confidence levels. We integrate this uncertainty-aware constraint into model predictive control (MPC), proposing the DRO-EDL-MPC algorithm with computational tractability for autonomous driving applications. Validation in the CARLA simulator demonstrates that our approach maintains efficiency under high perception confidence while enforcing conservative constraints under low confidence.

Abstract:
Large Language Models (LLMs) have demonstrated impressive performance across various domains, including code generation and problem solving. However, their application in robotic control, particularly in low-level tasks that require precise manipulation, real-time feedback, and environment-dependent execution, remains limited. To address this challenge, we propose the Closed-Loop Modular Code Synthesizer framework. This framework leverages a pre-trained LLM without any task-specific fine-tuning to perform modular code planning and generation, and iteratively executes the generated code while inserting debugging probes to observe its behavior. This closed-loop structure facilitates systematic debugging and refinement, ultimately producing executable control programs. We apply the proposed framework to the calibration of an RGB-D camera and a robotic arm, validating its effectiveness in real-world settings. Furthermore, through a subsequent pick-and-place task, we demonstrate not only the accuracy of the calibration but also the potential extensibility of the framework. Across both tasks, the framework achieved high execution accuracy and autonomy, illustrating the practicality and scalability of LLM-based robotic control using our framework.

Abstract:
Multi-robot systems can greatly enhance efficiency through coordination and collaboration, yet in practice, full-time communication is rarely available and interactions are constrained to close-range exchanges. Existing methods either maintain all-time connectivity, rely on fixed schedules, or adopt pairwise protocols, but none adapt effectively to dynamic spatio-temporal task distributions under limited communication, resulting in suboptimal coordination. To address this gap, we propose CoCoPlan, a unified framework that co-optimizes collaborative task planning and team-wise intermittent communication. Our approach integrates a branch-and-bound architecture that jointly encodes task assignments and communication events, an adaptive objective function that balances task efficiency against communication latency, and a communication event optimization module that strategically determines when, where, and how the global connectivity should be re-established. Extensive experiments demonstrate that it outperforms state-of-the-art methods by achieving a 22.4% higher task completion rate, reducing communication overhead by 58.6%, and improving scalability by supporting up to 100 robots in dynamic environments. Hardware experiments include a complex 2D office environment and a large-scale 3D disaster-response scenario.

Abstract:
Deploying semantic segmentation models for autonomous forklifts in industrial environments is challenging because visual conditions vary across sites, leading to poor cross-domain generalization and costly re-annotation efforts. We propose a curriculum-based domain adaptation framework that progressively transfers a segmentation model from simulation to real-world industrial deployment. The model is first pretrained on synthetic datasets with increasing complexity, then fine-tuned on a labeled real source domain to reduce the sim-to-real gap and adapt to camera-specific characteristics. Finally, it is adapted to a new target domain using pseudo-label-based self-training. To reduce drift during target adaptation, pseudo-labeled target samples are combined with labeled samples from the source-real domain, while a replay buffer improves robustness to class imbalance by oversampling rare classes. Preliminary experiments with DDRNet demonstrate improved performance under both moderate and hard domain shifts, with mIoU gains from 67.37 to 71.36 and from 49.57 to 57.22, respectively. The results highlight the potential of progressive multi-domain adaptation for scalable industrial robotic perception.

Abstract:
Tendon-driven robotic hands face a fundamental Triple Trade-Off among high force control bandwidth, low mechanical impedance, and precise force estimation. This extended abstract presents a novel variable-stiffness tension sensor designed to overcome these conflicting requirements. By integrating a nonlinear spring mechanism into the tendon routing path, the sensor dynamically modulates its physical stiffness according to the applied tension while simultaneously providing high-resolution force feedback. Experimental results confirm that the system's force control bandwidth dynamically increases from 12.70 Hz to 19.53 Hz as the tendon tension scales up. Furthermore, the feasibility of the system was validated on a 3-DoF tendon-driven robotic finger, successfully demonstrating the sensor's high sensitivity by delicately actuating a 50 gf mechanical keyboard switch at low tensions.

Abstract:
Advanced control algorithms are essential for enhancing the functionality of prosthetic hands, enabling them to operate in diverse conditions. This paper presents a Model Reference Adaptive Controller (MRAC) developed for a tendon-driven soft continuum wrist, integrated into the PRISMA HAND II prosthetic hand. The primary objective of our research is to design an adaptive controller that facilitates wrist movements, eliminating external disturbances while minimizing computational requirements. To achieve this, kinematic and dynamic models of the wrist are developed based on the Piece-wise Constant Curvature (PCC) hypothesis. The controller consists of a reference model generated using the PCC model, and state errors are evaluated by comparing the responses of the reference model to those of the wrist model. These errors are reduced using the MRAC approach to make the wrist's behavior closely align with that of the reference model. Stability of the closed-loop system is ensured using the Lyapunov direct method, along with the New Theorem of Stability, a replacement for Barbalat's lemma, ensuring that the error between the reference model and the actual system converges to zero and that the adaptive gains stabilize to fixed values.

Abstract:
Transferability, along with sample efficiency, is a critical factor for a reinforcement learning (RL) agent's successful application in real-world contact-rich manipulation tasks, such as product assembly. For instance, in the case of the industrial insertion task on high-mix, low-volume (HMLV) production lines, transferability could eliminate the need for machine retooling, thus reducing production line downtimes. In our work, we introduce a method called Multimodal Variational DeepMDP (MVDeepMDP) that demonstrates the ability to generalize to various environmental variations not encountered during training. The key feature of our approach involves learning a multimodal latent dynamic representation. We demonstrate the effectiveness of our method in the context of an electronic parts insertion task, which is challenging for RL agents due to the diverse physical properties of the non-standardized components, as well as simple 3D-printed blocks insertion. Furthermore, we evaluate the transferability of MVDeepMDP and analyze the impact of the balancing mechanism of the generalized Product of Expert, which is used to combine observable modalities. Finally, we explore the influence of separately processing state modalities of different physical quantities, such as pose and 6D force/torque (F/T) data.
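
The generalized Product of Experts mentioned in this abstract can be illustrated for the simple case of one-dimensional Gaussian experts, where fusion reduces to a weighted, precision-based average. This is a minimal sketch of the standard Gaussian formula, not the MVDeepMDP implementation: the per-expert weights stand in for the "balancing mechanism", and the function name and example values are assumptions for illustration.

```python
def generalized_poe(means, variances, weights):
    """Fuse 1-D Gaussian experts into a single Gaussian.

    Each expert contributes its precision (1 / variance) scaled by its weight;
    a weight of 0 removes the expert, and equal weights of 1 give the plain
    Product of Experts.
    """
    precisions = [w / v for w, v in zip(weights, variances)]
    total_precision = sum(precisions)
    fused_mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    fused_variance = 1.0 / total_precision
    return fused_mean, fused_variance


# Hypothetical latent estimates from two modalities (e.g. pose and force/torque).
equal_mean, equal_var = generalized_poe([0.0, 2.0], [1.0, 1.0], [1.0, 1.0])
pose_only_mean, pose_only_var = generalized_poe([0.0, 2.0], [1.0, 1.0], [1.0, 0.0])
```

With equal weights the fused estimate sits between the two experts and is more confident than either; setting a weight to zero falls back to the remaining modality.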

Abstract:
The Concentric Tube Robot (CTR) has great promise for minimally invasive surgery. However, accurately modeling nonlinear and history-dependent behaviors remains a significant challenge. This paper proposes a learning-based forward and inverse kinematics model that accounts for the history dependence and nonlinearities of CTR, including the snapping behavior. A lightweight LSTM-MLP hybrid neural network with an input buffer and directional parameters was used to train forward and inverse kinematics models for a 4-degree-of-freedom (DOF) CTR. The model was validated by comparing its predictions with actual values and results from a conventional torsional-compliant model (TCM) across random points, rotational trajectories, and arbitrary paths. This validation successfully demonstrated the model's ability to capture snapping behavior. For forward kinematics, the model achieved a Root Mean Square Error (RMSE) of 0.69 mm and 0.16° with a computation time of 0.831±0.200 ms. The inverse kinematics model achieved an RMSE of 1.22 mm and 2.46° with a computation time of 0.816±0.200 ms. The proposed method improves the accuracy and speed of kinematic modeling by capturing nonlinear behaviors, such as snapping and hysteresis. The lightweight system ensures accurate real-time control and offers a safer and more reliable solution for microsurgical applications.

Abstract:
Cutting-edge robot learning techniques, including foundation models and imitation learning from humans, all place heavy demands on large-scale, high-quality datasets, which constitute one of the bottlenecks in general intelligent robotics. This paper presents the Kaiwu multimodal dataset to address the lack of real-world synchronized multimodal data in sophisticated assembly scenarios, especially data with dynamics information and fine-grained labelling. The dataset first provides an integrated human, environment, and robot data collection framework with 20 subjects and 30 interaction objects, resulting in a total of 11,664 instances of integrated actions. For each demonstration, hand motions, operation pressures, sounds of the assembly process, multi-view videos, high-precision motion capture information, eye gaze with first-person videos, and electromyography signals are all recorded. Fine-grained multi-level annotation based on absolute timestamps and semantic segmentation labelling are performed. The Kaiwu dataset aims to facilitate research on robot learning, dexterous manipulation, human intention investigation, and human-robot collaboration.

Abstract:
In cluttered scenes with inevitable occlusions and incomplete observations, selecting informative viewpoints is essential for building a reliable representation. In this context, 3D Gaussian Splatting (3DGS) offers a distinct advantage, as it can explicitly guide the selection of subsequent viewpoints and then refine the representation with new observations. However, existing approaches rely solely on geometric cues, neglect manipulation-relevant semantics, and tend to prioritize exploitation over exploration. To tackle these limitations, we introduce an instance-aware Next Best View (NBV) policy that prioritizes underexplored regions by leveraging object features. Specifically, our object-aware 3DGS distills instance-level information into one-hot object vectors, which are used to compute confidence-weighted information gain that guides the identification of regions associated with erroneous and uncertain Gaussians. Furthermore, our method can be easily adapted to an object-centric NBV, which focuses view selection on a target object, thereby improving reconstruction robustness to object placement. Experiments demonstrate that our NBV policy reduces depth error by up to 77.14% on the synthetic dataset and 34.10% on the real-world GraspNet dataset compared to baselines. Moreover, compared to targeting the entire scene, performing NBV on a specific object yields an additional reduction of 25.60% in depth error for that object. We further validate the effectiveness of our approach through real-world robotic manipulation tasks.

Abstract:
High-risk applications in robotics, such as robot-assisted surgery, present unique challenges. These systems must be both highly precise and interpretable in order to be deployed in environments with very low tolerance for error or unsafe exploration. We present the first robotic system to demonstrate autonomous clip positioning on a physical phantom in laparoscopic surgery, one of the most common interventions in general surgery. After segmentation of a colorless point cloud from a single camera, target positions for the clips are extracted using spline interpolation, and can then be adjusted by the human operator. The segmentation model is trained on only 60 hand-labeled real point clouds, reflecting data scarcity in the surgical domain. We overcome this with a combination of pre-training on 128,000 synthetic point clouds and two novel data augmentation techniques. The motion of the end-effector to each target is visualized for the operator, satisfying the unique motion constraints of minimally invasive surgery while ensuring that the robot's actions are verifiable and interpretable. In real robot experiments, our system localizes targets with the required precision of 0.75 mm at a 95% success rate and executes autonomous clip positioning with a 100% success rate. We provide insights that are applicable to many other surgical and non-surgical tasks that require identifying and navigating to a precise target. Our source code is available at https://github.com/balazsgyenes/kirurc

Abstract:
Additive manufacturing offers extensive design freedom but remains limited by path planning strategies that rely on planar slicing, which introduce staircase artifacts. Non-planar slicing improves local fidelity yet still produces stacking artifacts due to exposed layer boundaries, leaving a gap in capturing complex geometries. This work proposes a Volumetric Envelope Generation Algorithm (VEGA) that generates geometry-aware enveloping layers through a buffering-erosion process. By introducing a Buffer Restraint Region (BRR), the method enables control over incorporation mode and layer positioning. Printability-based splitting further ensures feasible print paths for fabrication. Experiments were conducted on planar- and non-planar-base geometries, printed with a custom 3D printing robot. Printed models were scanned during evaluation, showing reductions of 68.5% in volumetric error, 69.1% in surface deviation, and 77.9% in chamfer distance relative to planar slicing, achieved without additional computational cost (on the order of 2 s per model) or print length. These results demonstrate that enveloping-based path planning effectively mitigates artifacts inherent to slicing-based approaches, providing a strategy for high-fidelity, reliable fabrication of complex geometries.

Abstract:
This paper presents a model-based reinforcement learning (MBRL) approach with a receding horizon mechanism to optimize the lateral trajectory-tracking performance of autonomous vehicles (AVs). Accurate modeling of complex vehicle dynamics and adaptation to dynamic environments with limited data pose significant challenges for MBRL in AV control. To address these challenges, we propose sample-efficient algorithms that leverage autoregressive modeling to adapt from limited data while managing complex vehicle dynamics. Unlike traditional methods reliant on fixed models, our approach uses the temporal reasoning of autoregressive (AR) models to compensate for the residual dynamics, which effectively approximates the local effects of nonlinearities and disturbances. Integrated with real-time sensor data, the residual generation model is continuously refined via incremental learning in a closed-loop framework, enhancing adaptability. This architecture, combining physical modeling with data-driven residuals, maintains interpretability and improves responsiveness in complex scenarios. CarSim simulations demonstrate superior performance over other state-of-the-art learning-based predictive controllers and classical methods for AV lateral control. Real-world validation on a HongQi electric vehicle (HQEV) confirms the algorithm's effectiveness, showing significant improvements over classical model predictive control (MPC). This approach holds substantial potential for advanced driver-assistance systems (ADAS) and fully autonomous driving, enabling precise control under diverse conditions.

Abstract:
Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.

Abstract:
With advances in reinforcement learning and imitation learning, quadruped robots can acquire diverse skills within a single policy by imitating multiple skill-specific datasets. However, the lack of datasets on complex terrains limits the ability of such multi-skill policies to generalize effectively in unstructured environments. Inspired by animation, we adopt keyframes as minimal and universal skill representations, relaxing dataset constraints and enabling the integration of terrain adaptability with skill diversity. We propose Keyframe Guided Self-Imitation for Robust and Adaptive Skill Learning (KiRAS), an end-to-end framework for acquiring and transitioning between diverse skill primitives on complex terrains. KiRAS first learns diverse skills on flat terrain through keyframe-guided self-imitation, eliminating the need for expert datasets; then continues training the same policy network on rough terrains to enhance robustness. To eliminate catastrophic forgetting, a proficiency-based Skill Initialization Technique is introduced. Experiments on Solo-8 and Unitree Go1 robots show that KiRAS enables robust skill acquisition and smooth transitions across challenging terrains. This framework demonstrates its potential as a lightweight platform for multi-skill generation and dataset collection. It further enables flexible skill transitions that enhance locomotion on challenging terrains.

Abstract:
Legged robots offer significant potential for navigating complex industrial terrains, but their capabilities are often constrained by perception systems struggling to interpret intricate 3D geometry. Conventional 2D/2.5D representations like depth or elevation maps fail to capture complex 3D geometry, leading to unsafe locomotion. This paper presents SURF-Loco, a novel framework that enables robust legged locomotion by learning directly from a 3D surfel-based model. Our approach uses surfels to create an omnidirectional representation that explicitly encodes the geometric properties necessary for stable locomotion. We integrate this structured 3D representation into an end-to-end Mixture-of-Experts (MoE) reinforcement learning policy. A variational autoencoder (VAE) distills the complex 3D surroundings into a compact latent context. This geometric context enables a gating network to dynamically select expert sub-policies for agile, context-aware actions. We validate our method on hexapod robots, achieving robust zero-shot sim-to-real transfer on a variety of challenging industrial obstacles.
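
A gating network over expert sub-policies, as described in this abstract, is commonly realized as a softmax-weighted blend of expert outputs. The sketch below shows that generic Mixture-of-Experts step only; whether SURF-Loco blends experts softly or selects one hard is not stated in the abstract, so the soft blend, the function names, and the toy two-joint actions are all assumptions.

```python
import math


def softmax(logits):
    """Convert gate logits to weights that are positive and sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_action(gate_logits, expert_actions):
    """Blend expert actions using softmax gate weights.

    gate_logits: one score per expert (hypothetically produced from a latent
    terrain context); expert_actions: one action vector per expert.
    """
    weights = softmax(gate_logits)
    dim = len(expert_actions[0])
    return [sum(w * a[i] for w, a in zip(weights, expert_actions)) for i in range(dim)]


# Two hypothetical experts proposing different joint targets.
experts = [[1.0, 0.0], [0.0, 1.0]]
balanced = mixture_action([0.0, 0.0], experts)
dominated = mixture_action([10.0, -10.0], experts)
```

When one gate logit dominates, the blended action approaches that expert's output, so a soft gate can still behave like a near-discrete selector.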

Abstract:
In order to flexibly act in an everyday environment, a robotic agent needs a variety of cognitive capabilities that enable it to reason about plans and perform execution recovery. Large language models (LLMs) have been shown to demonstrate emergent cognitive aspects, such as reasoning and language understanding; however, the ability to control embodied robotic agents requires reliably bridging high-level language to low-level functionalities for perception and control. In this paper, we investigate the extent to which an LLM can serve as a core component for planning and execution reasoning in a cognitive robot architecture. For this purpose, we propose a cognitive architecture in which an agentic LLM serves as the core component for planning and reasoning, while components for working and episodic memories support learning from experience and adaptation. An instance of the architecture is then used to control a mobile manipulator in a simulated household environment, where environment interaction is done through a set of high-level tools for perception, reasoning, navigation, grasping, and placement, all of which are made available to the LLM-based agent. We evaluate our proposed system on two household tasks (object placement and object swapping), which evaluate the agent's reasoning, planning, and memory utilisation. The results demonstrate that the LLM-driven agent can complete structured tasks and exhibits emergent adaptation and memory-guided planning, but also reveal significant limitations, such as hallucinations about the task success and poor instruction following by refusing to acknowledge and complete sequential tasks. These findings highlight both the potential and challenges of employing LLMs as embodied cognitive controllers for autonomous robots.

Abstract:
Containerization and orchestration using cloud-native technologies enable scalable deployment of robotic software. Integrating ROS 2 with Kubernetes offers a flexible infrastructure, but also introduces a complex, multi-layered communication stack - from DDS middleware to container networks and the physical layer. Each layer adds overhead and variability that impact application-level performance. This paper presents a comprehensive analysis of communication performance across the cloud-edge-robot continuum, focusing on throughput and one-way latency in scalable ROS 2 deployments. We evaluate communication across intra-robot, edge, and cloud segments using wired and wireless connections, including emerging technologies like Wi-Fi 7 and high-speed LAN. Using a Kubernetes-based testbed, we investigate various ROS 2 middlewares, CNI plugins, QoS configurations, and encryption options. Our experiments reveal the impact of network overlays, routing paths, and middleware choices on latency and bandwidth. Despite the inherent complexity, the results confirm the feasibility of deploying ROS 2 in orchestrated, scalable environments. We summarize key insights as practical takeaways, many of which apply beyond Kubernetes, to guide the design of robust cloud/edge robotic systems.

Abstract:
Robotic-assisted percutaneous coronary intervention (PCI) is constrained by the inherent limitations of 2D Digital Subtraction Angiography (DSA). Unlike physicians, who can directly manipulate guidewires and integrate tactile feedback with their prior anatomical knowledge, teleoperated robotic systems must rely solely on 2D projections. This mode of operation, simultaneously lacking spatial context and tactile sensation, may give rise to projection-induced ambiguities at vascular bifurcations. To address this challenge, we propose a two-stage framework (SCAR-UNet-GAT) for real-time robotic path planning. In the first stage, SCAR-UNet, a spatial-coordinate-attention-regularized U-Net, is employed for accurate coronary vessel segmentation. The integration of multi-level attention mechanisms enhances the delineation of thin, tortuous vessels and improves robustness against imaging noise. From the resulting binary masks, vessel centerlines and bifurcation points are extracted, and geometric descriptors (e.g., branch diameter, intersection angles) are fused with local DSA patches to construct node features. In the second stage, a Graph Attention Network (GAT) reasons over the vessel graph to identify anatomically consistent and clinically feasible trajectories, effectively distinguishing true bifurcations from projection-induced false crossings. On a clinical DSA dataset, SCAR-UNet achieved a Dice coefficient of 93.1%. For path disambiguation, the proposed GAT-based method attained a success rate of 95.0% and a target-arrival success rate of 90.0%, substantially outperforming conventional shortest-path planning (60.0% and 55.0%) and heuristic-based planning (75.0% and 70.0%). Validation on a robotic platform further confirmed the practical feasibility and robustness of the proposed framework.
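
For readers unfamiliar with the segmentation metric reported above, the Dice coefficient of two binary masks A and B is 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name and the flat 0/1 mask layout are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A predicted mask with two foreground pixels, one of which overlaps a one-pixel ground truth, scores 2·1/(2+1) ≈ 0.667.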

Abstract:
Locating and identifying objects in vision-denied environments is a critical challenge for intelligent robot systems. To address the limitation of vision, we present a tactile-only method for object search and recognition using custom tactile skin sensors on robot hands. The method involves searching for an object in a vision-denied environment with a tactile hide-and-seek strategy. Upon contact, the system employs a novel two-phase classification process: an initial single-handed classification by pushing the object, followed by a two-handed verification stage that incorporates size measurement to confirm the object's identity and reduce critical errors. To support this approach, we introduce the HAS (Hide-and-Seek) dataset, a large-scale, multimodal tactile dataset of 1.1 million samples collected on custom sensor hardware. Our system achieves an object classification accuracy of 91.1% and a weight classification accuracy of 83.1% on the HAS dataset, with a strict joint accuracy of 79.6%. The full online pipeline attains a 61.4% success rate in real-world identification, with the bimanual verification stage further correcting up to 17.6% of single-hand errors. Comprehensive ablation studies validate the contribution of individual sensor modalities and demonstrate the effectiveness of our tactile-only method for autonomous operation in a non-vision environment. Our project page is available at tactile-hide-and-seek.github.io.

Abstract:
Understanding risk in autonomous driving requires not only perception and prediction, but also high-level reasoning about agent behavior and context. Current Vision Language Model (VLM)-based methods primarily ground agents in static images and provide qualitative judgments, lacking the spatiotemporal reasoning needed to capture how risks evolve over time. To address this gap, we propose NuRisk, a comprehensive Visual Question Answering (VQA) dataset comprising 2.9K scenarios and 1.1M agent-level samples, built on real-world data from nuScenes and Waymo, complemented with safety-critical scenarios from the CommonRoad simulator. The dataset provides Bird's-eye view (BEV) based sequential images with quantitative, agent-level risk annotations, enabling spatiotemporal reasoning. We benchmark well-known VLMs across different prompting techniques and find that they fail to perform explicit spatiotemporal reasoning, resulting in a peak accuracy of 33% at high latency. To address these shortcomings, our fine-tuned 7B VLM agent improves accuracy to 41% and reduces latency by 75%, demonstrating explicit spatiotemporal reasoning capabilities that proprietary models lacked. While this represents a significant step forward, the modest accuracy underscores the profound challenge of the task, establishing NuRisk as a critical benchmark for advancing spatiotemporal reasoning in autonomous driving. More information can be found at https://github.com/TUM-AVS/NuRisk.

Abstract:
Scene Change Detection (SCD) is a critical task for building smart cities, yet its practical application faces dual challenges: existing methods typically rely on temporal conditions present in the training data and the ideal assumption of small viewpoint differences. Consequently, they struggle to handle the common and significant viewpoint variations in real-world scenarios and exhibit strong sensitivity to temporal conditions, leading to drastic performance degradation under unseen temporal settings. To address these challenges, we propose the Fusion-Refinement Change Detection Network (FR-CDNet). By modeling correspondences between objects and preserving spatial prior information from ideally aligned scenes during the disentangled processing of different temporal directions, our network achieves a unified handling of varying degrees of viewpoint variations and different temporal conditions---a capability existing methods lack. Furthermore, FR-CDNet can automatically distinguish the temporal attribution of change entities to better support downstream tasks. To better evaluate performance in real-world settings, we further construct the URSCD dataset, which includes larger viewpoint differences and more diverse change scenarios. Extensive experiments demonstrate the universal scene detection capability of our method: it achieves significant improvement in F1-score on unaligned scenes while maintaining performance comparable to SOTA on aligned scenes. Ablation studies further demonstrate that the proposed framework can be migrated to enhance various mainstream models, effectively eliminating temporal condition dependency while improving overall performance.

Abstract:
Robotic manipulation of granular objects is crucial in various fields, yet modeling their complex dynamics and diverse physical properties remains challenging. Simulation plays an important role in learning robotic manipulation policies, but it is difficult to accurately model the complex dynamics and physical properties given only visual observations. The difficulty is further compounded in tasks involving intricate contact mechanisms, particularly when using tools with complex shapes like spoons to interact with granular objects, resulting in a significant sim-to-real gap. To address this, we introduce a novel task of scooping all objects out of a storage container using a spoon, which requires sophisticated modeling of multi-object interactions. We propose a unified framework that combines rich simulation data with a small amount of real-world data. Rather than optimizing physical parameters in simulation, we learn a graph-based neural dynamics model in simulation and fine-tune it on real-world data. We then employ a Monte-Carlo Tree Search (MCTS)-based planner to accomplish long-horizon decision-making. Our system successfully scoops out three types of objects, demonstrating its potential for real-world applications. This work highlights the benefits of leveraging both simulation and real-world data to tackle the sim-to-real gap in contact-rich manipulation tasks.

Abstract:
Modular small-scale robots offer the potential for on-demand assembly and disassembly, enabling task-specific adaptation in dynamic and constrained environments. However, existing modular magnetic platforms often depend on workspace collisions for reconfiguration, employ bulky three-dimensional electromagnetic systems, and lack robust single-module control, which limits their applicability in biomedical settings. In this work, we present a modular magnetic millirobotic platform comprising three cube-shaped modules with embedded permanent magnets, each designed for a distinct functional role: a free module that supports self-assembly and reconfiguration, a fixed module that enables flip-and-walk locomotion, and a gripper module for cargo manipulation. Locomotion and reconfiguration are actuated by programmable combinations of time-varying two-dimensional uniform and gradient magnetic field inputs. Experiments demonstrate closed-loop navigation using real-time vision feedback and A* path planning, establishing robust single-module control capabilities. Beyond locomotion, the system achieves self-assembly, multimodal transformations, and disassembly at low field strengths. Chain-to-gripper transformations succeeded in 90% of trials, while chain-to-square transformations were less consistent, underscoring the role of module geometry in reconfiguration reliability. These results establish a versatile modular robotic platform capable of multimodal behavior and robust control, suggesting a promising pathway toward scalable and adaptive task execution in confined environments.

Abstract:
Reliable quadrotor control in dynamic environments remains challenging due to external disturbances and internal uncertainties. While Model Predictive Path Integral (MPPI) control enables agile maneuvers through sampling-based optimization, its performance often degrades under such unmodeled uncertainties, leading to brittle and unsafe behavior. To address this, we propose SA-MPPI, a novel robust MPC framework that integrates asynchronous guidance with a novel open-loop sensitivity metric. The asynchronous module leverages a slower auxiliary controller to generate an informed sampling distribution, improving convergence without introducing latency. The sensitivity metric penalizes high-variance trajectories under sampled disturbances via nested Monte Carlo rollouts, embedding robustness directly into the optimization. Extensive simulations and real-world quadrotor experiments demonstrate that SA-MPPI outperforms adaptive baselines, reducing tracking errors by up to 47% under significant wind disturbances while achieving over 2× higher computational efficiency. These results highlight SA-MPPI's ability to deliver low-latency, safe, and predictable control in uncertain, dynamic environments.
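
As background for the sampling-based optimization mentioned above, the sketch below shows the cost-weighted averaging step of plain MPPI: each sampled control perturbation is weighted by a softmin of its rollout cost, and the nominal control sequence is shifted by the weighted mean perturbation. It is an illustration under stated assumptions only. It does not implement SA-MPPI's asynchronous guidance or sensitivity metric, and the function names, scalar controls, and `temperature` parameter are hypothetical choices.

```python
import math

def mppi_weights(costs, temperature):
    """Softmin importance weights over sampled rollout costs."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    best = min(costs)  # subtract the minimum cost for numerical stability
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    norm = sum(unnormalized)
    return [w / norm for w in unnormalized]

def mppi_update(nominal, perturbations, costs, temperature=1.0):
    """Shift the nominal control sequence by the cost-weighted mean perturbation.

    nominal:       list of scalar controls, one per time step
    perturbations: one list per sample, same length as nominal
    costs:         total rollout cost of each sample
    """
    weights = mppi_weights(costs, temperature)
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, perturbations))
        for t, u in enumerate(nominal)
    ]
```

Low-cost samples dominate the update; a smaller `temperature` makes the average more selective, while a larger one approaches a plain mean over samples.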

Abstract:
Scene understanding via multi-modal large language models and scene forecasting with world models have advanced the development of autonomous driving. The former maps visual inputs to driving-specific outputs, neglecting spatial reasoning and world dynamics. The latter captures world dynamics, lacking comprehensive scene understanding. In contrast, human drivers seamlessly integrate understanding, forecasting, and decision-making through multi-modal representations. To this end, we propose OccLLaMA, a unified occupancy-language-action world model to enhance motion planning via multi-task learning. It uses semantic occupancy as a unified 3D visual representation, effectively integrating spatial scene understanding and forecasting. Specifically, we first introduce a tailored scene tokenizer that auto-encodes semantic occupancy into latent tokens for invertible compression. Furthermore, we enhance LLaMA to enable joint learning across both understanding and generation tasks within a unified auto-regressive framework, incorporating multi-task pretraining and motion-planning-oriented fine-tuning. Extensive experiments demonstrate that OccLLaMA not only achieves competitive performance on scene understanding and occupancy forecasting, but also enhances motion planning by integrating multi-task inference, showcasing its effectiveness and potential as a foundation model for autonomous driving. Project page: https://vilonge.github.io/OccLLaMA_Page/

Abstract:
Drone swarm performances---synchronized, expressive aerial displays set to music---have emerged as a captivating application of modern robotics. Yet designing smooth, safe choreographies remains a complex task requiring expert knowledge. We present SwarmGPT, a language-based choreographer that leverages the reasoning power of large language models (LLMs) to streamline drone performance design. The LLM is augmented by a safety filter that ensures deployability by making minimal corrections when safety or feasibility constraints are violated. By decoupling high-level choreographic design from low-level motion planning, our system enables non-experts to iteratively refine choreographies using natural language without worrying about collisions or actuator limits. We validate our approach through simulations with swarms of up to 200 drones and real-world experiments with up to 20 drones performing choreographies to diverse types of songs, demonstrating scalable, synchronized, and safe performances. Beyond entertainment, this work offers a blueprint for integrating foundation models into safety-critical swarm robotics applications.

Abstract:
Laser vision sensors (LVS) are critical perception modules for industrial robots, facilitating real-time acquisition of workpiece geometric data in welding applications. However, camera communication delay leads to temporal desynchronization between captured images and the robot motions. Additionally, hand-eye extrinsic parameters may vary during prolonged measurement. To address these issues, we introduce a measurement model of LVS considering the effect of the camera's time-offset and propose a teaching-free spatiotemporal calibration method utilizing line constraints. This method involves a robot equipped with an LVS repeatedly scanning straight-line fillet welds using S-shaped trajectories. Regardless of the robot's orientation changes, all measured welding positions are constrained to a straight line, represented by Plücker coordinates. Moreover, a nonlinear optimization model based on straight-line constraints is established. Subsequently, the Levenberg-Marquardt algorithm (LMA) is employed to optimize parameters, including time-offset, hand-eye extrinsic parameters, and straight-line parameters. The feasibility and accuracy of the proposed approach are quantitatively validated through experiments on curved weld scanning. We open-sourced the code, dataset, and simulation report at https://anonymous.4open.science/r/LVS_ST_CALIB-015F/README.md.
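
To illustrate the kind of line constraint described above, the sketch below builds Plücker coordinates (direction d, moment m = p × d) for a line through two points and computes a point-to-line distance, which could serve as a residual in a nonlinear least-squares calibration. This is a generic geometric example, not the authors' measurement model; all function names are hypothetical and the time-offset and hand-eye parameters are not modeled.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def plucker_from_points(p, q):
    """Plücker coordinates (d, m) of the line through distinct points p and q."""
    d = tuple(qi - pi for pi, qi in zip(p, q))  # direction
    m = cross(p, d)                             # moment, independent of the point chosen on the line
    return d, m

def point_line_distance(x, line):
    """Distance from point x to the line: ||x × d - m|| / ||d||."""
    d, m = line
    r = tuple(ci - mi for ci, mi in zip(cross(x, d), m))
    return math.sqrt(sum(v * v for v in r)) / math.sqrt(sum(v * v for v in d))
```

For any point on the line, x × d equals m, so the residual vanishes; measured weld positions that drift off the line produce a positive distance that an optimizer such as Levenberg-Marquardt can minimize.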

Abstract:
Flapping wing Micro Air Vehicles (FWMAVs) hold great potential for real-world applications but are currently still hard to model. In this article, a simplified analysis of the equilibrium state of a tailless FWMAV in forward flight is presented. The definition of the equilibrium state complements previous dynamic and stability analysis, adding new information about the flight behavior of FWMAVs. A new aerodynamic decoupled model has been used for the analysis, considering separately the thrust force generated by the flapping movement and the lift and drag caused by the forward velocity. The aerodynamic forces are included in a dynamic model of the FWMAV, and the equilibrium state is derived. The formulation obtained is explicit in terms of the pitch actuator deflection, thus allowing its use for control corrections, and provides an estimation of the flight velocity. The thrust needed to maintain height is also formulated, demonstrating that forward flight is more efficient than hovering. The results are validated experimentally for the pitch angle, showing good agreement with the analytical results. Then, the dynamics of the FWMAV are simulated, comparing the results with experiments where the FWMAV goes from hovering to a specific pitch reference while maintaining its height. Additional simulations are performed with basic control considerations, showing how considering the equilibrium state for a feed-forward control significantly improves the flight behavior compared to PI and PID controllers, reducing the convergence time.

Abstract:
Controlling a flexible wheeled robot for complex tasks such as stair climbing is highly challenging. The nonlinearity inherent in soft materials hinders accurate modeling, creating a trade-off in Reinforcement Learning (RL) between simulation fidelity and learning speed. We propose an RL-friendly, multi-body model that approximates the deformation of the flexible wheel as a Mass-Spring-Damper (MSD) system composed of rigid links and joints. This model enables end-to-end RL within a fast rigid-body simulator, facilitating a blind control policy that relies solely on proprioceptive feedback. To reduce the reality gap and enhance policy robustness, we randomize the main parameters of the MSD system. In real-world experiments, a robot successfully climbed an 18 cm step, corresponding to approximately 51% of the wheel radius, a feat impossible for a rigid-wheeled equivalent. To our knowledge, this is the first successful application of RL-based blind control for stair climbing with a flexible wheeled robot. However, structural limitations in our model and challenges in parameter identification hinder sim-to-real transfer, and improving robustness remains a key issue for future work.
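
As a rough illustration of the mass-spring-damper idea above, the sketch below integrates a single-degree-of-freedom system m·x'' + c·x' + k·x = 0 with semi-implicit Euler. It is a toy stand-in, not the paper's multi-body wheel model; the function name and every parameter value are assumptions. In a domain-randomization setup, `stiffness` and `damping` would be the quantities sampled per training episode.

```python
def simulate_msd(mass, stiffness, damping, x0, v0, dt, steps):
    """Integrate m*x'' + c*x' + k*x = 0 with semi-implicit (symplectic) Euler.

    Returns the list of positions, including the initial one.
    """
    x, v = x0, v0
    trajectory = [x]
    for _ in range(steps):
        a = -(stiffness * x + damping * v) / mass  # acceleration from spring and damper forces
        v += a * dt                                # update velocity first ...
        x += v * dt                                # ... then position with the new velocity
        trajectory.append(x)
    return trajectory
```

Calling `simulate_msd(1.0, 100.0, 2.0, 0.05, 0.0, 0.001, 5000)` simulates five seconds of an underdamped spring released from a 5 cm deflection; the oscillation amplitude decays toward zero.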

Abstract:
Robots and artificial intelligence technologies are becoming increasingly integrated into our daily lives. The introduction of humanoid robots into everyday settings is a gradual but ongoing process, one that society is already beginning to navigate. Yet this shift raises important questions: Who or what is truly behind these physical agents? And can we, as users, perceive differences in our interactions depending on whether a robot acts autonomously or is teleoperated by a human? In this study, we present the results of an experiment in which participants interacted with a robot under two control conditions (autonomous and teleoperated) while it performed two distinct tasks in both static and dynamic movement scenarios. In our results, human operators outperformed autonomous systems in tasks requiring spatial awareness and contextual reasoning. Conversely, the autonomous robot, powered by a Large Language Model and operating without visual input, was perceived more favorably in tasks that demanded rapid access to broad and diverse information.

Abstract:
This paper introduces an innovative observer-based modular control strategy in a class of n_a-degree-of-freedom (DoF) fully electrified heavy-duty robotic manipulators (HDRMs) to (1) guarantee robustness in the presence of uncertainties and disturbances, (2) address the complexities arising from several interacting mechanisms, (3) ensure uniformly exponential stability, and (4) enhance overall control performance. To begin, the dynamic model of HDRM actuation systems, which exploits the synergy between cleantech electromechanical linear actuators (EMLAs) and permanent magnet synchronous motors (PMSMs), is investigated. In addition, the reference trajectories of each joint are computed based on direct collocation with B-spline curves to extract the key kinematic and dynamic quantities of HDRMs. To guarantee robust tracking of the computed trajectories by the actual motion states, a novel control methodology, called robust subsystem-based adaptive (RSBA) control, is enhanced through an adaptive state observer. The RSBA control addresses inaccuracies inherent in motion, including modeling errors, non-triangular uncertainties, and both torque and voltage disturbances, to which the EMLA-driven HDRM is susceptible. The proposed RSBA control performance is validated through simulations and experiments of the scrutinized PMSM-powered EMLA-actuated mechanisms.

Abstract:
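Gravity compensation has been widely applied in lower-limb exoskeletons to reduce leg load and relieve muscle fatigue. Passive compensation offers inherent safety and lightweight advantages; however, existing designs are often limited by the bulk and complexity of spring-based mechanisms, which compromises the compactness and reliability of the exoskeleton. To address these limitations, this work targets a wearable passive bilateral lower-limb exoskeleton that is compact, simple in design, and robust. A mapping method based on the composite center of mass is first introduced and then extended to a bilateral configuration through a differential structure. By incorporating an adaptive constant-force mechanism, the final system maintains structural simplicity and space efficiency, making it suitable for wearable applications with limited integration space. An exoskeleton prototype is built based on the proposed mechanism, and its performance is evaluated through experiments.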

Abstract:
To empower mobile robots with usable maps as well as highest state estimation accuracy and robustness, we present OKVIS2-X: a state-of-the-art multi-sensor Simultaneous Localization and Mapping (SLAM) system building dense volumetric occupancy maps, while scalable to large environments and operating in real time. Our unified SLAM framework seamlessly integrates different sensor modalities: visual, inertial, measured or learned depth, LiDAR and Global Navigation Satellite System (GNSS) measurements. Unlike most state-of-the-art SLAM systems, we advocate using dense volumetric map representations when leveraging depth or range-sensing capabilities. We employ an efficient submapping strategy that allows our system to scale to large environments, showcased in sequences of up to 9 kilometers. OKVIS2-X enhances its accuracy and robustness by tightly coupling the estimator and submaps through map alignment factors. Our system provides globally consistent maps, directly usable for autonomous navigation. To further improve the accuracy of OKVIS2-X, we also incorporate the option of performing online calibration of camera extrinsics. Our system achieves the highest trajectory accuracy in EuRoC against state-of-the-art alternatives, outperforms all competitors in the Hilti22 VI-only benchmark, while also proving competitive in the LiDAR version, and showcases state-of-the-art accuracy in the diverse and large-scale sequences from the VBR dataset.

Abstract:
Learning diverse policies for non-prehensile manipulation is essential for improving skill transfer and generalization to out-of-distribution scenarios. In this work, we enhance exploration through a two-fold approach within a hybrid framework that tackles both discrete and continuous action spaces. First, we model the continuous motion parameter policy as a diffusion model, and second, we incorporate this into a maximum entropy reinforcement learning framework that unifies both the discrete and continuous components. The discrete action space, such as contact point selection, is optimized through Q-value function maximization, while the continuous part is guided by a diffusion-based policy. This hybrid approach leads to a principled objective, where the maximum entropy term is derived as a lower bound using structured variational inference. We propose the Hybrid Diffusion Policy algorithm (HyDo) and evaluate its performance on both simulation and zero-shot sim2real tasks. Our results show that HyDo encourages more diverse behavior policies, leading to significantly improved success rates across tasks - for example, increasing from 53% to 72% on a real-world 6D pose alignment task. Project page: https://leh2rng.github.io/hydo

Abstract:
Robots that simulate swallowing hold the potential to enhance our understanding of the swallowing process, support the development of safer texture- or viscosity-modified foods and beverages, and act as medical education tools for both patients with dysphagia and healthcare professionals. Although robotic models in the literature offer insightful actuation mechanisms, many tackle only isolated stages of swallowing, have reduced physiological accuracy, and tend to be mechanically complex and costly. This paper addresses these limitations by developing a novel robotic model that replicates the oral and pharyngeal stages of swallowing, featuring a passive system to simulate the protective closure of the epiglottis. This paper presents the design, function and experimental validation of the robot model. The proposed model can transport thickened fluids from the tongue to the pharynx, preventing aspiration. By enabling passive epiglottis closure, this model advances the physiological fidelity of swallowing robotics, offering insights into actuation mechanisms for future studies.

Abstract:
This paper presents a mechanistic analysis of stiffness in cable-driven serpentine manipulators (CDSMs), incorporating both cable tension and cable stiffness. First, we derive an analytical stiffness model based on robot statics, identifying cable tension and stiffness as the dominant factors governing robot stiffness at a given configuration. Crucially, we characterize a previously overlooked tension-stiffness coupling effect: cable tension induces nonlinear stiffness variations in driving cables, significantly altering overall robot stiffness. Due to this interdependence, quantifying cable tension's specific influence on stiffness remains a challenging research gap. To address this, simulations and experiments validate the model and quantify their coupled effects on robotic stiffness. Results demonstrate that for CDSMs using multi-strand cables with nonlinear stiffness, robot stiffness increases sharply with rising tension. Conversely, when cable elasticity is constant, robot stiffness decreases with increasing tension. These findings provide critical insights for advancing stiffness control accuracy in CDSMs.

Abstract:
Real-world drone experiments are essential but often constrained by safety, regulations, and space, while simulations fail to capture key physical effects. This paper presents a compact indoor testbed that bridges this gap by integrating physical drone dynamics with display-based virtual environments. A real drone interacts with displayed scenes on screens, enabling closed-loop image-based control. By combining physical motion with virtual environment shifting, the system supports flexible and controllable experiments. The effectiveness of the proposed testbed is demonstrated through three scenarios: drone delivery, farmland monitoring, and disaster response.

Abstract:
Robotic manipulators have the capability to engage in physical interaction with human operators, sharing not only the same workspace but also offering physical assistance to alleviate the human physical workload. In this study, we explore whether a robot should act as a collaborator or a cooperator in a co-manipulation task with a human partner, and investigate different collaboration strategies. In a previous study, we addressed the same question for a human-human dyad and found that collaboration is preferable to make fewer errors at the expense of increased arm stiffness for the humans. In our current investigation, a human physically interacts with a Franka robot in various co-manipulation conditions. In the cooperation conditions, the robot is either a leader or a follower, exhibiting fixed impedance strategies. In the collaborative conditions, the robot exhibits either reciprocal or mirrored adaptive impedance strategies that vary according to an online EMG-based function of the human arm stiffness. Our findings indicate that, for co-manipulation tasks, a robot collaborator is preferable to a robot cooperator (leader or follower), similarly to human dyads. However, unlike the behavior observed within human dyads, the reciprocal strategy for impedance adjustment appears to be the most effective for human-robot collaboration.

Abstract:
While autonomous multi-robot teams can achieve safe and coordinated navigation, they often struggle to adapt to unforeseen conditions and to capture operator-driven objectives in unstructured environments. We present a Virtual Reality (VR)-based shared control framework for teams of drones operating in constrained and unknown environments, enabling real-time, user-guided exploration. Our approach integrates a novel user-guided motion-primitive-based planner with an admittance controller, generating dynamically feasible, collision-free trajectories while allowing the operator to flexibly influence team behavior. By leveraging user input, the framework enables the robot team to explore regions of interest that autonomous planners may overlook. The system supports mixed-reality operations with both physical and simulated drones, and implements a bilateral VR-based interface, allowing the operator to guide the robot team via migration points while receiving immediate visual feedback of the team state. Experimental results show that shared control improves obstacle avoidance, maintains inter-agent spacing, and reduces operator effort, demonstrating the feasibility and advantages of immersive, human-in-the-loop swarm navigation.

Abstract:
This paper introduces ByteWrist, a novel highly flexible and anthropomorphic parallel wrist for robotic manipulation. ByteWrist addresses the critical limitations of existing serial and parallel wrists in narrow-space operations through a compact three-stage parallel drive mechanism integrated with arc-shaped end linkages. The design achieves precise RPY (Roll-Pitch-Yaw) motion while maintaining exceptional compactness, making it particularly suitable for complex unstructured environments such as home services, medical assistance, and precision assembly. The key innovations include: (1) nested three-stage motor-driven linkages that minimize volume while enabling independent multi-DOF control, (2) arc-shaped end linkages that optimize force transmission and expand motion range, and (3) a central supporting ball functioning as a spherical joint that enhances structural stiffness without compromising flexibility. In addition, we present comprehensive kinematic modeling, including forward/inverse kinematics and a numerical Jacobian solution for precise control. Empirically, ByteWrist demonstrates strong performance in narrow-space maneuverability and dual-arm cooperative manipulation tasks, outperforming Kinova-based systems. Results indicate significant improvements in compactness, efficiency, and stiffness compared to traditional designs, establishing ByteWrist as a promising solution for next-generation robotic manipulation in constrained environments.

Abstract:
This paper presents a novel learning-based approach for online estimation of maximal safe sets for local trajectory planning in unknown static environments. The neural representation of a set is used as the terminal set constraint for a model predictive control (MPC) local planner, resulting in improved recursive feasibility and safety. To achieve real-time performance and desired generalization properties, we employ the idea of hypernetworks. We use the Hamilton-Jacobi (HJ) reachability analysis as the source of supervision during the training process, allowing us to consider general nonlinear dynamics and arbitrary constraints. The proposed method is extensively evaluated against relevant baselines in simulations for different environments and robot dynamics. The results show a success rate increase of up to 52% compared to the best baseline while maintaining comparable execution speed. Additionally, we deploy our proposed method, NTC-MPC, on a physical robot and demonstrate its ability to safely avoid obstacles in scenarios where the baselines fail.

Abstract:
Mobile robot teams often require decentralised autonomous navigation through narrow gaps in limited-communication environments (e.g., underground search-and-rescue operations). Existing navigation approaches exhibit suboptimal performance for avoiding multi-robot collisions in such bottlenecks due to an inability to address the dynamic nature of the robots. Initial work utilising reinforcement learning has demonstrated success in navigating a single robot through narrow gaps. However, when training agents to produce give-way behaviour for navigating through constrained gaps, end-to-end reinforcement learning using simple rewards suffers from slow convergence due to the increased search space of viable policies. This paper introduces a novel curriculum reinforcement learning framework, incorporating a multi-robot bootstrap curriculum with preprogrammed behaviour to guide initial policy formation, subsequently refined by a gap curriculum that progressively reduces training complexity towards an optimal policy. This framework learns multi-robot interaction behaviours, which are impractical to program manually. Our model achieves a 99% success rate in give-way behaviour generation without inter-agent communications in high-fidelity simulations. The success rate dropped to 73% in simulations incorporating noisy sensors, and to 60% in field-robot tests, substantiating our model's practical viability despite sensor noise and real-world uncertainties. The simple benchmark methods lack efficiency in basic interaction behaviours.

Abstract:
Multi-camera systems hold considerable promise for enhancing visual odometry by expanding the field of view, yet simply adding more cameras does not guarantee higher accuracy. Because increasing the number of cameras also raises the likelihood of degraded or misaligned views, appropriate handling is essential to prevent severe outliers and corrupted global pose estimates. Previous methods discard points in back-end optimization based on residuals, which has been a bottleneck for real-time performance since erroneous measurements are inevitably incorporated into the main pipeline before removal. In response, we propose a direct Multi-RGB-D Inertial Odometry framework driven by confidence-based weighting, which adaptively down-weights unreliable cameras based on photometric quality and viewpoint alignment. To manage the heavy data load typical of multi-camera setups, we also incorporate a motion-guided selection strategy, filtering out non-informative points before costly alignment. This early pruning reduces computation yet retains critical constraints for odometry. By combining these techniques, our system achieves robust, scale-consistent pose estimation in real time, even with four cameras, as validated through challenging indoor-outdoor experiments involving saturation, occlusions, low-light conditions, and severe glare. We publicly release our multi-RGB-D-inertial dataset at https://github.com/seungsang07/multi-rgbd-inertial-dataset.

Abstract:
RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross-modal feature fusion. This results in limited thermal feature extraction and suboptimal cross-modal fusion, while the redundant encoders further compromise the model's real-time efficiency. To address the above issues, we propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion. By leveraging large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB-T local module to strengthen the encoder's capacity for cross-modal local feature fusion. The RGB-T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB-T modalities. Experimental results show that TUNI achieves competitive performance with state-of-the-art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real-time capability in deployment. Codes are available at https://github.com/xiaodonguo/TUNI.

Abstract:
In the vision of smart cities, technologies are being developed to enhance the efficiency of urban services and improve residents' quality of life. However, most existing research focuses on optimizing individual services in isolation, without adequately considering the reciprocal interactions among heterogeneous urban services that could yield higher efficiency and improved resource utilization. For example, human couriers could collect traffic and air quality data along their delivery routes, while sensing robots could assist with on-demand delivery during peak hours, enhancing both sensing coverage and delivery efficiency. However, the joint optimization of different urban services is challenging due to their potentially conflicting objectives and real-time coordination in dynamic environments. In this paper, we propose UrbanHuRo, a two-layer human-robot collaboration framework for joint optimization of heterogeneous urban services, demonstrated through the examples of crowdsourced delivery and urban sensing. There are two innovative designs in UrbanHuRo, i.e., (i) a scalable distributed MapReduce-based K-Submodular maximization module for efficient order dispatch and (ii) a deep submodular reward reinforcement learning algorithm for sensing route planning. Experimental evaluations on real-world datasets from a food delivery platform demonstrate that our UrbanHuRo improves sensing coverage by 29.7% and courier income by 39.2% on average in most settings, while also significantly reducing the number of overdue orders.

Abstract:
Learning from Demonstration (LfD) for contact-rich tasks faces a fundamental challenge: arbitrating between tracking a demonstrated trajectory and reproducing an interaction force. This paper introduces a novel one-shot LfD framework that resolves this conflict by leveraging the operator's grip force as an intuitive, continuous signal for arbitration. This signal allows the controller to seamlessly transition between a trajectory-tracking impedance controller and a force-tracking admittance controller, prioritizing path accuracy when the demonstrated grip was light and interaction force fidelity when it was firm. To ensure verifiably safe interaction, the adaptive control law is integrated within a dual-layer passivity assurance framework. This mechanism intelligently distributes potentially non-passive energy between an energy tank and adaptive null-space dissipation to guarantee energetic stability. The proposed framework was experimentally validated on a 7-DOF manipulator, demonstrating that the controller autonomously reproduces interaction forces and shows significant robustness against environmental position uncertainties, a scenario where conventional impedance controllers can fail.

Abstract:
Robot failures during collaborative tasks can frustrate users and reduce trust. To address this, we developed a failure communication module that combines large language models (LLMs) with Behavior Trees (BTs) to generate interactive, context-aware explanations for task failures. The module supports three key processes: (1) initial (high/medium/low) leveled explanations, (2) interactive clarifications for user follow-up questions, and (3) explicit verification of user actions to close the recovery loop. By leveraging the BT structure and persistent interaction history, it generates responsive, multi-turn explanations and reduces redundancy for repeated failures. We implemented and evaluated this module in real-time robotic pick-and-place tasks as a user study with 33 participants across three high/medium/low explanation conditions. The user study showed that the module improved resolution rates for challenging failures and reduced resolution times for simpler failures, demonstrating the effectiveness of LLM-powered, BT-grounded explanations in human-robot collaboration (HRC).

Abstract:
In dual-arm dexterous teleoperation, cross-platform generalization of motion retargeting and interactivity of grasping are crucial. However, the heterogeneity of robotic architectures and the wide variety of objects to be grasped pose significant challenges to achieving precise motion retargeting and compliant grasping. To address these challenges, a dual-arm dexterous teleoperation system (DexTele) is proposed based on motion retargeting and adaptive force control. First, a vision-based motion retargeting module is designed to generate preliminary robot motions from human images. In this module, a motion-graph encoder and latent optimization are proposed for precise and convenient cross-platform motion retargeting. Second, an adaptive grasping module is designed to achieve compliant grasping. This module combines a vision-language model (VLM) with model predictive control (MPC), allowing the system to predict the required grasping force for a target object and perform gradient-based online optimization. Finally, extensive experiments demonstrate that DexTele achieves precise motion retargeting and compliant grasping with generalization across multiple robot platforms. The source code will be released upon paper acceptance.

Abstract:
Enabling robots to manipulate articulated objects is essential for their successful integration into human-centric environments. Such manipulation is often part of a larger multistep task, where achieving a specific joint configuration is necessary for subsequent actions. For example, in a cluttered environment, a cabinet door must be rotated to a precise angle that creates just enough clearance to retrieve an object, beyond which it would collide with surrounding obstacles. In this work, we present an approach to learning goal-directed policies for articulated object manipulation using force and proprioceptive feedback. We formulate the manipulation problem as a Partially Observable Markov Decision Process (POMDP) with a continuous state space and a set of low-level control actions. Due to the limitations of standard POMDP solvers in this setting, we introduce Planning using Belief Summaries (PuBS), which approximates the POMDP as a Markov Decision Process (MDP) over compact particle-filter belief summaries encoding estimated state and uncertainty. This approximate MDP is then solved using reinforcement learning techniques to learn goal-directed policies that enable safe exploration while efficiently guiding the object toward the goal. We evaluate our approach through simulation and real-world robotic experiments, demonstrating reliable goal-reaching performance.

Abstract:
Tool-based scooping is vital in robot-assisted tasks, enabling interaction with objects of varying sizes, shapes, and material states. Recent studies have shown that flexible, reconfigurable soft robotic end-effectors can adapt their shape to maintain consistent contact with container surfaces during scooping, improving efficiency compared to rigid tools. These soft tools can adjust to varying container sizes and materials without requiring complex sensing or control. However, the inherent compliance and complex deformation behavior of soft robotics introduce significant control complexity that limits practical applications. To address this challenge, this paper presents the development of a physics-based simulation model of a deformable soft conical robotic hand that captures its passive reconfiguration dynamics and enables systematic trajectory optimization for scooping tasks. We propose a novel physics-based simulation approach that accurately models the soft tool's morphing behavior from flat sheets to adaptive conical structures, combined with an evolutionary strategy framework that automatically optimizes scooping trajectories without manual parameter tuning. We validate the optimized trajectories through both simulation and real-robot experiments. The results demonstrate strong generalization and successfully address a range of challenging tasks previously beyond the reach of existing approaches. Videos of our experiments are available online: https://sites.google.com/view/scoopsh

Abstract:
Motion planning for robotic manipulators is a fundamental problem in robotics. Classical optimization-based methods typically rely on the gradients of signed distance fields (SDF) to impose collision-avoidance constraints. However, these methods are susceptible to local minima and may fail when the SDF gradients vanish. Recently, Configuration Space Distance Fields (CDF) have been introduced, which directly model distances in the robot's configuration space. Unlike workspace SDF, CDF are differentiable almost everywhere and thus provide reliable gradient information. On the other hand, gradient-free approaches such as Model Predictive Path Integral (MPPI) control leverage long-horizon rollouts to achieve collision avoidance. While effective, these methods are computationally expensive due to the large number of trajectory samples, repeated collision checks, and the difficulty of designing cost functions with heterogeneous physical units. In this paper, we propose a framework that integrates CDF with MPPI to enable direct navigation in the robot's configuration space. Leveraging CDF gradients, we unify the MPPI cost in joint space and reduce the horizon to one step, substantially cutting computation while preserving collision avoidance in practice. We demonstrate that our approach achieves nearly 100% success rates in 2D environments and consistently high success rates in challenging 7-DOF Franka manipulator simulations with complex obstacles. Furthermore, our method attains control frequencies exceeding 750 Hz, substantially outperforming both optimization-based and standard MPPI baselines. These results highlight the effectiveness and efficiency of the proposed CDF-MPPI framework for high-dimensional motion planning.
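
To make the one-step sampling idea concrete, the following is a minimal sketch of a generic MPPI-style update with a horizon of one step, for a 2D point moving in its configuration space. It is not the CDF-MPPI method itself: the analytic circle distance stands in for a learned distance field, no distance-field gradients are used, and every name and parameter (`circle_distance`, `mppi_one_step`, `sigma`, `lam`, `margin`, `penalty`) is a hypothetical choice for illustration.

```python
import math
import random


def circle_distance(q, center=(1.0, 0.6), radius=0.3):
    """Signed distance from q to a circular obstacle (assumed stand-in for a distance field)."""
    return math.hypot(q[0] - center[0], q[1] - center[1]) - radius


def mppi_one_step(q, goal, dist_fn, rng, n_samples=256, sigma=1.0,
                  lam=0.02, dt=0.1, margin=0.05, penalty=1e3):
    """Return a velocity command as the softmin-weighted mean of sampled controls."""
    samples, costs = [], []
    for _ in range(n_samples):
        # Sample a candidate velocity around a zero nominal control.
        u = (rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        q_next = (q[0] + dt * u[0], q[1] + dt * u[1])
        # One-step cost: distance to goal plus a penalty for violating the safety margin.
        goal_cost = math.hypot(q_next[0] - goal[0], q_next[1] - goal[1])
        collision_cost = penalty * max(0.0, margin - dist_fn(q_next))
        samples.append(u)
        costs.append(goal_cost + collision_cost)
    # Subtract the minimum cost before exponentiating for numerical stability.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(weights)
    ux = sum(w * u[0] for w, u in zip(weights, samples)) / total
    uy = sum(w * u[1] for w, u in zip(weights, samples)) / total
    return (ux, uy)


rng = random.Random(0)
q, goal = (0.0, 0.0), (1.0, 0.0)
u = mppi_one_step(q, goal, circle_distance, rng)
q_next = (q[0] + 0.1 * u[0], q[1] + 0.1 * u[1])
```

Because the command is a weighted average of samples, a purely sampled one-step update like this can be greedy and may stall or average across both sides of an obstacle, which is one reason gradient information from a distance field is attractive.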

Abstract:
Recent advancements in robotics have focused on developing foundation models capable of generating both actions and future states. Typically, these policies leverage world models to depict human-like imagination. However, most methods remain confined to the 2D domain, where they forecast only the final outcome state rather than the evolving interaction process, thereby offering limited guidance for step-by-step control. To address these limitations, we propose a hierarchical framework that couples 3D imagination, 3D perception, and action generation. A triplane-based world model captures future scene dynamics in a computationally efficient manner, providing predictive cues for decision-making. Based on these representations, the action expert, implemented with a flow-based policy network, converts the outputs of 3D imagination and perception into executable commands. We further introduce an adaptive Classifier-Free Guidance strategy to balance action quality with condition adherence. On Adroit, Meta-World, and real-world tasks, our method achieves a 92% voxel IoU in future state prediction and up to 8% higher success rates than state-of-the-art baselines. The performance gains highlight the effectiveness and generalizability of our method in complex robotic manipulation.
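
For readers unfamiliar with Classifier-Free Guidance, the sketch below shows only the standard, fixed-weight combination of an unconditional and a conditional prediction. The abstract does not specify how the guidance weight is adapted, so no adaptive rule is shown; the function name `cfg_combine` and the use of plain Python lists are assumptions for illustration.

```python
def cfg_combine(v_uncond, v_cond, w):
    """Standard classifier-free guidance: v_uncond + w * (v_cond - v_uncond).

    w = 0 returns the unconditional prediction, w = 1 the conditional one,
    and w > 1 extrapolates further toward the condition.
    """
    return [u + w * (c - u) for u, c in zip(v_uncond, v_cond)]


# Hypothetical 2-D action-velocity predictions from a flow-based policy.
guided = cfg_combine([0.0, 1.0], [1.0, 1.0], w=2.0)
```

Larger weights push the output toward the conditioning signal at the possible expense of sample quality, which is the trade-off an adaptive strategy would aim to balance.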

Abstract:
MorphoBall is a bio-inspired, deformable spherical robot designed for multimodal locomotion across terrestrial and aquatic environments. By integrating a dual-mode drive system (spherical rolling and differential-drive) with a morphology-mediated propulsion mechanism, MorphoBall achieves adaptive mobility in diverse terrains, including flat ground, slopes, and water surfaces. A key innovation lies in the dual-function ciliary band, which provides both passive damping during terrestrial rolling and active propulsion during aquatic navigation. Model-based controllers are developed to regulate forward velocity, trajectory curvature, and roll tilt angle, demonstrating superior stability and responsiveness compared to baseline PID implementations. Experimental results validate MorphoBall's ability to autonomously navigate structured indoor environments and traverse unstructured outdoor terrains, achieving seamless mode transitions and completing missions 34% faster than single-morphology strategies. This work highlights the potential of morphology adaptation as a tool for enhancing environmental adaptability in mobile robotics.

Abstract:
In this work, we propose a novel Mamba block, DenVisCoM, as well as a novel hybrid architecture specifically tailored for accurate, real-time estimation of optical flow and disparity. Given that such multi-view geometry and motion tasks are fundamentally related, we propose a unified architecture to tackle them jointly. Specifically, the proposed hybrid architecture is based on DenVisCoM and a Transformer-based attention block, and it efficiently addresses real-time inference, memory footprint, and accuracy at the same time for joint estimation of motion and 3D dense perception tasks. We extensively analyze the trade-off between accuracy and real-time processing on a large number of datasets. Our experimental results and related analysis suggest that our proposed model can accurately estimate optical flow and disparity in real time. All models and associated code are available at https://github.com/vimstereo/DenVisCoM.

Abstract:
A key challenge towards reliable robotic control is devising computational models that can both learn policies and guarantee robustness when deployed in the field. Inspired by the free energy principle in computational neuroscience, we address this challenge by proposing a model for policy computation that jointly learns environment dynamics and rewards, while ensuring robustness to epistemic uncertainties. Expounding a distributionally robust free energy principle, we propose a modification to the maximum diffusion learning framework. After explicitly characterizing the robustness of our policies to epistemic uncertainties in both environment and reward, we validate their effectiveness on continuous-control benchmarks, via both simulations and real-world experiments involving manipulation with a Franka Research 3 arm. Across simulation and zero-shot deployment, our approach narrows the sim-to-real gap and enables repeatable tabletop manipulation without task-specific fine-tuning.

Abstract:
In many applications of social navigation, existing works have shown that predicting and reasoning about human intentions can help robotic agents make safer and more socially acceptable decisions. In this work, we study this problem for autonomous valet parking (AVP), where an autonomous vehicle ego agent must drop off its passengers, explore the parking lot, find a parking spot, negotiate for the spot with other vehicles, and park in the spot without human supervision. Specifically, we propose an AVP pipeline that selects parking spots by explicitly predicting where other agents are going to park from their motion history using learned models and probabilistic belief maps. To test this pipeline, we build a simulation environment with reactive agents and realistic modeling assumptions on the ego agent, such as occlusion-aware observations, and imperfect trajectory prediction. Simulation experiments show that our proposed method outperforms existing works that infer intentions from future predicted motion or embed them implicitly in end-to-end models, yielding better results in prediction accuracy, social acceptance, and task completion. Our key insight is that, in parking, where driving regulations are more lax, explicit intention prediction is crucial for reasoning about diverse and ambiguous long-term goals, which cannot be reliably inferred from short-term motion prediction alone, but can be effectively learned from motion history.

Abstract:
This paper presents the SoftHand Model-W: a 3D-printed, underactuated, anthropomorphic robot hand based on the Pisa/IIT SoftHand, with an integrated antagonistic tendon mechanism and a 2-degree-of-freedom tendon-driven wrist. These four degrees of actuation provide active flexion and extension to the five fingers, and active flexion/extension and radial/ulnar deviation of the palm through the wrist, while preserving the synergistic and self-adaptive features of such SoftHands. A carpal tunnel-inspired tendon routing allows remote motor placement in the forearm, reducing distal inertia and maintaining a compact form factor. The SoftHand-W is mounted on a 6-axis robot arm and tested with two reorientation tasks requiring coordination between the hand and arm's pose: cube stacking and in-plane disc rotation. Results comparing task time, arm joint travel, and configuration changes with and without wrist actuation show that adding the wrist reduces compensatory and reconfiguration movements of the arm for a quicker task-completion time. Moreover, the wrist enables pick-and-place operations that would be impossible otherwise. Overall, the SoftHand Model-W demonstrates how proximal degrees of freedom are key to achieving versatile, human-like manipulation in real-world robotic applications, with a compact design enabling deployment in research and assistive settings.

Abstract:
Dexterous in-hand manipulation remains a foundational challenge in robotics, with progress often constrained by the prevailing paradigm of imitating the human hand. This anthropomorphic approach creates two critical barriers: 1) it limits robotic capabilities to tasks humans can already perform, and 2) it makes data collection for learning-based methods exceedingly difficult. Both challenges are caused by traditional force-closure which requires coordinating complex, multi-point contacts based on friction, normal force, and gravity to grasp an object. In this work, we propose a paradigm shift: moving away from replicating human mechanics toward the design of novel robotic embodiments. We introduce the Suction Leap Hand (SLeap Hand), a multi-fingered hand featuring integrated fingertip suction cups that realize a new form of suction-enabled dexterity. By replacing complex force-closure grasps with stable, single-point adhesion, our design fundamentally simplifies in-hand teleoperation and facilitates the collection of high-quality demonstration data. More importantly, this suction-based embodiment unlocks a new class of dexterous skills that are difficult or even impossible for the human hand, such as one-handed paper cutting and in-hand writing. Our work demonstrates that by moving beyond anthropomorphic constraints, novel embodiments can not only lower the barrier for collecting robust manipulation data but also enable the stable, single-handed completion of tasks that would typically require two human hands. Our webpage is https://sites.google.com/view/sleaphand.

Abstract:
The grasping capabilities of robotic arms have been extensively studied by researchers in terms of accuracy, flexibility, and versatility, enabling robots to perform various tasks in domestic, industrial, and medical scenarios. However, grasping flat objects has remained a significant challenge and is often overlooked as a limiting case in robotic manipulation. To address this highly difficult task, this paper proposes a novel gripper design that combines a pneumatic soft gripper with a unilateral fingernail structure. We evaluate two grasping strategies and corresponding target generation methods tailored for this design. The proposed system significantly improves the success rate of stably grasping flat objects that lie flush on tables. Moreover, its soft interaction with the table surface reduces the need for highly precise object-table height detection, thereby saving computational time and cost. Finally, we conduct experimental tests on various common flat objects, validating the effectiveness of both the gripper design and the grasping strategies.

Abstract:
Transformable robots adapt to various environments by changing their shape or functionality. The robots are able to further expand their task range by replacing their end-effectors (EEs). In this paper, we propose an adaptive-limb transformable robot capable of replacing multiple types of mounted EEs. First, 7 degrees of freedom (DoF) limbs can reach multiple types of EEs mounted on the front body surface, and replace them using that single limb without relying on external devices. Second, we develop a compact Lock-Spin mechanism that integrates a locking mechanism into the rotor of the motor to enable continuous rotation. Experimental results demonstrate that the proposed transformable robot can replace EEs on-site and that this replacement enables locomotion and manipulation adapted to the environment.

Abstract:
This paper presents a tightly coupled Rao-Blackwellized particle filter (TC-RBPF) for global navigation satellite system (GNSS) positioning that eliminates the need for carrier-phase integer ambiguity resolution. The previously proposed loosely coupled RBPF (LC-RBPF) approach uses carrier-phase residuals to estimate particle likelihoods, enabling positioning without integer ambiguity resolution. However, the position estimation accuracy depends on the performance of the state transition. The previous approach estimates velocity using a Kalman filter (KF) based on least-squares Doppler measurements, which are vulnerable to non-line-of-sight (NLOS) multipath errors. This often leads to complete positioning failure in urban environments. To overcome these limitations, the proposed TC-RBPF tightly integrates raw Doppler measurements into the KF. This enables consistent estimation of both velocity and receiver clock drift within a time-series framework. Furthermore, a robust KF based on Student's t-distribution and particle-wise NLOS rejection using double-differenced pseudorange residuals are introduced to mitigate the impact of outliers. Together, these mechanisms enhance outlier robustness and transition reliability. Experimental evaluations in six challenging urban scenarios demonstrate that the proposed method achieves superior positioning performance compared to existing methods, confirming its effectiveness under degraded GNSS conditions.

Abstract:
Mobile robots demand power efficiency as well as accuracy and high performance in their computations. Embedded microcontrollers and FPGAs, which can consume as much as 1000x less power than large CPUs and GPUs, offer a promising solution to these power needs. However, these power-efficient platforms often lack full floating-point support and rely on fixed-point computations to deliver performance. This is a challenge as most robotics software uses floating-point datatypes (double, float) to conservatively ensure accuracy, and prior works that use fixed-point types employ unreliable ad hoc approaches to select the datatype precision (i.e., quantity and allocation of bits). We address this challenge with the RoboPrec framework, where we: (i) develop a transpiler that integrates code transformations and robot-specific code generation with traditional numerical stability analysis methods (which calculate guaranteed error bounds), and adapts them to be practical and usable for real-world robotics software; and then leverage this to (ii) generate guaranteed-accuracy fixed-point code that is deployable to embedded computing platforms. We use rigid body dynamics, a fundamental robotics workload, as a motivating case study. We find that RoboPrec-generated 32-bit fixed-point code can be up to 8x faster than float and 122x faster than double on embedded processors while, critically, also providing guaranteed accuracy bounds with lower worst-case error than float. This work provides a foundation for practical and reliable low-power embedded robotics computing.
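
As background on what "quantity and allocation of bits" means for a fixed-point type, here is a minimal sketch of fixed-point arithmetic in a Q16.16 layout (16 integer bits, 16 fractional bits). The format choice and the helper names (`to_fixed`, `to_float`, `fx_mul`, `fx_dot`) are assumptions for illustration and do not represent RoboPrec's generated code or its error analysis. Python integers are unbounded, so this sketch does not model 32-bit overflow.

```python
FRAC_BITS = 16          # assumed number of fractional bits (Q16.16)
SCALE = 1 << FRAC_BITS  # 2**16


def to_fixed(x: float) -> int:
    """Quantize a float to the nearest fixed-point value."""
    return int(round(x * SCALE))


def to_float(q: int) -> float:
    """Convert a fixed-point value back to a float."""
    return q / SCALE


def fx_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values.

    The raw product carries 2 * FRAC_BITS fractional bits, so it is shifted
    right by FRAC_BITS (rounding toward negative infinity).
    """
    return (a * b) >> FRAC_BITS


def fx_dot(xs, ys) -> int:
    """Fixed-point dot product; addition needs no rescaling."""
    return sum(fx_mul(x, y) for x, y in zip(xs, ys))


# Quantization error of a single value is at most half of one step, 2**-(FRAC_BITS + 1).
err = abs(to_float(to_fixed(0.1)) - 0.1)
```

Moving bits from the integer part to the fractional part shrinks the quantization step but also shrinks the representable range, which is why the choice of format matters for accuracy guarantees.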

Abstract:
Flexible needles provide enhanced adaptability for navigating puncture pathways and avoiding obstacles when compared to conventional rigid needles. However, developing a three-dimensional (3D) curved path for a flexible needle is challenging, particularly in achieving both effective obstacle avoidance and precise targeting. To this end, we propose an improved particle swarm optimization-based path planning approach that incorporates good point set initialization and a heuristic multi-mutation strategy. This combination greatly enhances the algorithm's global search capability while ensuring fast convergence. In addition, 3D biarc curve fitting is employed to develop a kinematically reachable path for bevel-tip needles. Obstacle-avoidance simulations demonstrate the superior performance of the proposed method against state-of-the-art algorithms in terms of path length, distance to obstacles, repeatability, and avoidance of local minima. Needle puncturing experiments performed using duty-cycling control achieved a small curvature radius of 49.6 mm and targeting errors of less than 4 mm. This algorithm facilitates efficient variable-curvature path planning for flexible needles, ensuring precise targeting while effectively avoiding obstacles.

Abstract:
Estimating the 6D pose of rigid objects is a critical upstream task in many robotics applications. Most existing methods rely on RGB or RGB-D sensing modalities, which suffer from limitations under challenging lighting conditions and high-speed motion. In contrast, event-based cameras offer unique advantages such as high temporal resolution and high dynamic range, making them well-suited for such scenarios. However, current event-based pose estimation methods are typically optimization-based, designed for relatively simple objects, and require hand-crafted parameters. In this work, we introduce the first learning-based approach for 6D object pose estimation using event cameras, employing an Augmented Event Encoder (AEE) trained entirely on synthetic data and validated on the E-POSE dataset. Our model leverages an augmented autoencoder with domain randomization to map synthetic templates into a latent space, enabling accurate matching with real event query images. The method demonstrates robust performance across various scenarios, including changes in illumination and camera speeds, and achieves strong results on the ADD-S (Rotation) metric.

Abstract:
The application of mobile mapping systems (MMS) has increased continuously in the last decades in fields like infrastructure or ecosystem monitoring. Equipped with multiple laser scanners and cameras, these systems can generate high-resolution 3D point clouds of the environment in a short time. In this process, the accuracy of the trajectory of the system is of central importance as it directly affects the accuracy of the resulting point cloud. However, since the trajectory estimation depends on sensor observations that are often affected by unknown systematic errors, the actual accuracy of the trajectory remains mainly unknown. To address this gap in trajectory accuracy assessment, we present a method to create ground truth trajectories for mobile mapping systems by integrating millimeter-accurate total station measurements. We mount two 360-degree prisms on a mobile platform, track them with two Robotic Total Stations (RTS) during motion, and fuse these prism measurements with the readings from an Inertial Measurement Unit (IMU) using a factor graph-based trajectory estimation approach. To evaluate the quality of this ground truth trajectory, we record repeated measurements on a closed-loop rail track close to Bonn, Germany. The results show that the generated ground truth trajectory estimated with RTS and IMU data achieves a precision of around 1 mm in position and 0.05° in orientation. To show the potential of the method, we detect systematic deviations of an example MMS that uses Real-Time Kinematic Global Navigation Satellite System (RTK-GNSS) and IMU data for trajectory estimation. The results show that even under good GNSS conditions, the ground truth trajectory has significantly better precision and fewer systematic errors than the trajectory based on RTK-GNSS and IMU data.

Abstract:
Control Barrier Functions (CBFs) provide a powerful framework for enforcing real-time safety in control systems and have seen increasing applications in safety-critical domains, such as surgical robotics. In vitreoretinal microsurgery, where precision and tissue protection are crucial, we propose a shared control approach that leverages CBFs to maintain the robot's end-effector within a safe zone above the retina. Using real-time 3D reconstruction from an instrument-integrated Optical Coherence Tomography (iiOCT) system mounted on the surgical tool, we define a safety band between two offset surfaces derived from the reconstructed retina. A hybrid controller drives the tool into the band when outside and then enforces forward invariance using a CBF-based quadratic program. Concurrently, haptic feedback proportional to the deviation from the band centre guides the surgeon toward the optimal working distance. We validate our method in ex vivo pig eye experiments, performing a simulated Vitreous Shaving (VS), showing improved safety and operator awareness.
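
The CBF-based quadratic program mentioned in this abstract can be illustrated with a minimal one-dimensional sketch. It is not the authors' controller: it assumes a single-integrator model (the tool-tip distance `d` above the retina changes at the commanded velocity `u`), a fixed band `[d_min, d_max]`, and a hypothetical gain `alpha`. Under those assumptions the quadratic program has a closed-form solution, which is a clip of the desired velocity.

```python
def cbf_safe_velocity(u_des, d, d_min, d_max, alpha=5.0):
    """Closed-form solution of a 1-D CBF quadratic program (illustrative only).

    Assumed model: d_dot = u, with d the tool-tip distance above the retina.
    Barriers:      h_low = d - d_min >= 0 and h_up = d_max - d >= 0.
    CBF bounds:    u >= -alpha * h_low and u <= alpha * h_up.
    Minimising (u - u_des)^2 under these two bounds reduces to a clip.
    """
    lower = -alpha * (d - d_min)
    upper = alpha * (d_max - d)
    return min(max(u_des, lower), upper)


def simulate(d0, u_des, d_min, d_max, dt=1e-3, steps=2000):
    """Integrate d_dot = u with explicit Euler while filtering a constant command."""
    d = d0
    trace = [d]
    for _ in range(steps):
        u = cbf_safe_velocity(u_des, d, d_min, d_max)
        d += dt * u
        trace.append(d)
    return trace


if __name__ == "__main__":
    # Hypothetical numbers: band of 0.5 to 1.5 mm, operator pushing down at 10 mm/s.
    trace = simulate(d0=1.0, u_des=-10.0, d_min=0.5, d_max=1.5)
    print("closest approach to the lower surface:", min(trace) - 0.5)
```

In this sketch a command that already respects both bounds passes through unchanged, so the filter only intervenes near the band limits.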

Abstract:
Pneumatic soft robots are well-suited for minimally invasive surgery owing to their compliance and safe interaction with tissues. However, achieving highly adaptive control is difficult owing to modeling inaccuracies, inter-chamber coupling, and disturbances from surgical instruments. Non-learning adaptive methods depend on simplified models and perform poorly in unstructured settings. Conversely, learning-based methods often impose high computational costs in multi-degree-of-freedom (multi-DoF) pneumatic systems. A previous study proposed a generative adversarial network (GAN)-based proportional-integral-derivative (G-PID) controller that combined PID stability with learning-based adaptability by aligning system behavior with a reference model. However, its performance in highly coupled multi-DoF pneumatic soft robots was unverified, and its online adversarial training was computationally intensive. We addressed these limitations by developing an offline-trained G-PID controller, shifting adversarial training offline to reduce computational overhead, achieving 23-fold faster convergence, and enabling real-time, model-free control with balanced adaptability and efficiency. We evaluated three multi-DoF data fusion strategies, showing effective coordination of DoF coupling while maintaining individual control fidelity. Validation on a multi-DoF soft robotic mechatronic system for single-port transvesical prostatectomy revealed tip errors below 0.16 mm across surgical instrument

Abstract:
Navigating Linked Multi-Component Robotic Systems (L-MCRS), robot pairs tethered by passive flexible hoses, through dynamic pedestrian environments is fundamentally harder than rigid multi-robot coordination, as the uncontrollable hose creates a variable-geometry collision footprint spanning 118 pairwise combinations. We propose H-SEPID, unifying zone-aware Hierarchical Reinforcement Learning grounded in Kinematic Flow Theory with VLM-guided cascaded optimization. A phase-aware dual attention value network performs C0-continuous topological policy switching, while a Vision-Language Model infers strategic intent and quantifies action-space constraints governing hose geometry. A seven-category safety shield with ORCA fallback and a threading reward band produce emergent gap-threading maneuvers. H-SEPID achieves a 94% success rate and a 4% collision rate in an 8-robot, 5-pedestrian, 4-hose scenario, outperforming five baselines, and is validated on real e-puck2 robots across 12 configurations.

Abstract:
This paper proposes a contact event-guided PPO with Behavior Cloning (PPO-BC) framework for stair locomotion of a 2-wheel 2-leg (2W2L) wheeled bipedal robot. Stair traversal is difficult because successful climbing depends on brief and sparse wheel-stair contact events that require precise leg lifting and posture stabilization. To address this issue, the proposed method trains a student policy using a combined objective of PPO-based reinforcement learning and behavior cloning from a pretrained frozen teacher policy. The teacher learns leg-centered climbing behaviors, while the student learns full 8-DoF control. A soft contact gate detects stair interaction directly from wheel contact forces and increases the BC contribution during critical contact phases without external terrain sensors. The method is validated under a minimal reward structure based on velocity tracking and postural stability, without stair-specific shaping rewards. Experiments in Isaac Lab simulation show that the proposed method outperforms both pure PPO and uniform PPO-BC in stair-crossing performance while maintaining stable locomotion after traversal.
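
The soft contact gate described in this abstract can be pictured with a short sketch. The abstract does not give the gate's functional form, so everything below is an assumption made for illustration: a logistic gate on a scalar wheel contact force, and hypothetical constants `threshold`, `sharpness`, `base`, and `boost`.

```python
import math


def contact_gate(force, threshold=20.0, sharpness=0.2):
    """Logistic gate in (0, 1); close to 1 when the contact force is well above threshold.

    The logistic form and both constants are hypothetical, not taken from the paper.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (force - threshold)))


def bc_weight(force, base=0.1, boost=0.9):
    """Behavior-cloning weight that rises from `base` toward `base + boost` during contact."""
    return base + boost * contact_gate(force)


def combined_loss(ppo_loss, bc_loss, force):
    """Combined student objective: PPO loss plus a contact-dependent BC term."""
    return ppo_loss + bc_weight(force) * bc_loss


if __name__ == "__main__":
    for f in (0.0, 20.0, 60.0):
        print(f, round(bc_weight(f), 3))
```

A uniform PPO-BC baseline corresponds to holding the BC weight constant, which is what the gate replaces in this sketch.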

Abstract:
The performance of robot actuators is still primarily evaluated using manufacturer-provided static specifications such as maximum torque and rated speed. However, these metrics are insufficient for assessing dynamic behaviors that are essential for physical interaction, including backdrivability, transparency, and disturbance response. This paper presents HYPERDYNE, a novel proof-of-concept test platform and evaluation framework for dynamic characterization and quantitative benchmarking of robot actuators. A reconfigurable testbed is developed that supports three test configurations (no-load, fixed-load, and interaction scenarios) within a single hardware setup. In addition, an evaluation protocol is proposed that includes system identification, control performance, load robustness, and disturbance rejection. Experimental validation on a QDD actuator demonstrates that the proposed framework enables the extraction of key dynamic parameters such as backlash, friction, inertia, and frequency response characteristics, while also providing performance indices for objective comparison. The results show that actuator performance can be quantitatively assessed beyond conventional static specifications, supporting the development of robots with improved physical interaction capabilities.

Abstract:
Virtual mass simulation is one of the recent topics in the field of haptic devices (HDs), which can alter the apparent mass of the HD. Simulating negative values of virtual mass leads to a decrease in the apparent effective mass, improving transparency but weakening stability. Positive virtual mass rendering increases the apparent mass, reduces transparency, and enhances stability. This paper analyzes the stability of a haptic device while simulating a virtual environment consisting of a mass, spring, and damper in the presence of a constant time delay. The results are closed-form equations that can predict the stability boundary for small and even large values of virtual damping and time delay. These closed-form equations demonstrate that the maximum renderable virtual mass is twice the physical mass of the HD, and the minimum value equals its negative; both occur in the case of zero time delay. Increasing the time delay reduces both the minimum and maximum values of the renderable virtual mass. The study also shows that using virtual mass can improve the maximum value of a renderable virtual spring. The equations show that, in the absence of delay, properly tuning the virtual mass and virtual damping can enlarge the maximum renderable stiffness by up to 5.8 times in theory. In the experiments under time delay, the stiffness increased by a factor of 3.5, compared to the theoretical prediction of 4.1 times. The results further reveal situations where a nonzero minimum stiffne

Abstract:
Control barrier functions (CBFs) have proven to be effective for obstacle avoidance in robot teleoperation systems. However, for classical CBFs, model uncertainties and external disturbances can significantly degrade the robustness of safety control. Moreover, a fixed safety boundary cannot adapt to dynamic switching of operational intentions. To address these limitations, this paper presents a hierarchical safety teleoperation framework that separates the safety layer from the leader-follower teleoperation layers. On this basis, a virtual proxy is introduced to construct a robust control-affine system decoupled from physical robot uncertainties and external disturbances. Building upon this, we propose an intention-aware adaptive control barrier function (IA-ACBF), which consists of two modules: intention detection and intention quantification. The intention detection module classifies the operator's transient intention as either object interaction or obstacle avoidance. The intention quantification module then maps this intention to the adaptation of safety boundaries. Finally, the performance of the proposed method is validated through simulations and experiments with the physical robot.

Abstract:
Robotic visual segmentation is essential for enabling robots to operate in complex environments. Although supervised methods have achieved remarkable progress, their dependence on dense annotations hinders scalability. Weakly supervised semantic segmentation (WSSS) alleviates this issue but suffers from sparse supervision, leading to noisy pseudo-labels and boundary errors. Large visual models (LVMs), pretrained on diverse data, provide rich semantic priors that can strengthen weak supervision and address these limitations. To this end, we designed a dual-branch architecture, introducing two large pre-trained models with complementary characteristics. We align the feature spaces of the two branches through consistency learning to alleviate the representation differences and weakly supervised noise problems caused by cross-domain migration, thereby obtaining more robust and fine-grained semantic features. Furthermore, to effectively restore spatial details and improve the quality of segmentation boundaries, we introduce a wavelet transform in the decoder. Wavelet decomposition can simultaneously capture low-frequency global information and high-frequency local details at multiple scales, allowing the model to enhance spatial restoration capabilities while maintaining semantic consistency. Experimental results show that our method improves the performance by 7.7% compared with the state-of-the-art methods in WSSS.
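
The statement that a wavelet decomposition separates low-frequency content from high-frequency detail at multiple scales can be checked with a small sketch. The abstract does not name the wavelet, so the orthonormal Haar wavelet is assumed here, applied to a plain list-of-lists image with even side lengths at every level. The band names (`rows`, `cols`, `diag`) are illustrative only.

```python
def haar2d_level(img):
    """One level of an orthonormal 2-D Haar transform (illustrative sketch).

    Returns (approx, rows, cols, diag): `approx` holds the low-frequency content,
    `rows` responds to changes between rows, `cols` to changes between columns,
    and `diag` to diagonal changes inside each 2x2 block.
    """
    height, width = len(img), len(img[0])
    assert height % 2 == 0 and width % 2 == 0, "side lengths must be even"
    approx, rows, cols, diag = [], [], [], []
    for i in range(0, height, 2):
        r_a, r_r, r_c, r_d = [], [], [], []
        for j in range(0, width, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_a.append((a + b + c + d) / 2.0)
            r_r.append((a + b - c - d) / 2.0)
            r_c.append((a - b + c - d) / 2.0)
            r_d.append((a - b - c + d) / 2.0)
        approx.append(r_a)
        rows.append(r_r)
        cols.append(r_c)
        diag.append(r_d)
    return approx, rows, cols, diag


def haar2d(img, levels):
    """Multi-level decomposition: repeatedly transform the approximation band."""
    approx, details = img, []
    for _ in range(levels):
        approx, r, c, d = haar2d_level(approx)
        details.append((r, c, d))
    return approx, details


def energy(band):
    """Sum of squared coefficients in one band."""
    return sum(v * v for row in band for v in row)


if __name__ == "__main__":
    image = [[0, 1, 1, 1] for _ in range(4)]  # vertical edge between columns 0 and 1
    approx, details = haar2d(image, levels=2)
    print("coarsest approximation:", approx)
    print("level-1 column-detail energy:", energy(details[0][1]))
```

Because the transform is orthonormal, the total energy of all bands equals the energy of the input, which is one way to see that no spatial detail is discarded by the decomposition itself.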

Abstract:
This paper presents an autonomous control framework for articulated boom cranes performing prefabricated block assembly in construction environments. The key challenge addressed is precise placement control under passive joint dynamics that cause pendulum-like sway, complicating the accurate positioning of building components. Our integrated approach combines real-time vision-based pose estimation of building blocks, collision-aware B-spline path planning, and nonlinear model predictive control (NMPC) to achieve autonomous pickup, placement, and obstacle-avoidance assembly operations. The framework is validated on a laboratory-scale testbed that emulates crane kinematics and passive dynamics while enabling rapid experimentation. The collision-aware planner generates feasible B-spline references in real-time on CPU hardware with anytime performance, while the NMPC controller actively suppresses passive joint sway and tracks the planned trajectory under continuous vision feedback. Experimental results demonstrate autonomous block stacking and obstacle-avoidance assembly, with sway damping reducing settling times by more than an order of magnitude compared to uncontrolled passive dynamics, confirming the real-time feasibility of the integrated approach for construction automation.

Abstract:
Traditionally, prediction and planning in autonomous driving (AD) have been treated as separate, sequential modules. Recently, there has been a growing shift towards tighter integration of these components, known as Integrated Prediction and Planning (IPP), with the aim of enabling more informed and adaptive decision-making. However, it remains unclear to what extent this integration actually improves planning performance. In this work, we investigate the role of prediction in IPP approaches, drawing on the widely adopted Val14 benchmark, which encompasses more common driving scenarios with relatively low interaction complexity, and the interPlan benchmark, which includes highly interactive and out-of-distribution driving situations. Our analysis reveals that even access to perfect future predictions does not lead to better planning outcomes, indicating that current IPP methods often fail to fully exploit future behavior information. This suggests that planning may not benefit as much from accurate prediction. Instead, we focus on high-quality proposal generation, while using predictions primarily for collision checks. We find that many imitation learning-based planners struggle to generate realistic and plausible proposals, performing worse than PDM, a simple lane-following approach. Motivated by this observation, we build on PDM with an enhanced proposal generation method, shifting the emphasis towards producing diverse but realistic and high-quality proposals. This proposal-centric approach significantly outperforms existing methods, especially in out-of-distribution and highly interactive settings, where it sets new state-of-the-art results.

Abstract:
Unmanned underwater vehicles are increasingly employed for maintenance and surveying tasks at sea, but their operation in shallow waters is often hindered by hydrodynamic disturbances such as waves, currents, and turbulence. These unsteady flows can induce rapid changes in direction and speed, compromising vehicle stability and manoeuvrability. Marine organisms contend with such conditions by combining proprioceptive feedback with flexible fins and tails to reject disturbances. Inspired by this strategy, we propose soft morphing wings endowed with proprioceptive sensing to mitigate environmental perturbations. The wings' continuous deformation provides a natural means to infer dynamic disturbances: sudden changes in camber directly reflect variations in the oncoming flow. By interpreting this proprioceptive signal, a disturbance observer can reconstruct flow parameters in real time. To enable this, we develop and experimentally validate a dynamic model of a hydraulically actuated soft wing with controllable camber. We then show that curvature-based sensing allows accurate estimation of disturbances in the angle of attack. Finally, we demonstrate that a controller leveraging these proprioceptive estimates can reject disturbances in the lift response of the soft wing. By combining proprioceptive sensing with a disturbance observer, this technique mirrors biological strategies and provides a pathway for soft underwater vehicles to maintain stability in hazardous environments.

Abstract:
This paper addresses the labor-intensive process of converting imprecise hand-drawn sketches into precise, parametric CAD sketches. We present Sketch2CAD, a novel deep learning framework that leverages generative adversarial networks (GANs) to automate this conversion. Our approach consists of two main stages: first, a sketch correction module transforms freehand sketches into clean, standardized CAD-like sketches; second, a semantic segmentation module parses the generated sketches to identify and classify geometric primitives (lines, circles, arcs, points). We further introduce an optimized post-processing algorithm that extracts parametric primitives and infers geometric constraints from the segmentation results, enabling direct integration with commercial CAD software. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in both primitive accuracy (94.56%) and constraint recognition. This work provides a robust solution that reduces manual effort in CAD drafting while maintaining engineering precision, particularly suitable for robotics applications requiring rapid prototyping and design iteration.

Abstract:
Unmanned aerial-aquatic vehicles (UAAVs) provide cross-domain adaptability and a broad field of view, while autonomous underwater vehicles (AUVs) support long-duration operations. This work integrates the two by developing a rapid underwater docking and releasing system. An autonomous clamping mechanism is designed to anchor UAAVs under varied landing attitudes, and a vision-tactile state perception algorithm based on decision-level dual-modal fusion is proposed to enable reliable underwater docking without requiring communication between the UAAV and AUV. Experimental results validate autonomous perception and reliable docking in fully underwater environments, achieving a docking time of 6 s and a landing gear recognition accuracy of 3 mm. The proposed framework offers an efficient solution for aerial-aquatic cooperation, advancing cross-domain robotic platforms for ocean monitoring, emergency response, and underwater exploration.

Abstract:
This paper extends a previously proposed fall prediction algorithm to a real-time (online) setting, with implementations in both hardware and simulation. The system is validated on the full-sized bipedal robot Digit, where the real-time version achieves performance comparable to the offline implementation while maintaining a zero false positive rate, an average lead time (defined as the difference between the true and predicted fall time) of 1.1 s (well above the required minimum of 0.2 s), and a maximum lead time error of just 0.03 s. It also achieves a high recovery rate of 0.97, demonstrating its effectiveness in real-world deployment. In addition to the real-time implementation, this work identifies key limitations of the original algorithm, particularly under omnidirectional faults, and introduces a fine-tuned strategy to improve robustness. The enhanced algorithm shows measurable improvements across all evaluated metrics, including a 0.05 reduction in average false positive rate and a 1.19 s decrease in the maximum error of the average predicted lead time.

Abstract:
For visually impaired individuals, assistive navigation systems play a crucial role in enabling independent mobility. However, long-horizon planning based on natural language (NL) instructions in complex indoor environments remains a significant challenge. Recent studies show the strong potential of Large Language Models (LLMs) in NL understanding and task-level planning. Yet, the inherent limitations of LLMs in mathematical reasoning and their susceptibility to hallucination hinder their reliability in low-level path planning. In this paper, we introduce an LLM-based indoor assistive navigation system that interprets NL instructions from visually impaired users for autonomous navigation. At its core is a novel planning agent that grounds instructions to the environment's topological map and generates optimal route plans. To avoid hallucination in geometric reasoning, the LLM handles only high-level semantic planning, while precise node-level paths are delegated to a classical graph search algorithm. We further implement a wearable assistive device that provides voice and vibrotactile feedback to deliver hands-free navigation. Offline evaluations and real-world experiments demonstrate that our system can reliably plan grounded routes and enable visually impaired users to autonomously complete long-horizon navigation tasks. An anonymous project page is available at https://lhp-ian.github.io.
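
The division of labour described in this abstract, with semantic waypoints chosen at a high level and node-level paths computed by classical graph search, can be sketched as follows. The abstract does not say which search algorithm is used, so Dijkstra's algorithm is assumed. The map `TOPO_MAP`, its node names, and its edge lengths are hypothetical, and the ordered waypoint list stands in for whatever the LLM would return.

```python
import heapq

# Hypothetical topological map: node -> {neighbour: edge length in metres}.
TOPO_MAP = {
    "entrance": {"hall": 5.0},
    "hall": {"entrance": 5.0, "elevator": 8.0, "stairs": 3.0},
    "stairs": {"hall": 3.0, "cafe": 12.0},
    "elevator": {"hall": 8.0, "cafe": 4.0, "restroom": 6.0},
    "cafe": {"stairs": 12.0, "elevator": 4.0},
    "restroom": {"elevator": 6.0},
}


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns (node list, total length) or (None, inf)."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == goal:
            break
        for nbr, length in graph.get(node, {}).items():
            nd = d + length
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if goal not in dist:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]


def plan_route(graph, start, waypoints):
    """Chain node-level paths through an ordered list of semantic waypoints."""
    route, total = [start], 0.0
    for wp in waypoints:
        leg, cost = shortest_path(graph, route[-1], wp)
        if leg is None:
            raise ValueError(f"no path to {wp!r}")
        route.extend(leg[1:])
        total += cost
    return route, total


if __name__ == "__main__":
    # The waypoint order would come from the high-level planner; it is fixed here.
    print(plan_route(TOPO_MAP, "entrance", ["restroom", "cafe"]))
```

Keeping the geometric computation in a deterministic search routine means that a route is either found on the known map or reported as unreachable.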

Abstract:
In the context of asset inspection, the size of the environment to be covered can be large. Mobile robotic systems are capable of acquiring more extensive data than static sensors, but the capacity of the robotic platform used can be limited by its autonomy and sensing capabilities. This is why multi-robot systems are interesting in such applications. However, scaling up to larger robot team sizes requires coordination among robots to be carried out efficiently. In this work, we investigate a hybrid coordination strategy with a team of micro-aerial vehicles, where a ground station centrally assigns tasks in real time to the robots, and the robots distributively coordinate their trajectories to carry out the coverage of a known asset. In particular, we perform the benchmarking and experimental validation of such a strategy. Several variants of the strategy are implemented by adapting existing state-of-the-art solutions to this context. Extensive simulation experiments are carried out in various environments to benchmark each variant and evaluate how their performance scales with the robot team size. The results show that the strategy scales well for larger robot teams, thanks to its efficient task generation process. Notably, despite its relatively simple but efficient task generation technique, it outperforms or is comparable to other methods employing more complex schemes (such as information gain or frontiers). Finally, we validated the proposed strategy with teams of up to three robots in physical experiments.

Abstract:
Electromyography (EMG) signals are widely used in assistive exoskeleton control for predicting human joint torque due to their ability to extract muscle activations before movement onset. The standard procedure for learning the EMG-to-torque model involves training the model on a training set of EMG-torque data, followed by validating the model on a separate test set. The comparison between models is generally undertaken on the test set. However, the analysis of model performance on the data outside the test set remains unaddressed. The lack of a guarantee for unseen data reduces the reliability of EMG-to-torque models in practical exoskeleton control. In this paper, we address this issue by proposing a bounded-generalization-error neural network (BGNN) for EMG-based torque prediction. Using gradient descent to train the network, we formulate at each training step a theoretical upper bound on the generalization error, reflecting the prediction error across the entire data distribution, including unseen data beyond the test set. The NN training is terminated when this upper bound reaches its minimum, thereby achieving the tightest guarantee on the generalization error. Experimental results on torque prediction demonstrated that, while ensuring such a bounded generalization error, our method still gave results comparable to those of classical models. The use of our BGNN in assistive exoskeleton control was also tested with 13 participants on a pick-and-place task with an upper limb exoskeleton. Experimental results on assistive control revealed that our method can reduce human physical fatigue without compromising movement speed or accuracy compared to natural human movement characteristics, particularly for generalization in novel tasks.

Abstract:
High-Intensity Focused Ultrasound (HIFU) is a non-invasive therapeutic technology enabling precise energy delivery for the selective ablation of tumors, while preserving surrounding healthy tissue. Currently, there is no gold standard for defining sonication parameters to cover a tumor surface and volume, as this decision relies solely on the physician's experience. This work proposes a novel planning algorithm for robotic HIFU procedures that ensures automatic and optimized tumor coverage. The approach relies on a predictive model that estimates the dimensions of HIFU-induced thermal lesions based on sonication parameters (source pressure amplitude and sonication time) and leverages genetic algorithms to compute single lesion placement over the treatment area. The optimization function primarily aims to maximize the surface coverage over a defined target area and then integrates motion planning algorithms. In addition to planar lesions, a volumetric ablation composed of a set of co-planar surface lesions was also evaluated. The method was experimentally validated on ex-vivo tissues through a robotic ultrasound-guided (USg) HIFU platform. This study bridges pre-operative lesion prediction and intra-operative robotic execution, supporting standardized and effective HIFU therapy.

Abstract:
Communication latency in long-distance telerobotic surgery poses a critical safety risk, particularly in high-precision procedures like retinal surgery where tool overshoots can cause irreversible patient injury. This paper introduces the Latency-Aware Telemonitoring for Injection in Ophthalmic Surgery (LATIOS) framework, which enhances safety by adaptively scaling the surgical robot's velocity along the critical axis of tool insertion. Our core contribution is a control algorithm that dynamically modulates the velocity scaling factor based on two real-time, coupled variables: the tool-tip-to-retina distance, estimated via a non-contact, shadow-based method, and the measured communication delay. We validated this system in a transatlantic user study where six participants in North America teleoperated a surgical robot in Europe to perform a series of simulated retinal puncture tasks. The results demonstrate that LATIOS provides a statistically significant reduction in applied puncture forces compared to constant control (p = 0.022). This objective safety improvement is achieved through a deliberate safety-efficiency trade-off, with the system enforcing a more cautious pace under high-latency conditions. Our work presents a robust, context-aware safety framework that addresses a key barrier to the clinical adoption of long-distance telerobotic surgery.

Abstract:
Reliable long-range flight of unmanned aerial vehicles (UAVs) in GNSS-denied environments is challenging: integrating odometry leads to drift, loop closures are unavailable in previously unseen areas, and embedded platforms provide limited computational power. We present a fully onboard UAV system developed for the SPRIN-D Funke Fully Autonomous Flight Challenge, which required 9 km long-range waypoint navigation below 25 m AGL (Above Ground Level) without GNSS or prior dense mapping. The system integrates perception, mapping, planning, and control with a lightweight drift-correction method that matches LiDAR-derived local heightmaps to a prior geodata heightmap via gradient-template matching and fuses the evidence with odometry in a clustered particle filter. Deployed during the competition, the system executed kilometer-scale flights across urban, forest, and open-field terrain and reduced drift substantially relative to raw odometry, while running in real time on CPU-only hardware. We describe the system architecture, the localization pipeline, and the competition evaluation, and we report practical insights from field deployment that inform the design of GNSS-denied UAV autonomy.

Abstract:
Unmanned aerial vehicles (UAVs) are increasingly used for infrastructure inspection, but conventional joystick and first-person-view (FPV) controllers remain unintuitive, error-prone, and cognitively demanding, particularly in cluttered or safety-critical environments. We present 3DGS-Holo-Inspector, a Mixed Reality (MR) UAV controller that combines holographic goal-setting with autonomous UAV navigation. Using natural hand gestures, operators can define and preview navigation goals directly in MR before flight, ensuring precise and safe data capture at inspection viewpoints. The system complements existing inspection pipelines by leveraging pre-built 3D maps (e.g., photogrammetry or LiDAR reconstructions) to enable refinement of regions of interest (ROIs) where coverage is incomplete or the detail is insufficient. Robust headset-UAV alignment is achieved through a LiDAR-RGB 3D Gaussian Splatting (3DGS) localization backbone, which provides dense, markerless, and persistent spatial registration in both indoor and outdoor settings. Once goals are placed, the UAV autonomously navigates to the specified pose, with real-time telemetry and live video overlaid in MR to enhance situational awareness. Experimental validation using a ModalAI Starling UAV and Microsoft HoloLens 2 demonstrated accurate UAV-goal alignment, achieving a positional Root Mean Square Error (RMSE) of 0.090 m (median = 0.084 m) indoors and 0.119 m (median = 0.118 m) outdoors, with orientation (yaw) RMSEs of 1.491° (median = 1.400°) and 2.233° (median = 2.268°), respectively. These results confirm that 3DGS-Holo-Inspector provides reliable MR-based UAV control, augmenting inspection workflows by enabling safe, intuitive, and high-precision UAV operations in real-world environments.

Abstract:
Designing reliable upper-limb human-machine interfaces (HMIs) with low attentional demand requires strengthening the proprioceptive-motor pathway (PMP). We propose a closed-loop multimodal sensory training that maps three robot joint angles to six bidirectional electrotactile channels and combines visual fading with degrees-of-freedom (DoF) progression to shift reliance from vision to tactile-proprioceptive guidance. The objective is low-load automaticity for supplemental cues and improved native-limb fine motor control. Twenty right-handed adults completed a six-day protocol. Using synchronized kinematics and EEG, we evaluated electrotactile-driven tasks: eyes-closed continuous tracking and static posture reproduction, dual-task posture reproduction with serial subtraction, reversed-mapping generalization, and a proprioceptively constrained maze. Training produced robust gains under tactile-proprioceptive dominance: errors decreased (~30%) and response time shortened. Under dual-task load, posture error and response time decreased while correct subtractions increased and mistakes decreased, supporting low-load automaticity of electrotactile decoding. Although group-level β-event-related desynchronization (ERD) changes were not significant, contralateral ERD reductions and post-movement beta rebound (PMBR) enhancements during tactile decoding were consistent with reduced cortical effort and emerging automatic control. Performance generalized to reversed mapping, and maze completion time decreased significantly, evidencing improved fine motor control. These findings show that closed-loop vision-tactile-proprioceptive integration offers a compact, reproducible route to PMP enhancement, enabling low-load automaticity and finer control, with actionable design targets for prosthetics, exoskeleton rehabilitation, and vision-limited teleoperation.

Abstract:
Navigating unknown environments with a single RGB camera is challenging, as the lack of depth information prevents reliable collision-checking. While some methods use estimated depth to build collision maps, we found that depth estimates from vision foundation models are too noisy for zero-shot navigation in cluttered environments. We propose an alternative approach: instead of using noisy estimated depth for direct collision-checking, we use it as a rich context input to a learned collision model. This model predicts the distribution of minimum obstacle clearance that the robot can expect for a given control sequence. At inference, these predictions inform a risk-aware MPC planner that minimizes estimated collision risk. We propose a joint learning pipeline that co-trains the collision model and risk metric using both safe and unsafe trajectories. Crucially, our joint training ensures well-calibrated uncertainty in our collision model, which improves navigation in highly cluttered environments. Consequently, real-world experiments show reductions in collision rate and improvements in goal reaching and speed over several strong baselines.

Abstract:
This work introduces a novel compact 7-degree-of-freedom (7-DOF) microsurgical robot with position-orientation decoupling capacity for microvascular anastomosis. The proposed system employs a modular architecture combining a proximal displacement platform for 3D small-stroke translation and a distal compact remote center of motion (RCM) mechanism for wide-range orientation adjustment. This design meets the workspace requirements of microvascular anastomosis, which demands extensive orientation adjustments with minimal positional movement, while reducing the system footprint. A parasitic-motion reverse self-compensation method has been developed for motorized surgical instruments, effectively reducing operational resistance to improve precision. Theoretical analysis has been performed on both the RCM mechanism and motorized surgical instruments, and kinematics-based parameter optimization and data-driven calibration have been conducted to enhance performance. A prototype has been constructed, and its experimental validation demonstrated that the system achieved repeatability of 11.24 ± 2.31 μm (XY) and 12.46 ± 4.48 μm (YZ), and absolute positioning accuracy of 29.80 ± 12.27 μm (XY) and 37.02 ± 19.47 μm (YZ), meeting super-microsurgical requirements. Experiments that include needle-threading and stamen peeling tasks demonstrate the robot's superior dexterity and manipulation capabilities.

Abstract:
Lidar technology has been widely employed across various applications, such as robot localization in GNSS-denied environments and 3D reconstruction. Recent advancements have introduced different lidar types, including cost-effective solid-state lidars such as the Livox Avia and Mid-360. The Mid-360, with its dome-like design, is increasingly used in portable mapping and unmanned aerial vehicle (UAV) applications due to its low cost, compact size, and reliable performance. However, the lack of datasets that include dome-shaped lidars, such as the Mid-360, alongside other solid-state and spinning lidars, significantly hinders the comparative evaluation of novel approaches across platforms. Additionally, performance differences between low-cost solid-state and high-end spinning lidars (e.g., Ouster OS series) remain insufficiently examined, particularly without an Inertial Measurement Unit (IMU) in odometry. To address this gap, we introduce a novel dataset comprising data from multiple lidar types, including the low-cost Livox Avia and the dome-shaped Mid-360, as well as high-end spinning lidars such as the Ouster series. Notably, to the best of our knowledge, no existing dataset comprehensively includes dome-shaped lidars such as Mid-360 alongside both other solid-state and spinning lidars. In addition to the dataset, we provide a benchmark evaluation of state-of-the-art SLAM algorithms applied to this diverse sensor data. Furthermore, we present a quantitative analysis of point cloud registration techniques, specifically point-to-point, point-to-plane, and hybrid methods, using indoor and outdoor data collected from the included lidar systems. The outcomes of this study establish a foundational reference for future research in SLAM and 3D reconstruction across hete
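
For readers unfamiliar with the registration error metrics named above, the following is a minimal sketch of the two standard ICP residuals, point-to-point and point-to-plane. The abstract does not define its metrics or the hybrid variant, so the function names and the example values below are illustrative assumptions, not the authors' implementation.

```python
def point_to_point_residual(p, q):
    """Euclidean distance between a source point p and its matched target point q."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5


def point_to_plane_residual(p, q, n):
    """Distance from p to the plane through q with unit normal n.

    Only the offset along the normal is penalized, so sliding along
    the local surface costs nothing.
    """
    return abs(sum((pi - qi) * ni for pi, qi, ni in zip(p, q, n)))


# Hypothetical example: the target point lies on a horizontal surface.
p = (1.0, 2.0, 0.5)   # source point
q = (0.0, 0.0, 0.0)   # matched target point
n = (0.0, 0.0, 1.0)   # unit surface normal at q

d_point = point_to_point_residual(p, q)      # full 3D offset
d_plane = point_to_plane_residual(p, q, n)   # offset along the normal only
```

In this example the point-to-plane residual ignores the in-plane offset, which is one reason the two metrics can behave differently on planar indoor scenes and on less structured outdoor data.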

Abstract:
Motion planning for large-scale robotic swarms presents significant challenges in terms of scalability and safety assurance in cluttered environments. To address these issues, this manuscript proposes a Closed-loop hierarchical Risk-aware swarm mOtion planner using Conditional ValuE at Risk (C-ROVER) that enables safe and efficient navigation for swarm robotic systems. The hierarchical structure of C-ROVER comprises a macroscopic planning stage that models the swarm state with Gaussian Mixture Models (GMMs) and generates trajectories for the swarm GMM, followed by a microscopic control stage that computes individual robot control using distributed model predictive control to track the GMM trajectories while achieving robot-level collision avoidance. Robot positions are periodically used to update the swarm GMM, closing the hierarchical planning and control loop. To achieve collision risk awareness between the swarm and environmental obstacles at the macroscopic stage, C-ROVER leverages the stochastic Signed Distance Function to characterize the distance between the swarm GMM and obstacles, which is proven to follow a GMM. Then C-ROVER proposes an analytical expression of the Conditional Value-at-Risk (CVaR) of a GMM to enable swarm collision risk mitigation. Furthermore, C-ROVER designs a novel risk-aware space discretization approach to enhance the ability to navigate constrained spaces.
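
To illustrate what the CVaR risk measure captures, here is a minimal sample-based sketch. It is not the analytical GMM expression described in the abstract: it simply averages the worst tail of a set of loss samples. The function name `empirical_cvar`, the convention that loss is the negated signed distance, and the sample values are all assumptions made for illustration.

```python
import math


def empirical_cvar(losses, alpha):
    """Mean of the worst (1 - alpha) fraction of the losses (larger loss = worse)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)
    # Round before ceil so floating-point noise such as 3.0000000000000004
    # does not add a spurious extra sample to the tail.
    tail = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:tail]) / tail


# Hypothetical signed distances (metres) between swarm samples and an obstacle.
distances = [0.5, 1.0, 2.0, 0.1, 3.0]
losses = [-d for d in distances]          # small clearance means large loss
risk = empirical_cvar(losses, alpha=0.6)  # mean of the two worst losses
```

Here the two smallest clearances are 0.1 m and 0.5 m, so the CVaR of the loss at level 0.6 is about -0.3, meaning the expected clearance in the worst 40% of cases is roughly 0.3 m. Unlike a plain chance constraint, the measure reflects how bad the tail is, not only how likely it is.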

Abstract:
Redundant robots, with more degrees of freedom than required for a given task, offer enhanced dexterity but can exhibit complex kinematic behaviour in motion planning. Cuspidal robots, which can change inverse kinematic solutions without crossing singularities, have been reported to pose unique challenges for motion feasibility and repeatability. While cuspidality has been extensively studied for 3R and certain 6R robots, no formal classification exists for redundant architectures. This paper presents a systematic framework for classifying 7R wrist-partitioned redundant robots based on their cuspidal properties. The method reduces the 7R structure to a parameterized 3R equivalent via the redundant joint angle, enabling the application of established theory for cuspidal robots. Using this approach, commercially available robots are analysed and categorized as cuspidal or non-cuspidal. Results show that the design offsets in commercial cobots may lead to cuspidality, which can potentially cause a nonsingular change of operation mode in collaborative applications. This classification framework provides a foundation for cuspidality-aware path planning and offers practical guidelines for designing non-cuspidal redundant robots to ensure safer and more predictable operation.

Abstract:
The development of guide dog robots is expected to enhance the mobility and safety of visually impaired individuals outdoors. To assist these users in real-world navigation, walking guidance should be useful, comprehensive, and concise so that instructions are both actionable and easy to follow. While recent VLMs show promising capabilities in scene understanding, existing approaches do not address the effective delivery of guidance for visually impaired users. In this work, we propose SA-VLMv2 (Space-Aware VLM), a model designed to generate useful, comprehensive, and concise walking guidance based on ego-centric scenes and target destinations. To this end, we first derived four canonical templates for walking guidance through user evaluation with professional guide dog trainers across diverse images, providing insights into preferred guidance formats. We then collected, manually annotated, and curated a dataset of 19,945 samples aligned with these templates and trained SA-VLMv2 from the open-source VLM Qwen2.5VL. Experimental results show that SA-VLMv2 outperforms state-of-the-art proprietary MLLMs (Claude 3.5 Sonnet, Gemini 2.5, GPT-4o) and the open-source pretrained VLM (Qwen2.5VL) in both holistic and factor-wise evaluations. SA-VLMv2 generated more concise yet informative guidance while achieving higher scores across multiple evaluation factors.

Abstract:
Vision-based locomotion in outdoor environments presents significant challenges for quadruped robots. Accurate environmental prediction and effective handling of depth sensor noise during real-world deployment remain difficult, severely restricting the outdoor applications of such algorithms. To address these deployment challenges in vision-based motion control, this letter proposes the Redundant Estimator Network (RENet) framework. The framework employs a dual-estimator architecture that ensures robust motion performance while maintaining deployment stability during onboard vision failures. Through online estimator adaptation, our method enables seamless transitions between estimation modules when handling visual perception uncertainties. Real-world robot experiments demonstrate the framework's effectiveness in complex outdoor environments, showing particular advantages in scenarios with degraded visual perception. This framework demonstrates its potential as a practical solution for reliable robotic deployment in challenging field conditions. Project website: https://RENet-Loco.github.io/

Abstract:
Payload-adaptive locomotion is an essential capability for quadruped robots operating in real-world scenarios, particularly when tasked with transporting dynamic payloads. Existing approaches face fundamental limitations: reactive adaptation strategies respond too slowly to sudden payload changes, while learning-based methods often yield physically inconsistent models of robot dynamics that generalize poorly to novel states. To address these key challenges, we introduce Forward-Looking Adaptation to Dynamic Payloads (FLAP), a novel approach that learns to proactively compensate for discrepancies between expected and actual locomotion behavior induced by dynamic payloads. FLAP combines two critical components: (1) a physics-informed neural network (PINN) that predicts anticipated joint states while enforcing physical consistency through dynamics-based loss functions, and (2) a composite adaptive control law that rapidly generates anticipatory joint torque compensations based on the PINN's predictions. By unifying structured dynamics modeling with real-time anticipatory control, our method enables generalizable and physically consistent adaptation to dynamic payloads. Experimental results demonstrate that FLAP achieves robust locomotion under diverse payload conditions on physical quadruped robots in real-world environments.

Abstract:
This paper addresses the challenging problem of enabling a mobile manipulator with an eye-in-hand camera to track dynamic targets with time-varying positions and orientations in an unbounded workspace. Specifically, we propose an optimization-based whole-body control framework for dynamic target tracking. The framework enables the mobile manipulator to maintain the target within the camera's field of view while reaching the desired pose, by dynamically regulating the priorities of the optimization constraints and objectives according to the task execution state. Moreover, we present an adaptive-predictive position-based visual servoing strategy to generate the Cartesian references sent to the controller. To enhance the tracking performance, we introduce (1) adaptive gains to avoid abrupt motions and the resulting vibrations while preserving final precision, and (2) dynamic addition of a feedforward term incorporating a velocity estimate of the target obtained with a Kalman filter. The proposed approach is validated on a real robotic setup and compared against a state-of-the-art approach, demonstrating superior performance in dynamic target tracking.

Abstract:
Existing point cloud semantic segmentation models are usually trained and evaluated using data collected under clear weather conditions. Under adverse weather conditions such as rain, snow, and fog, point clouds are usually distorted and the performance of existing models degrades significantly. Many domain adaptation methods try to address this issue by simulating adverse weather or using data augmentation techniques during training. However, they cannot accurately model the actual distortion in the target domain. By analyzing the visualization and statistical information of the target domain data and referring to existing studies, we categorize the distortion of point cloud data into position distortion, intensity distortion, and quantity distortion. To address these distortions of the target domain data, we propose a Point Distortion Learning Network (PDLNet), which integrates a Point Distortion Learning (PDL) module to learn the feature distortion of target domain data caused by adverse weather. Moreover, we integrate a Cross-domain Feature Association (CFA) module to help the model learn domain-invariant feature representations and improve its adaptability to the target domain. In addition, PDLNet introduces a Point Semantic Knowledge Distillation (PSKD) module, which ensures that only the target domain data is used efficiently in the inference phase while preserving the learned cross-domain knowledge. To further improve performance, we also iteratively optimize the model by introducing a curriculum learning module. Our approach establishes a new state of the art, achieving 40.6% mIoU and 27.7% mIoU on the SemanticKITTI-to-SemanticSTF and SynLiDAR-to-SemanticSTF benchmarks, respectively. Source code will be released at https://github.com/JerryD233/PDLNet.

Abstract:
Continuum robots possess intrinsic compliance, high flexibility, and continuously deformable structures, making them well-suited for safe human-robot interaction (HRI). However, their continuous backbone and high degrees of freedom pose significant challenges for real-time trajectory generation: motions must satisfy curvature constraints while adapting to uncertain and rapidly changing human inputs. Existing methods can generate smooth and feasible paths, but many are computationally intensive, neglect curvature continuity or mechanical constraints, or lack adaptability to dynamic environments. As a result, producing smooth, feasible, and responsive trajectories for continuum robots in interactive scenarios remains challenging. To address this, we propose a real-time trajectory optimization framework that integrates temporally filtered, vision-based human intention signals with curvature-constrained planning. Human hand motions are converted into stable reference signals, which guide a sliding-window sequential quadratic programming (SQP) optimizer. The planner continuously generates smooth and feasible trajectories that adapt in real time to evolving inputs. Simulations and hardware experiments demonstrate accurate tracking, robustness to noise, and timely adaptation, highlighting the framework's potential to enable safe and natural human-continuum robot collaboration in real-world applications.

Abstract:
Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive and reliable teleoperation of mobile robots and Unmanned Aerial Vehicles (UAVs) is essential. In this context, hands-free teleoperation enhances operator mobility and situational awareness, thereby improving safety in hazardous environments. While vision-based gesture recognition has been explored as one method for hands-free teleoperation, its performance often deteriorates under occlusions, lighting variations, and cluttered backgrounds, limiting its applicability in real-world operations. To overcome these limitations, we propose a multimodal gesture recognition framework that integrates inertial data (accelerometer, gyroscope, and orientation) from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We design a late fusion strategy based on the log-likelihood ratio (LLR), which not only enhances recognition performance but also provides interpretability by quantifying modality-specific contributions. To support this research, we introduce a new dataset of 20 distinct gestures inspired by aircraft marshalling signals, comprising synchronized RGB video, IMU, and capacitive sensor data. Experimental results demonstrate that our framework achieves performance comparable to a state-of-the-art vision-based baseline while significantly reducing computational cost, model size, and training time, making it well suited for real-time robot control. We therefore underscore the potential of sensor-based multimodal fusion as a robust and interpretable solution for gesture-driven mobile robot and drone teleoperation.
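
The late-fusion idea above can be sketched as follows. The abstract does not give the exact LLR formulation, so this sketch assumes each modality outputs per-class posterior probabilities, converts them to one-vs-rest log-odds, and sums them across modalities as if the modalities were independent. The function names, modality names, and probabilities are hypothetical.

```python
import math


def llr(p, eps=1e-9):
    """One-vs-rest log-likelihood ratio (log-odds) of a class probability."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return math.log(p / (1.0 - p))


def fuse_llr(modality_probs):
    """Sum per-class LLRs over modalities.

    modality_probs maps a modality name to its list of class probabilities.
    Returns the winning class index, the fused scores, and the per-modality
    LLR terms, which show how much each modality argued for each class.
    """
    names = list(modality_probs)
    num_classes = len(modality_probs[names[0]])
    contributions = {m: [llr(p) for p in modality_probs[m]] for m in names}
    fused = [sum(contributions[m][c] for m in names) for c in range(num_classes)]
    best = max(range(num_classes), key=fused.__getitem__)
    return best, fused, contributions


# Hypothetical outputs of two single-modality classifiers for three gestures.
probs = {
    "imu": [0.7, 0.2, 0.1],
    "capacitive": [0.4, 0.5, 0.1],
}
best, fused, contributions = fuse_llr(probs)
```

In this example the IMU term for gesture 0 is positive while the capacitive term is slightly negative, and the fused score still selects gesture 0. Inspecting `contributions` is one simple way to read off modality-specific influence on a decision.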

Abstract:
Forestry cranes operate in dynamic, unstructured outdoor environments where simultaneous collision avoidance and payload sway control are critical for safe navigation. Existing approaches address these challenges separately, either focusing on sway damping with predefined collision-free paths or performing collision avoidance only at the global planning level. We present the first collision-free, sway-damping model predictive controller (MPC) for a forestry crane that unifies both objectives in a single control framework. Our approach integrates LiDAR-based environment mapping directly into the MPC using online Euclidean distance fields (EDF), enabling real-time environmental adaptation. The controller simultaneously enforces collision constraints while damping payload sway, allowing it to (i) replan upon quasi-static environmental changes, (ii) maintain collision-free operation under disturbances, and (iii) provide safe stopping when no bypass exists. Experimental validation on a real forestry crane demonstrates effective sway damping and successful obstacle avoidance.

Abstract:
Training generalizable robotic agents requires large datasets of diverse, physically consistent 3D scenes, yet their generation remains a critical bottleneck. Current text-to-scene methods are inefficient for this task; generating diverse layouts requires repeated, expensive sampling that fails to maintain the semantic consistency required for robust policy learning. In this paper, we address this with our one-plan-to-many-layouts method. A Large Language Model (LLM) generates a single declarative plan, which a force-directed physics simulation then realizes into multiple layouts that share semantics but differ in geometry. We validate our method by transfer to photorealistic 3D reconstructions of real environments (Replica) within simulation, where a navigation agent trained on our scenes attains a Success Rate of 0.84. These results establish our pipeline as a scalable method for producing the controlled, diverse data required for embodied AI training.

Abstract:
Dexterous in-hand telemanipulation demands precise control and realistic haptic feedback to achieve stable and intuitive human-robot interaction. Existing systems often emphasize isolated control policies or unidirectional force feedback, limiting performance in tasks that require coordinated bidirectional information flow. In this work, we introduce Bi-Hap, a bi-directional learning-based control and momentum-based haptic feedback system for real-time, in-hand telemanipulation. On the control side, Bi-Hap leverages an inertial measurement unit to capture operator motion and drives a deep reinforcement learning policy that enables robust and adaptive manipulation of objects with fine rotational dexterity. On the feedback side, a compact, palm-sized momentum-actuated mechanism delivers torque and vibration cues directly to the operator, augmented by an error-adaptive strategy that modulates feedback intensity based on task states. When integrated, this closed-loop design establishes an immersive bidirectional control-feedback framework. Experimental results show that Bi-Hap achieves low feedback latency (<0.03 s), high torque fidelity (RMSE <0.01 N·m), and significantly improved telemanipulation performance by elevating manipulation accuracy, responsiveness, and operator situational awareness in diverse task settings.

Abstract:
In humans, the acquisition of a new motor skill is associated with the development of a wide range of cognitive areas and can create contexts in which new cognitive capacities develop. Motor development is linked to language development in infants, as crawling and walking promote active exploration of the environment, while manipulating objects and pointing draw the caregiver's attention and help establish joint attention. Together, these motor experiences broaden communication contexts and support the learning of nouns (object-based words) and verbs (action-based words). However, many questions remain unanswered about how children's actions influence language development, qualitatively and quantitatively, and how they help the acquisition of different types of words, particularly the learning of verbs. In this paper, we propose a robot architecture to study how gestures can affect early language learning. The architecture follows the developmental robotics paradigm, i.e., it is inspired by the way human children develop and acquire language according to multiple developmental theories. The experimental results demonstrate that enabling the robot to produce gestures expands its vocabulary size and facilitates the acquisition of verbs. These results are in line with the finding that verb learning lags behind noun learning, since the acquisition of verbs depends more on motor abilities and requires the maturation of motor development.

Abstract:
Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans in which only one arm moves at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning & scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that is a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning & Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at https://schedulestream.github.io.

Abstract:
Existing stretch and touch sensors for robots are generally expensive with respect to at least one of material costs, required manufacturing equipment, or manufacturing time. We present and experimentally characterize a conductive fiber made using only inexpensive commercial off-the-shelf parts (conductive thread at $0.07/ft, silicone tubing at $0.94/ft) and tools (loop-style needle threader at $2), which can be manufactured quickly (20 cm length in 2 minutes). We demonstrate its use as a resistive strain sensor with three applications: triggering a grasp in a pneumatically actuated assistive finger, sensing the pose of a pneumatically actuated robotic strap, and estimating the pose of a flexible solid. We also demonstrate that it can be used as a capacitive sensor with two applications: first, as a touch sensor which triggers a commercial robot arm to move, and second, as a near-field sensor enabling the robot arm to follow a moving hand. The capacitive sensors are knitted, showcasing the high flexibility of the fiber. We discuss methods for improving manufacturing scalability and their cost trade-offs. Finally, we demonstrate a method for repairing a cut fiber.

Abstract:
Robotic surgery has revolutionized minimally invasive procedures by offering enhanced precision, dexterity, and patient outcomes. However, the training and operational paradigms in robotic surgery have not evolved in parallel. Current apprenticeship models fall short in this domain, as robotic surgery isolates the primary surgeon in a teleoperated control loop, limiting opportunities for hands-on learning by trainees. To address this, we present the first implementation of a multilateral controller on a da Vinci Research Kit (dVRK), enabled by a four-channel teleoperation architecture and learning-based force estimation on a dual-console setup. This framework allows an expert and novice to share motion and force authority on the patient-side robots through an adjustable dominance factor. We validated the system in three experiments. In transparency tests, the architecture achieved sub-millimeter position tracking errors (PTE ≤ 0.2 mm) and force tracking errors (FTE) ≤ 1 N. In a palpation pilot user study (N=10) with tumor-tissue phantoms, participants identified stiffer regions, without visual feedback, with 83% accuracy in single-user mode (alpha = 1) and 74% accuracy in dual-user shared mode (alpha = 0.5). In a suturing force control pilot user study (N=10), novices significantly reduced force error and increased time within the safe range after expert-guided training, with no suture breakage observed post-training. These results on a dual-console dVRK setup demonstrate the feasibility of expert-in-the-loop training with real-time haptic guidance, positioning multilateral teleoperation as a promising approach for surgical skill transfer.

Abstract:
Legged robots with egocentric forward-facing depth cameras can couple exteroception and proprioception to achieve robust forward agility on complex terrain. When these robots walk backward, the forward-only field of view provides no preview. Purely proprioceptive controllers can remain stable on moderate ground when moving backward but cannot fully exploit the robot's capabilities on complex terrain and must collide with obstacles. We present Look Forward to Walk Backward (LF2WB), an efficient terrain-memory locomotion framework that uses forward egocentric depth and proprioception to write a compact associative memory during forward motion and to retrieve it for collision-free backward locomotion without rearward vision. The memory backbone employs a delta-rule selective update that softly removes then writes the memory state along the active subspace. Training uses hardware-efficient parallel computation, and deployment runs recurrent, constant-time per-step inference with a constant-size state, making the approach suitable for onboard processors on low-cost robots. Experiments in both simulations and real-world scenarios demonstrate the effectiveness of our method, improving backward agility across complex terrains under limited sensing.
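
The "remove then write" behaviour of a delta-rule associative memory can be sketched as below. This is the generic delta rule known from fast-weight and linear-attention models, S ← S − β (S k − v) kᵀ, and is only assumed to resemble the memory backbone described above. The key/value encodings, the gate β, and all names are hypothetical.

```python
def read(S, k):
    """Retrieve the value associated with key k: the matrix-vector product S k."""
    return [sum(s * kj for s, kj in zip(row, k)) for row in S]


def delta_rule_write(S, k, v, beta):
    """Return the updated memory S - beta * (S k - v) k^T.

    The term beta * (S k) k^T removes what the memory currently stores along
    key k, and beta * v k^T writes the new value v in its place.
    """
    pred = read(S, k)
    return [
        [s - beta * (p - vi) * kj for s, kj in zip(row, k)]
        for row, p, vi in zip(S, pred, v)
    ]


# Constant-size 2x2 memory, starting empty.
S = [[0.0, 0.0], [0.0, 0.0]]
k1, k2 = [1.0, 0.0], [0.0, 1.0]                  # orthonormal keys (assumed)

S = delta_rule_write(S, k1, [2.0, 3.0], 1.0)     # store a first value under k1
S = delta_rule_write(S, k1, [5.0, -1.0], 1.0)    # overwrite it along k1 only
S = delta_rule_write(S, k2, [7.0, 7.0], 1.0)     # write under an unrelated key
recalled = read(S, k1)
```

With unit-norm keys and β = 1, the second write fully replaces the old value stored under `k1`, and the write under the orthogonal key `k2` leaves it untouched. A β between 0 and 1 gives a soft, partial replacement. The state stays the same size however many steps are written, which matches the constant-size, constant-time inference described above.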

Abstract:
Knots provide compact, lightweight, and mechanically stable configurations that are invaluable for aerial transportation and construction. However, autonomous knot formation in midair remains an open challenge due to the dexterity and complexity of manipulating flexible cables. In this paper, we present a method for midair knot formation that employs two types of aerial robots: lifting robots, which hold the cable endpoints, and support robots, which stabilize intermediate spans to enable interlacing. Our approach focuses on minimizing the number of support robots required while ensuring that the knot's topology is preserved. Our method proceeds in three stages: (i) encode the knot projection as a grid of directional segments and crossings, (ii) apply our Loop Consistency Filter (LCF) to identify the minimal set of support robots required to preserve topology, and (iii) reconstruct continuous Cartesian trajectories using a cable model governed by a spring-damper force and a straightening force. Our results show a reduction of at least fifty percent in the number of robots required to form a knot compared to the baseline grid-based method. We demonstrate that our method is effective on actual robots, enabling the formation of knots with multiple quadrotors.

Abstract:
Spatiotemporal planning is critically important in fields like robotics, logistics, and naval operations, especially for problem specifications involving multiple constraints. Traditional approaches place the burden on end users to manually specify cost functions, constraints, or model parameters, a time-consuming and laborious process often resulting in less-than-ideal plans. We present a novel architecture integrating an LLM-based natural language interface with MILP scheduling and A* motion planning for multi-constraint spatiotemporal planning. We validate our LLM-planning approach through a within-subjects user study using a simulated maritime route-planning domain against manual control, and against autonomous planning with classical template-based constraint specification. Results showed our LLM-planning approach not only improved usability and reduced workload over alternative input modalities but also maintained the path optimality of traditional constraint specification interfaces while decreasing planning time. These findings demonstrate that bridging LLM-powered interfaces with robust schedulers and motion planners can enhance human-autonomy interaction in complex planning tasks, potentially making advanced spatiotemporal planning tools more practical for a broader range of users.

Abstract:
We developed a microscopic cell/tissue extraction device that employs a translational/rotational piezoelectric impact drive mechanism (Piezo IDM). To correlate gene expression with localized tissue samples at the micrometer scale, the system inserts a knife-edged glass capillary driven by the Piezo IDM and extracts the cells/tissues. The hybrid use of translational and rotational impact motion significantly improved suction performance, resulting in the reliable acquisition of small, localized cells and tissues that were previously difficult to isolate. To characterize the motion of the Piezo IDM, its amplitude and frequency dependence were measured and compared with a simulation model. In addition, we found that a synchronous chopping motion could generate the rotational motion efficiently. For automation, a specialized controller was developed to produce bidirectional motion. Experimental demonstrations were performed on both an artificial gel sample and a practical mouse cranial window (CW). The results for the gel sample clearly showed the effectiveness of hybridizing the translational and rotational motion of the Piezo IDM for cell/tissue extraction. The practical demonstration of neutrophil extraction in thrombus-induced mice also showed the potential for accurate tissue extraction from the in-vivo environment.

Abstract:
Stable autonomous driving in unstructured off-road environments remains a longstanding challenge. In the absence of structured roads and in the presence of uneven terrain, vegetation, and soil slopes, vehicles must rely on LiDAR-camera fusion to identify stable and traversable roads. However, existing terrain perception methods largely remain at the level of semantic segmentation and struggle to capture physical attributes such as surface roughness and load-bearing capacity. Meanwhile, constructing datasets annotated with accurate physical properties is prohibitively costly and inherently limited in class diversity, making it difficult to cover unseen terrains. To address these limitations, we propose an online ground bumpiness cost learning framework for off-road vehicles, which enables continuous and direct learning of terrain-specific bumpiness costs during operation without the need for manual annotation. The framework consists of four key components: (i) ground bumpiness cost computation, (ii) a lightweight multimodal terrain segmentation model, (iii) an instance-level incremental update strategy, and (iv) a bumpiness cost mapping module. Extensive experiments on the EV-56 vibroseis truck demonstrate that the proposed framework can finely discriminate terrains with varying bumpiness costs and incrementally estimate costs for previously unseen terrains, thereby providing strong support for safe and reliable off-road autonomous driving.

Abstract:
As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images and demonstrating strong generalization: it improves on the performance of the prior 3D VLM fivefold in out-of-domain settings.

Abstract:
A painting is more than just a picture on a wall; a painting is a process comprised of many intentional brush strokes, the shapes of which are an important component of a painting's overall style and message. Prior work in modeling brush stroke trajectories either does not work with real-world robotics or is not flexible enough to capture the complexity of human-made brush strokes. In this work, we introduce Spline-FRIDA which can model complex human brush stroke trajectories. This is achieved by recording artists drawing using motion capture, modeling the extracted trajectories with an autoencoder, and introducing a novel brush stroke dynamics model to the existing robotic painting platform FRIDA. We conducted a survey and found that our open-source Spline-FRIDA approach successfully captures the stroke styles in human drawings and that Spline-FRIDA's brush strokes are more human-like, improve semantic planning, and are more artistic compared to existing robot painting systems with restrictive Bezier curve strokes.

Abstract:
Road inspection is crucial for maintaining road serviceability and ensuring traffic safety, as road defects gradually develop and compromise functionality. Traditional inspection methods, which rely on manual evaluations, are labor-intensive, costly, and time-consuming. While data-driven approaches are gaining traction, the scarcity and spatial sparsity of real-world road defects present significant challenges in acquiring high-quality datasets. Existing simulators designed to generate detailed synthetic driving scenes, however, lack models for road defects. Moreover, advanced driving tasks that involve interactions with road surfaces, such as planning and control in defective areas, remain underexplored. To address these limitations, we propose a multi-modal sensor platform integrated with an urban digital twin (UDT) system for intelligent road inspection. First, hierarchical road models are constructed from real-world driving data collected using vehicle-mounted sensors, resulting in highly detailed representations of road defect structures and surface elevations. Next, digital road twins are generated to create simulation environments for comprehensive analysis and evaluation of algorithm performance. These scenarios are then imported into a simulator to facilitate both data acquisition and physical simulation. Experimental results demonstrate that driving tasks, including perception and decision-making, benefit significantly from the high-fidelity road defect scenes generated by our system.

Abstract:
We consider the problem of vision-based 6-DoF object pose estimation in the context of the notional Mars Sample Return campaign, in which a robotic arm would need to localize multiple objects of interest for low-clearance pickup and insertion, under severely constrained hardware. We propose a novel localization algorithm leveraging a custom renderer together with a new template matching metric tailored to the edge domain to achieve robust pose estimation using only low-fidelity, textureless 3D models as inputs. Extensive evaluations on synthetic datasets, as well as on imagery from physical testbeds on Earth and in situ Mars imagery, show that our method consistently beats the state of the art in compute- and memory-constrained localization, both in terms of robustness and accuracy, in turn enabling new possibilities for cheap and reliable localization on general-purpose hardware.

Abstract:
Sailboats are purely wind-driven and thus have great potential for long-term voyaging. For robotic sailboats, energy constraints are crucial to the sustainability of automation, and reducing the control frequency of actuators is key to energy conservation. This study proposes an energy-efficient long-short term (EeLsT) approach for sustainable sailing. Our approach can be generally applied as an energy management module in sailing robots. It explicitly leverages the sailing motion characteristics and the dynamic model of the robot considering marine disturbances. We have designed an experimental enhanced simulation platform to evaluate motion performance and energy consumption. Both a baseline approach and a scheme incorporating the EeLsT method are tested. In simulation, the EeLsT approach saves 31.8% of energy. In the real marine environment, experiments are conducted with OceanVoy. The results show that 27.4% of the energy is saved during stable sailing. In long-term sailing, compared to the standby mode in which the motors are not working, the average power of the full automation mode increases by no more than 1 W, i.e., 4% in relative terms.

Abstract:
Magneto-responsive soft materials have gained attention in biomedical engineering, with applications spanning robotics to regenerative medicine and drug delivery. These materials are the backbone of magnetic soft robots (MSRs), enabling customization of the magnetic domains that dictate their morphological capabilities and behavior. However, reliance on intuition to configure MSR magnetization profiles often results in a trial-and-error design approach, consuming time and resources. To address these challenges, this study optimizes an intelligent framework that uses a Covariance Matrix Adaptation Evolution Strategy (CMA-ES) along with a Material Point Method simulation environment to determine the magnetization profile of voxel-based MSRs to achieve ultimate performance. This study shows that unique, non-intuitive designs can be evolved. This intelligent design framework is linked to physical prototyping through additive manufacturing to realize these designs. Experimental validation of the generated designs confirms that the algorithm-based MSRs achieve a 10-fold increase in walking performance compared to the intuitively designed MSRs. This study also demonstrates the ability to improve upon both specific and random magnetization profiles and the ability to adapt to design constraints such as various modes of actuation. In general, the evolutionary algorithm, combined with physical prototyping, establishes an effective and efficient framework for the optimization of MSR behavior.

Abstract:
This paper presents a coordinated compliant control strategy for space manipulator systems (SMS) to enable safe and robust target capture during on-orbit servicing and assembly missions. The proposed controller operates in two distinct phases: free-space motion and contact interaction. In the free-space phase, accurate end-effector trajectory tracking is required, while in the contact phase, the control objective shifts to force regulation, maintaining contact forces within safe bounds and ensuring stable interaction durations to enhance operational safety. A key feature of the developed method is its ability to provide a smooth and continuous transition between free-space and contact phases without requiring controller switching. Furthermore, the controller preserves the attitude stability of the chaser spacecraft during manipulation, mitigating mission-critical risks such as communication loss or solar panel misalignment. The approach explicitly incorporates the coupled rigid-body dynamics of the SMS and demonstrates robustness to various real-world disturbances. Simulation results validate the effectiveness of the proposed controller. Future work includes experiments using the planar Space Robotics Emulator of our lab at the National Technical University of Athens to assess real-world readiness.

Abstract:
Sensor degradation in unstructured natural environments (manifesting as LiDAR point cloud sparsity or visual feature dropout) and out-of-sequence measurement challenges critically undermine localization robustness in autonomous systems. To address these limitations, we present STEAM-LIVO, a Spatio-Temporally Adaptive Manifold LiDAR-Inertial-Visual Odometry framework that enables tightly coupled multi-sensor fusion via a spatio-temporal manifold-driven iterative Kalman filter. The proposed method formulates an error-state iterative update mechanism on Lie group manifolds, executes IMU-centric real-time estimation, and ensures resilience under sensor degradation through an incremental observation model integrating LiDAR point-to-plane geometric residuals with visual feature reprojection errors within a shared filtering framework. Comprehensive evaluations in vegetated terrestrial landscapes and dynamic aquatic surfaces demonstrate an average relative pose error of 1.77%, with sustained robustness during partial sensor failures. Rigorous ablation studies further corroborate the efficacy of our spatio-temporal adaptive manifold architecture.

Abstract:
Legged robots are increasingly being adopted in industries such as oil, gas, mining, nuclear, and agriculture. However, new challenges exist when moving into natural, less-structured environments, such as forestry applications. This article presents a prototype system for autonomous, undercanopy forest inventory with legged platforms. Motivated by the robustness and mobility of modern legged robots, we introduce a system architecture that enabled a quadruped platform to autonomously navigate and map forest plots. Our solution involves a complete navigation stack for state estimation, mission planning, and tree detection and trait estimation. We report the performance of the system from trials executed over one and a half years in forests in three European countries. Our results with the ANYmal robot demonstrate that we can survey plots of up to 1 ha in under 30 min while also identifying trees with a typical diameter at breast height (DBH) accuracy of 2 cm. The findings of this project are presented as five lessons and challenges. In particular, we discuss the maturity of hardware development, state estimation limitations, open problems in forest navigation, future avenues for robotic forest inventory, and more general challenges to assess autonomous systems. By sharing these lessons and challenges, we offer insight and new directions for future research on legged robots, navigation systems, and applications in natural environments. Additional videos can be found in https://dy

Abstract:
Synchronization between a wearer and a lower limb powered prosthesis is important for effective control. Typically, phase variable-based phase estimation methods are employed. However, there is a noticeable lack of studies focusing on estimating the gait phase during stair descent, likely due to the difficulty in generating a reliable phase variable. In most studies, the thigh angle is used to generate phase variables for level walking because it follows a sinusoidal pattern. However, during stair descent, the thigh angle exhibits only a partially sinusoidal shape, making it challenging to apply the methods used for level walking. In this study, we propose a novel phase variable generation method to address the difficulty of using only the thigh angle for stair descent. To estimate the gait phase reliably, the phase variable is defined differently for the stance and swing phases: the hip position is used to generate the phase variable during the stance phase, and the thigh angle is used during the swing phase. These phase variables are then unified into a single phase variable (PV-ENT) for the entire gait cycle of stair descent. During this unification process, a non-smooth transition occurs around the phase transition point. To address this, a blending method is applied. The proposed method was validated using the data from 12 healthy subjects, collected through a motion capture system and IMU sensors. The results demonstrate a reliable phase estimation performance. Moreover, the blending method successfully improves the smoothness of the phase variable around the phase transition point without reducing the overall phase estimation performance.

Abstract:
Decentralized collaborative simultaneous localization and mapping (C-SLAM) is essential to enable multirobot missions in unknown environments without relying on preexisting localization and communication infrastructure. This technology is anticipated to play a key role in the exploration of the Moon, Mars, and other planets. In this article, we share insights and lessons learned from C-SLAM experiments involving three robots operating on a Mars analogue terrain and communicating over an ad hoc network. We examine the impact of limited and intermittent communication on C-SLAM performance, as well as the unique localization challenges posed by planetary-like environments. Additionally, we introduce a novel dataset collected during our experiments, which includes real-time peer-to-peer inter-robot throughput and latency measurements. This dataset aims to support future research on communication-constrained, decentralized multirobot operations.

Abstract:
Accurate, real-time Sea State Estimation (SSE) is crucial for the safety and operational efficiency of Autonomous Surface Vessels (ASVs). However, existing deep learning methods for this task commonly face three major challenges: the inherent class imbalance of marine environments, the ambiguous boundaries between discrete sea state levels, and the difficulty of extracting multi-scale temporal features from vessel motion. To address these challenges, this paper proposes a novel framework named the Class-guided Rare-boosted Multi-Scale Net (CRUISE). The framework is built upon a multi-scale encoder-decoder architecture and integrates two key innovations: a Rare-Boosted Class Embedding (RBCE) module at the network's bottleneck and a class-guided decoding mechanism. The RBCE module first generates a preliminary class prediction and then dynamically enhances the representation of rare sea state classes to create a class-balanced conditional vector. This vector subsequently provides top-down guidance to the decoder, injecting class-aware information by modulating the feature reconstruction process. This synergistic design fundamentally addresses the data imbalance problem at the feature level and effectively sharpens the decision boundaries between easily confused transitional sea states. Extensive experiments on multiple public benchmarks, simulated ship motion, and real-world datasets demonstrate that CRUISE significantly outperforms existing state-of-the-art methods, showing a pronounced advantage in improving the recognition accuracy of rare and high-risk sea states. Furthermore, real-time inference tests on a physical model vessel validate the model's performance on edge computing devices, further confirming its feasibility and robustness for deployment in real-world marine environments.

Abstract:
This paper presents an integrated control method in air road navigation for multi-UAV systems, combining an efficient reinforcement learning (RL) controller with a control barrier function (CBF)-based filter that guarantees flight safety. First, an air road construction method based on arbitrary quadrilateral combinations is proposed, which enables flexible air road design. Second, two specific CBFs are designed: an air road CBF which keeps UAVs within designed air roads, and a collision avoidance CBF which prevents collisions between UAVs. Based on the CBF-based filter, the RL controller is allowed to be trained in a simple, single-agent environment, which reduces computational costs and enhances training efficiency. Furthermore, the RL reward is carefully designed, which considers both the stability during movement and the optimality of energy conservation. The performance, safety, and efficiency of the proposed approach are rigorously validated through comprehensive simulations and real-world experiments.
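The idea of a CBF-based filter can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-integrator dynamics (velocity is commanded directly), a single circular keep-out region, and the hypothetical names `cbf_filter`, `x_obs`, `radius`, and `alpha`. The barrier is h(x) = ||x - x_obs||^2 - radius^2, and the filter returns the command closest to the nominal (for example, RL) command that satisfies grad h(x) . u + alpha * h(x) >= 0. With one linear constraint, this quadratic program has a closed-form solution.

```python
import numpy as np

def cbf_filter(x, u_nom, x_obs, radius, alpha=1.0):
    """Minimally modify u_nom so that h(x) = ||x - x_obs||^2 - radius^2 stays nonnegative.

    Assumes single-integrator dynamics x_dot = u (an illustrative simplification).
    """
    x, u_nom, x_obs = (np.asarray(v, dtype=float) for v in (x, u_nom, x_obs))
    diff = x - x_obs
    h = diff @ diff - radius**2
    a = 2.0 * diff                 # gradient of h with respect to x
    b = -alpha * h                 # CBF condition reads: a @ u >= b
    norm_sq = a @ a
    if norm_sq == 0.0:
        raise ValueError("x coincides with x_obs; the CBF gradient vanishes")
    slack = a @ u_nom - b
    if slack >= 0.0:
        return u_nom               # nominal command is already safe
    # Project u_nom onto the half-space {u : a @ u >= b}.
    return u_nom - slack / norm_sq * a

# A UAV at (2, 0) commanded straight toward an obstacle of radius 1 at the origin.
u_safe = cbf_filter(x=[2.0, 0.0], u_nom=[-5.0, 0.0], x_obs=[0.0, 0.0], radius=1.0)
print(u_safe)
```

In this example the approach speed toward the obstacle is reduced from 5 to 0.75, while a command pointing away from the obstacle passes through unchanged. Because the filter sits outside the learned controller, the controller itself can be trained without obstacles or other agents, which is the property the abstract relies on.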

Abstract:
Vision-based tactile soft sensors are increasingly applied to robotic perception and manipulation by leveraging high-resolution imaging during contact with environmental surfaces, thereby enabling more adaptable and robust interactions. Nonetheless, ensuring optimal contact force to achieve uniform, conformal, and stable contact between sensors and surfaces remains a key challenge, particularly within complex and unstructured environments. Inspired by the highly versatile suction cups of biological octopuses for environmental surface sensing, we introduce OcTac, a prototype that seamlessly combines adaptive adhesion capabilities with vision-based tactile perception. OcTac harnesses its self-guided adhesion mechanism and the intrinsic flexibility of soft materials to autonomously achieve alignment with target surfaces, even when initially misaligned at significant angles, facilitating tactile perception without relying on precise external control. We conducted experiments demonstrating that OcTac exhibits robust adaptive adhesion and self-detachment capabilities on surfaces with inclination angles ranging from 0° to 90°, as well as on surfaces with varying levels of roughness (with particle sizes up to 150 µm). On challenging inclined surfaces, OcTac's self-aligning adhesion mechanism enables stable and uniform contact, achieving a significant improvement in image uniformity by a factor of 4.53 compared to conventional vision-based tactile soft sensors. Additionally, we demonstrated OcTac mounted on a continuum soft robotic arm, enabling it to navigate around obstacles and perform surface perception, object recognition, and grasping tasks. This work presents a new approach for achieving adaptive tactile perception in complex environments by harnessing the inherent physical intelligence of soft adhesive materials.

Abstract:
Subretinal injection is a highly delicate procedure that demands micron-level precision to avoid irreversible retinal damage. Current robotic systems achieve accurate positioning but remain limited by retinal motion and the lack of tip-force feedback. We present the first adaptive tip-force compensation framework for robotic subretinal injection, fusing intraoperative optical coherence tomography (iOCT) vision with fiber Bragg grating (FBG) force sensing. Our architecture integrates a finite-state machine (FSM) for surgical phase coordination, a Long Short-Term Memory (LSTM) enhanced residual Kalman filter for real-time motion prediction, and an adaptive compliance estimator for safe force regulation. Compared to previous vision-only and force-only methods, ex vivo experiments on porcine eyes demonstrate robust improvements: the root-mean-square tracking error is reduced by 40% (to 18.5 μm), the maximum absolute error is lowered by a factor of 2.5, and 96.7% of tip forces are maintained within ±0.7 mN. Control delays were minimized to 0.25 s, enabling low-latency corrections beyond freehand capabilities. Our system enhances precision and safety in fragile retinal tissues, advancing the potential for reliable robot-assisted surgeries for retinal diseases.

Abstract:
Recent advances in deep monocular visual Simultaneous Localization and Mapping (SLAM) have achieved impressive accuracy and dense reconstruction capabilities, yet their robustness to scale inconsistency in large-scale indoor environments remains largely unexplored. Existing benchmarks are limited to room-scale or structurally simple settings, leaving critical issues of intra-session scale drift and inter-session scale ambiguity insufficiently addressed. To fill this gap, we introduce the ScaleMaster Dataset, the first benchmark explicitly designed to evaluate scale consistency under challenging scenarios such as multi-floor structures, long trajectories, repetitive views, and low-texture regions. We systematically analyze the vulnerability of state-of-the-art deep monocular visual SLAM systems to scale inconsistency, providing both quantitative and qualitative evaluations. Crucially, our analysis extends beyond traditional trajectory metrics to include a direct map-to-map quality assessment using metrics like Chamfer distance against high-fidelity 3D ground truth. Our results reveal that while recent deep monocular visual SLAM systems demonstrate strong performance on existing benchmarks, they suffer from severe scale-related failures in realistic, large-scale indoor environments. By releasing the ScaleMaster dataset and baseline results, we aim to establish a foundation for future research toward developing scale-consistent and reliable visual SLAM systems.

Abstract:
Dexterous grasping remains a central challenge in robotics due to the complexity of its high-dimensional state and action space. We introduce T(R,O) Grasp, a diffusion-based framework that efficiently generates accurate and diverse grasps across multiple robotic hands. At its core is the T(R,O) Graph, a unified representation that models spatial transformations between robotic hands and objects while encoding their geometric properties. A graph diffusion model, coupled with an efficient inverse kinematics solver, supports both unconditioned and conditioned grasp synthesis. Extensive experiments on a diverse set of dexterous hands show that T(R,O) Grasp achieves an average success rate of 94.83%, an inference time of 0.21 s, and a throughput of 41 grasps per second on an NVIDIA A100 40GB GPU, substantially outperforming existing baselines. In addition, our approach is robust and generalizable across embodiments while significantly reducing memory consumption. More importantly, the high inference speed enables closed-loop dexterous manipulation, underscoring the potential of T(R,O) Grasp to scale into a foundation model for dexterous grasping.

Abstract:
This paper investigates position control and force estimation for a hydraulic folded pouch actuator. First, experimental platforms are designed to characterize the actuator and the results show two key properties: (i) angular hysteresis when the motion direction reverses, and (ii) strong nonlinearity between liquid volume, pressure, and angle. For position control, we explore three strategies: fully open-loop control, observer-based control, and sensor-based closed-loop control with angle feedback. The closed-loop controller employs dynamically tuned PID gains and an MLP feedforward predictor. Under a sinusoidal reference, the closed-loop controller achieves mean absolute error (MAE) = 4.82° and root mean square error (RMSE) = 5.48°. For force estimation, we train both MLP and LSTM models using liquid volume, angle, pressure, and angular rate as features to predict the external force on the actuator. Compared to the MLP, the LSTM incorporates temporal dynamics, which allows it to capture force variations more effectively and generate smoother prediction results. Under dynamic loads, both models capture the applied force, with the LSTM yielding the lower errors (MAE = 0.96 mN·m, RMSE = 1.23 mN·m).
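For readers less familiar with the two error metrics quoted above, the following sketch shows how MAE and RMSE are commonly computed from a reference signal and a measured signal. The helper name `mae_rmse` and the sample values are illustrative assumptions, not data from the paper.

```python
import numpy as np

def mae_rmse(reference, measured):
    """Return (MAE, RMSE) of the tracking error measured - reference."""
    err = np.asarray(measured, dtype=float) - np.asarray(reference, dtype=float)
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err**2)))
    return mae, rmse

# Illustrative values only: two samples with errors of 3 and -4 degrees.
mae, rmse = mae_rmse(reference=[10.0, 20.0], measured=[13.0, 16.0])
print(mae, rmse)
```

RMSE weights large errors more heavily and is never smaller than MAE, which is consistent with each RMSE reported above exceeding the corresponding MAE.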

Abstract:
This paper presents an embedded soft sensor for proprioceptive feedback in a soft continuum actuator (SCA) forming the neck of the social robot HARU. The sensor is fabricated in a single-step multi-material additive manufacturing process, co-extruding conductive and non-conductive thermoplastic polyurethane to form an integrated structure. Several sensor geometries are evaluated, with a gauge-type configuration selected based on linearity and repeatability criteria. The design is embedded in a cross-configuration to measure the actuator's two dominant degrees of freedom, pitch and roll. Sensor signals are mapped to angle estimates using linear regression, a static neural network, and a continual-learning framework that updates parameters online. Experiments involving predefined trajectories, randomized motions, and repeated test cycles show that the continual-learning model achieves R² > 0.97 and mean absolute errors below 1 degree, consistently improving upon the baseline models. The results demonstrate the feasibility of directly embedding 3D-printed soft sensors into functional actuators and highlight the role of adaptive learning in supporting long-term soft robotic proprioception.

Abstract:
Visual Place Recognition (VPR) based on Dynamic Vision Sensors (DVSs) has gained attention due to their high temporal resolution and robustness under challenging lighting conditions. However, the sparse and asynchronous event stream output of DVS introduces unique challenges for effective VPR. In this paper, we propose ST-HNet, a novel framework for VPR that introduces improvements in event representation, spatio-temporal feature extraction, and loss design. Specifically, we introduce a compact and efficient event representation called Bipolar Binary Voxel Grid (BBVG). Then, we propose a hybrid feature extractor that combines a Convolutional Neural Network (CNN) for spatial encoding and a Liquid State Machine (LSM) for temporal aggregation. We refer to this combination as a CNN-LSM hybrid architecture. Moreover, we introduce a soft-margin triplet loss to better accommodate the gradual transitions between nearby locations in the event-based VPR task. Extensive experiments conducted on the Brisbane-Event-VPR and DDD20 datasets demonstrate that our method outperforms state-of-the-art approaches, achieving improvements of 11% and 23% in Recall@1 performance, respectively.
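As a rough illustration of what a soft-margin triplet loss can look like, the sketch below uses the common softplus form log(1 + exp(d(a, p) - d(a, n))), which replaces the hard hinge and its fixed margin with a smooth penalty. This is an assumption about the general technique; the exact formulation used in ST-HNet may differ, and the function name and Euclidean distance are illustrative choices.

```python
import numpy as np

def soft_margin_triplet_loss(anchor, positive, negative):
    """Mean softplus of (anchor-positive distance minus anchor-negative distance)."""
    anchor, positive, negative = (
        np.asarray(v, dtype=float) for v in (anchor, positive, negative)
    )
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    # logaddexp(0, z) = log(1 + exp(z)), evaluated in a numerically stable way.
    return float(np.mean(np.logaddexp(0.0, d_pos - d_neg)))

# One triplet of 2-D descriptors: the positive is closer to the anchor than the negative.
loss_good = soft_margin_triplet_loss([[0.0, 0.0]], [[0.1, 0.0]], [[1.0, 0.0]])
# Degenerate triplet: positive and negative are equally far from the anchor.
loss_tie = soft_margin_triplet_loss([[0.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]])
print(loss_good, loss_tie)
```

Unlike a hinge loss, this penalty never drops exactly to zero, so triplets that are already well separated still contribute a small gradient. That smoothness is one plausible reason such a loss suits places that change gradually along a route.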

Abstract:
The viability of long-distance telesurgery hinges on reliable network Quality of Service (QoS), yet the impact of realistic network degradations on task performance is not sufficiently understood. This paper presents a comprehensive analysis of how packet loss, delay, and communication loss affect telesurgical task execution. We introduce NetFI, a novel fault injection tool that emulates different network conditions using stochastic QoS models informed by real-world network data. By integrating NetFI with a surgical simulation platform, we conduct a user study involving 15 participants at three proficiency levels, performing a standardized Peg Transfer task under varying levels of packet loss, delay, and communication loss. We analyze the effect of network QoS on overall task performance and on fine-grained motion primitives (MPs) using objective performance and safety metrics and the operators' subjective perception of workload. We identify specific MPs vulnerable to network degradation and find strong correlations between proficiency, objective performance, and subjective workload. These findings offer quantitative insights into the operational boundaries of telesurgery. Our open-source tools and annotated dataset provide a foundation for developing robust and network-aware control and mitigation strategies.

Abstract:
We present DisFlow, a novel framework for online scene flow estimation from a distance field that enables 6DoF dynamic object pose estimation, motion tracking, and surface reconstruction. The scene is represented by Gaussian Process Implicit Surfaces (GPIS), with surface normals serving as derivative constraints, enabling accurate signed distance computations near the surface and gradient queries with uncertainty. With this representation as a foundation, we compute a scene flow from the distance field that describes how surface points are transported over time in consecutive frames. Through our flow, we can estimate an object's pose and motion by incrementally registering a newly observed point cloud via an elegant closed-form optimisation. Unlike prior methods that operate in the camera or world frame, our approach performs probabilistic fusion directly in the object frame, where the object remains geometrically consistent over time. The tight coupling of the DisFlow method in space and time yields dense geometry, surface normals, object pose trajectories, velocities, and uncertainty, all at real-time rates. We evaluate DisFlow on dynamic object sequences and demonstrate that it achieves accurate pose and motion tracking while simultaneously reconstructing high-quality object surfaces.

Abstract:
Human safety is critical in applications involving close human-robot interactions (HRI) and is a key aspect of physical compatibility between humans and robots. While measures of human safety in HRI exist, these mainly target industrial settings involving robotic manipulators. Less attention has been paid to settings where mobile robots and humans share the space. This paper introduces a new robot-centered directional framework of human safety. It is particularly useful for evaluating mobile robots as they operate in environments populated by multiple humans. The framework integrates several key metrics, such as each human's relative distance, speed, and orientation. The core novelty lies in the framework's flexibility to accommodate different application requirements while allowing for both the robot-centered and external observer points of view. We instantiate the framework by using RGB-D based vision integrated with a deep learning-based human detection pipeline to yield a proxemics-guided generalized safety index (GSI) that instantaneously assesses human safety. We extensively validate GSI's capability of producing appropriate and fine-grained safety measures in real-world experimental scenarios and demonstrate its superior efficacy against extant safety models.

Abstract:
To establish a foundational understanding for creating J-shaped trajectories with Concentric Tube Steerable Drilling Robots (CT-SDRs), this paper presents a systematic characterization of two operational factors: drill feed rate and rotational speed. We developed and compared a custom High-Speed Drill (HSD) and a Low-Speed Drill (LSD) to analyze how these parameters affect performance in flexible robotic drills versus conventional systems utilizing rigid instruments. By integrating the CT-SDRs with a seven degree-of-freedom robotic manipulator, we conducted experiments in synthetic bone phantoms of varying densities, assessing metrics such as motor current, hole diameter, radius of curvature, and drilling time. The results reveal critical performance trade-offs, demonstrating that high-speed drilling in CT-SDRs is essential for successfully penetrating dense bone. Further, we found that while slower feed rates improve trajectory accuracy and reduce hole enlargement, they significantly increase procedural time. These findings offer a quantitative guideline for design choices, component selection, and operational control of CT-SDRs tailored to patient-specific bone quality.

Abstract:
Mobile manipulators are designed to perform complex sequences of navigation and manipulation tasks in human-centered environments. While recent optimization-based methods such as Hierarchical Task Model Predictive Control (HTMPC) enable efficient multitask execution with strict task priorities, they have so far been applied mainly to static or structured scenarios. Extending these approaches to dynamic human-centered environments requires predictive models that capture how humans react to the actions of the robot. This work introduces Safe Mobile Manipulation with Interactive Human Prediction via Task-Hierarchical Bilevel Model Predictive Control (SM2ITH), a unified framework that combines HTMPC with interactive human motion prediction through bilevel optimization that jointly accounts for robot and human dynamics. The framework is validated on two different mobile manipulators, the Stretch 3 and the RidgebackUR10, across three experimental settings: (i) delivery tasks with different navigation and manipulation priorities, (ii) sequential pick-and-place tasks with different human motion prediction models, and (iii) interactions involving adversarial human behavior. Our results highlight how interactive prediction enables safe and efficient coordination, outperforming baselines that rely on weighted objectives or open-loop human models. Code: https://github.com/utiasDSL/sm2ith.git

Abstract:
Dynamic modeling and control are critical to unlocking soft robots' potential, yet remain challenging due to complex constitutive behaviors and real-world operating conditions. Bio-inspired musculoskeletal robots, which integrate rigid skeletons with soft actuators, combine the advantages of heavy load-bearing capacity and inherent flexibility. Although actuation dynamics has been studied through experimental methods and surrogate models, accurate and effective modeling and simulation still pose a significant challenge when soft actuators are applied at a large scale, especially in hybrid rigid-soft robots with continuously distributed mass, kinematic loops and diverse motion modes. To address these challenges, we propose EquiMus, an energy-equivalent dynamic modeling and MuJoCo-based simulation for musculoskeletal rigid-soft hybrid robots with linear elastic actuators. The equivalence and effectiveness are proven in detail and examined through simulations and real experiments on a bionic robotic leg. EquiMus further demonstrates utility for downstream tasks, including controller design and learning-based control.

Abstract:
Advancements in physical Human-Robot Interaction (pHRI) aim to achieve natural and efficient collaboration between humans and robots, especially in dynamic environments where task performance is essential. This study focuses on co-manipulative human-robot joint activities, exploring key components of performance and synchronization. The primary objective was to design an active control technique for the iCub robot's arms that enhances task efficiency using an approach distinct from traditional force feedback control. A comparison of the iCub's passive behavior with the designed active one showed an increase in the robot's contribution, achieved through adaptive velocity and mimicry, showcasing its ability to respond dynamically to changes in human actions. Furthermore, a measurement of the exertion applied by the counterparts revealed that the active behavior required greater energy consumption to reach those levels of synchronization and performance. These results highlight the implications of balancing active behavior with effort intensity to achieve task efficiency in pHRI.

Abstract:
Factor graphs have demonstrated remarkable efficiency for robotic perception tasks, particularly in localization and mapping applications. However, their application to optimal control problems, especially Model Predictive Control (MPC), has remained limited due to fundamental challenges in constraint handling. This paper presents a novel integration of the Barrier Interior Point Method (BIPM) with factor graphs, implemented as an open-source extension to the widely adopted g2o framework. Our approach introduces specialized inequality factor nodes that encode logarithmic barrier functions, thereby overcoming the quadratic-form limitations of conventional factor graph formulations. To the best of our knowledge, this is the first g2o-based implementation capable of efficiently handling the constraints within a unified optimization backend. We validate the method through a multi-objective adaptive cruise control application for autonomous vehicles. Benchmark comparisons with state-of-the-art constraint-handling techniques demonstrate faster convergence and improved computational efficiency. (Code repository: https://github.com/snt-arg/bipm_g2o)
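To make the logarithmic-barrier idea concrete, here is a minimal sketch of a generic barrier interior-point loop on a one-dimensional toy problem. It is not the paper's g2o implementation: the problem (minimise (x - 3)^2 subject to x <= 1), the function name `barrier_solve`, and all parameter values are assumptions chosen only for illustration.

```python
def barrier_solve(x0=0.0, mu=1.0, shrink=0.2, outer=12, inner=30):
    """Minimise (x - 3)^2 subject to x <= 1 with a log barrier (toy example).

    The barrier objective is (x - 3)^2 - mu * log(1 - x); mu is reduced
    after each outer iteration so the minimiser approaches the constraint.
    """
    x = x0
    for _ in range(outer):
        for _ in range(inner):
            s = 1.0 - x                      # slack, must stay strictly positive
            grad = 2.0 * (x - 3.0) + mu / s  # derivative of the barrier objective
            hess = 2.0 + mu / (s * s)        # second derivative (always positive)
            step = -grad / hess              # Newton step
            t = 1.0
            # Halve the step until the new point is strictly feasible.
            while 1.0 - (x + t * step) <= 0.0:
                t *= 0.5
            x += t * step
            if abs(step) < 1e-12:
                break
        mu *= shrink                         # tighten the barrier
    return x


if __name__ == "__main__":
    print(barrier_solve())  # a value just below the constraint boundary x = 1
```

The unconstrained minimiser (x = 3) violates the constraint, so the iterates settle just inside the boundary at x = 1 as the barrier weight shrinks.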

Abstract:
As multi-robot collaboration becomes increasingly prevalent in modern industrial settings, ensuring collision-free operation among robots sharing the same workspace remains a critical challenge. This paper proposes an integrated framework that combines 3D Gaussian Splatting (3D-GS) for high-fidelity scene reconstruction, Generalized Iterative Closest Point (GICP) with Fast Global Registration (FGR) for robust pose estimation, a Deep Graph Convolutional Neural Network (DGCNN) for joint angle regression from point cloud data, Dynamic Mode Decomposition (DMD) for trajectory prediction, and a Control Barrier Function (CBF) for real-time safety enforcement. Through experiments, we validated the trajectory prediction of 0DOF objects and confirmed that joint angle prediction is possible from 3D-GS-based PLY data using DGCNN-based regression, utilizing joint angle training data collected at intervals of 15 to 45 degrees.

Abstract:
Model Predictive Control (MPC) is widely adopted for agile multirotor vehicles, yet achieving both stability and obstacle-free flight is particularly challenging when a payload is suspended beneath the airframe. This paper introduces a Safety Enhanced Passivity-Based Nonlinear MPC (SEP-NMPC) that provides formal guarantees of stability and safety for a quadrotor transporting a slung payload through cluttered environments. Stability is enforced by embedding a strict passivity inequality, which is derived from a shaped energy storage function with adaptive damping, directly into the NMPC. This formulation dissipates excess energy and ensures asymptotic convergence despite payload swings. Safety is guaranteed through high-order control barrier functions (HOCBFs) that render user-defined clearance sets forward-invariant, obliging both the quadrotor and the swinging payload to maintain separation while interacting with static and dynamic obstacles. The optimization remains quadratic-program compatible and is solved online at each sampling time without gain scheduling or heuristic switching. Extensive simulations and real-world experiments confirm stable payload transport, collision-free trajectories, and real-time feasibility across all tested scenarios. The SEP-NMPC framework therefore unifies passivity-based closed-loop stability with HOCBF-based safety guarantees for UAV slung-payload transportation.
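As a rough illustration of how a barrier condition renders a clearance set forward-invariant, the sketch below applies a first-order control barrier function to a planar single integrator avoiding a circular obstacle. This is a deliberately simplified stand-in, not the paper's HOCBF-constrained NMPC: the dynamics, the names `cbf_filter` and `simulate`, and all numeric values are assumptions.

```python
def cbf_filter(p, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that dh/dt + alpha * h >= 0.

    h(p) = ||p - center||^2 - radius^2 and the dynamics are p' = u.
    With a single affine constraint the QP has a closed-form projection.
    Assumes p is outside the obstacle, so p != center.
    """
    dx, dy = p[0] - center[0], p[1] - center[1]
    h = dx * dx + dy * dy - radius * radius
    a = (2.0 * dx, 2.0 * dy)                 # gradient of h
    slack = a[0] * u_nom[0] + a[1] * u_nom[1] + alpha * h
    if slack >= 0.0:
        return u_nom                         # nominal input is already safe
    k = -slack / (a[0] ** 2 + a[1] ** 2)     # smallest correction along a
    return (u_nom[0] + k * a[0], u_nom[1] + k * a[1])


def simulate(steps=400, dt=0.01):
    """Drive toward a goal behind the obstacle and track the minimum h."""
    p, goal = (-2.0, 0.1), (2.0, 0.0)
    center, radius = (0.0, 0.0), 1.0
    min_h = float("inf")
    for _ in range(steps):
        u_nom = (goal[0] - p[0], goal[1] - p[1])   # proportional nominal input
        u = cbf_filter(p, u_nom, center, radius)
        p = (p[0] + dt * u[0], p[1] + dt * u[1])   # explicit Euler step
        h = (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2 - radius ** 2
        min_h = min(min_h, h)
    return p, min_h
```

In this toy setting the barrier value stays non-negative along the simulated trajectory, which is the discrete analogue of the forward-invariance property the abstract describes.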

Abstract:
Several machine learning (ML)-based measurement systems have been proposed to estimate difficult-to-measure quantities from the values of distance sensor arrays. However, variations in sensor output characteristics (OCs) can lead to degradation in the estimation accuracy when transferring training data acquired from the original acquisition sensors to new target sensors. Moreover, acquiring training data from target sensors is time- and labor-intensive. We propose two methods to convert previously collected training data to reflect different OCs, enabling their repeated use. For evaluation, we use a device that estimates the relative position and orientation of vehicles based on the values of distance sensor arrays. The correction approach for the training data based on the OC data reduces the root-mean-square error (RMSE) by up to 23% compared with transferring training data. The augmentation approach transforms the training data into data that include different OCs using a mapping function constructed from a small batch of training data. Furthermore, a method for collecting a small batch of training data to achieve a higher OC conversion accuracy is demonstrated. The RMSE is reduced by up to 58% by the proposed method compared with transferring training data. The results of this study demonstrate the feasibility of the practical applications of ML-based measurement systems using distance sensor arrays, which may facilitate the development of simple and fast calibration methods.

Abstract:
Drifting, characterized by controlled vehicle motion at high sideslip angles, is crucial for safely handling emergency scenarios at the friction limits. While recent reinforcement learning approaches show promise for drifting control, they struggle with the significant simulation-to-reality gap, as policies that perform well in simulation often fail when transferred to physical systems. In this paper, we present a reinforcement learning framework with GPU-accelerated parallel simulation and systematic domain randomization that effectively bridges the gap. The proposed approach is validated on both simulation and a custom-designed and open-sourced 1/10 scale Individual Wheel Drive (IWD) RC car platform featuring independent wheel speed control. Experiments across various scenarios from steady-state circular drifting to direction transitions and variable-curvature path following demonstrate that our approach achieves precise trajectory tracking while maintaining controlled sideslip angles throughout complex maneuvers in both simulated and real-world environments.

Abstract:
This paper presents the design, control, and experimental validation of a lightweight hip exoskeleton for walking assistance. By integrating quasi-direct drive actuators, single-piece stainless steel frames, and passive revolute joints, the device achieves a high torque-to-mass ratio while maintaining a compact and lightweight structure. A delayed output feedback control strategy synchronizes assistive torque with the gait cycle by actively leading the wearer's hip motion, with user studies identifying a consistent optimal phase difference across participants and walking speeds, eliminating repeated calibration. Surface electromyography validates the assistance, demonstrating substantial reductions in activation of the vastus medialis and vastus lateralis at the optimal time delay. Power analysis further confirms that this setting maximizes positive power transfer while minimizing resistive effects. The proposed exoskeleton delivers physiologically meaningful and energetically efficient hip assistance suitable for everyday mobility support.

Abstract:
Accurate depth estimation plays a critical role in the navigation of endoscopic surgical robots, forming the foundation for 3D reconstruction and safe instrument guidance. Fine-tuning pretrained models heavily relies on endoscopic surgical datasets with precise depth annotations. While existing self-supervised depth estimation techniques eliminate the need for accurate depth annotations, their performance degrades in environments with weak textures and variable lighting, leading to sparse reconstruction with invalid depth estimation. Depth completion using sparse depth maps can mitigate these issues and improve accuracy. Despite the advances in depth completion techniques in general fields, their application in endoscopy remains limited. To overcome these limitations, we propose EndoDDC, an endoscopy depth completion method that integrates images and sparse depth information with depth gradient features and optimizes depth maps through a diffusion model, addressing the issues of weak texture and light reflection in endoscopic environments. Extensive experiments on two publicly available endoscopy datasets show that our approach outperforms state-of-the-art models in both depth accuracy and robustness. This demonstrates the potential of our method to reduce visual errors in complex endoscopic environments. Our code will be released at https://github.com/yinheng-lin/EndoDDC.

Abstract:
This paper provides a new repetitive control framework for robot manipulators with periodic reference and disturbance signals. We first apply the inverse dynamics (ID) approach to a robot manipulator to transform its nonlinear input/output behavior into an equivalent linear time-invariant (LTI) system, for which the conventional repetitive control strategy is employed. To facilitate an optimal controller synthesis and an associated stability analysis, we next derive the so-called delay-feedback system. We then provide two linear matrix inequality (LMI)-based optimal controller synthesis procedures for minimizing the H∞ and the generalized H2 norms from the disturbance to the tracking error, respectively. We next establish operator-theoretic stability assertions in terms of the monodromy operator. In particular, a necessary and sufficient condition for the exponential stability of the delay-feedback system is derived for the case without external disturbances, and we show that the delay-feedback system is input-to-state stable if it is exponentially stable. Finally, experimental comparisons are given to demonstrate the overall developed arguments.

Abstract:
Precise localization with respect to a set of objects of interest enables mobile robots to perform various tasks. With the rise of edge devices capable of deploying deep neural networks (DNNs) for real-time inference, it stands to reason to use artificial intelligence (AI) for the extraction of object-specific, semantic information from raw image data, such as the object class and the relative six degrees of freedom (6-DoF) pose. However, fusing such AI-based measurements in an Extended Kalman Filter (EKF) requires quantifying the DNNs' uncertainty and outlier rejection capabilities. This paper presents the benefits of reformulating the measurement equation in AI-based, object-relative state estimation. By deriving an EKF using the direct object-relative pose measurement, we can decouple the position and rotation measurements, thus limiting the influence of erroneous rotation measurements and allowing partial measurement rejection. Furthermore, we investigate the performance and consistency improvements for state estimators provided by replacing the fixed measurement covariance matrix of the 6-DoF object-relative pose measurements with the predicted aleatoric uncertainty of the DNN.
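The partial measurement rejection described above can be sketched as a per-block gating test. The example below is a simplified illustration under stated assumptions, not the paper's estimator: it treats the innovation covariance of each block as diagonal, uses the 95% chi-square threshold for three degrees of freedom (7.815), and the names `nis` and `gate_pose` are hypothetical. The per-axis variances could, for instance, come from a network's predicted aleatoric uncertainty.

```python
CHI2_3DOF_95 = 7.815  # 95% chi-square quantile for 3 degrees of freedom


def nis(innovation, variances):
    """Normalised innovation squared for a diagonal innovation covariance."""
    return sum(r * r / v for r, v in zip(innovation, variances))


def gate_pose(pos_innov, rot_innov, pos_var, rot_var, threshold=CHI2_3DOF_95):
    """Gate the position and rotation blocks independently.

    Because the two blocks are decoupled, an outlier in rotation does not
    force the filter to discard a consistent position measurement.
    """
    return {
        "position": nis(pos_innov, pos_var) <= threshold,
        "rotation": nis(rot_innov, rot_var) <= threshold,
    }


if __name__ == "__main__":
    result = gate_pose(
        pos_innov=[0.1, 0.1, 0.1], rot_innov=[1.0, 0.0, 0.0],
        pos_var=[0.01, 0.01, 0.01], rot_var=[0.01, 0.01, 0.01],
    )
    print(result)  # position accepted, rotation rejected
```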

Abstract:
This paper investigates the multi-robot efficient search (MuRES) problem in uncertain topological networks. One unique characteristic of the studied problem is that the topology of the underlying network is uncertain, posing great challenges to canonical MuRES solutions, which presume a fixed network topology. To address the challenge, this paper proposes the STructure-Adaptive Graph-Encoded policy gradient (STAGE) algorithm for moving target search. STAGE comprises two main components: (1) the bi-scale graph attention network (GAT) encoder, which fuses a k-hop local GAT with a distance-augmented long-range GAT to enable the encoder to capture both local and long-range network structural changes; and (2) the entropy-regularized counterfactual policy gradient module, which employs a structure-aware centralized critic to estimate both the team returns and the network structure information, and trains the decentralized actors via counterfactual marginalization with entropy regularization. Extensive simulation results and a physical experiment demonstrate the feasibility and superiority of STAGE for solving MuRES in uncertain topological environments.

Abstract:
In imitation learning, robots are supposed to learn from demonstrations of the desired behavior. Most of the work in imitation learning for swarm robotics provides the demonstrations as rollouts of an existing policy. In this work, we provide a framework based on generative adversarial imitation learning that aims to learn collective behaviors from human demonstrations. Our framework is evaluated across six different missions, learning both from manual demonstrations and demonstrations derived from a PPO-trained policy. Results show that the imitation learning process is able to learn qualitatively meaningful behaviors that perform comparably to the provided demonstrations. Additionally, we deploy the learned policies on a swarm of TurtleBot 4 robots in real-robot experiments. The exhibited behaviors preserve their visually recognizable character, and their performance is comparable to that achieved in simulation.

Abstract:
Deep reinforcement learning (DRL) is a core technology for mobile robot navigation in diverse environments, yet existing Behavior Cloning (BC)-enhanced DRL methods suffer from two critical challenges: fixed imitation constraints suppress autonomous exploration in late training stages despite stabilizing early learning, and goal-obstacle avoidance task conflicts impede robust action selection during navigation. To address these issues, this paper proposes an Adaptive Strategy Deep Reinforcement Learning (ADRL) method, which reformulates BC as a progressively released transitional constraint and builds a stage-aware transition framework for robot navigation. Specifically, ADRL dynamically fuses Twin Delayed Deep Deterministic Policy Gradient (TD3) with BC via a value-driven imitation scheduling mechanism, which adaptively modulates the expert-online data mixing ratio and BC regularization strength based on critic feedback to accelerate convergence and realize a smooth shift from imitation-dominant to exploration-driven learning. A phase-aligned dynamic weight composite reward function is designed, which embeds motion constraints and stage-aware priority adjustment to mitigate reward sparsity and align learning objectives with policy maturity. Additionally, a lightweight adaptive replanning mechanism is developed as an evaluation stabilizer, which generates obstacle-avoiding waypoints by obstacle density when the robot stagnates, resolving goal-obstacle avoidance conflicts without al

Abstract:
Accurate seabed mapping is essential for habitat monitoring and infrastructure inspection. In turbid, shallow coastal waters, such as shellfish aquaculture farms, the effectiveness of traditional optical methods is limited. Autonomous surface vehicles (ASVs) equipped with forward-looking sonar (FLS) offer a promising alternative. However, existing sonar-based systems struggle to achieve fine-resolution mapping over long trajectories due to low-resolution positioning measurements and accumulated drift. In this paper, we present a drift-resilient seabed mapping framework that integrates local FLS frame alignment using the Fourier-Mellin transform (FMT) with global trajectory optimization based on an extended Kalman filter (EKF) that fuses global positioning system (GPS), inertial measurement unit (IMU), and compass data. A variance-based image blending strategy is used to further reduce visual artifacts in overlapping regions. Field trials on a structured oyster farm site show that our framework helps reduce drift, measured by RMSE, by 9.5% relative to the FMT-only baseline. This framework also enables sub-meter reconstruction accuracy and preservation of high-resolution textures needed for oyster inventory estimation within the mapped areas.

Abstract:
This work presents a geometric backstepping controller for a variable-tilt omnidirectional multirotor that explicitly accounts for both servo and rotor dynamics. Considering actuator dynamics is essential for more effective and reliable operation, particularly during aggressive flight maneuvers or recovery from sudden disturbances. While prior studies have investigated actuator-aware control for conventional and fixed-tilt multirotors, these approaches rely on linear relationships between actuator input and wrench, which cannot capture the nonlinearities induced by variable tilt angles. In this work, we exploit the cascade structure between the rigid-body dynamics of the multirotor and its nonlinear actuator dynamics to design the proposed backstepping controller and establish exponential stability of the overall system. Furthermore, we reveal parametric uncertainty in the actuator model through experiments, and we demonstrate that the proposed controller remains robust against such uncertainty. The controller was compared against a baseline that does not account for actuator dynamics across three experimental scenarios: fast translational tracking, rapid rotational tracking, and recovery from sudden disturbance. The proposed method consistently achieved better tracking performance, and notably, while the baseline diverged and crashed during the fastest translational trajectory tracking and the recovery experiment, the proposed controller maintained stability and successfully completed the tasks, thereby demonstrating its effectiveness.

Abstract:
Recent advances in geometric foundation models have emerged as a promising alternative for addressing the challenge of dense reconstruction in monocular visual simultaneous localization and mapping (SLAM). Although geometric foundation models enable SLAM to leverage variable input views, previous methods remain confined to two-view pairs or fixed-length inputs without sufficient consideration of geometric context for view selection. To tackle this problem, we propose AIM-SLAM, a dense monocular SLAM framework that exploits an adaptive and informative multi-view keyframe prioritization with dense pointmap predictions from the visual geometry grounded transformer (VGGT). Specifically, we introduce the selective information and geometric-aware multi-view adaptation (SIGMA) module, which employs voxel overlap and information gain to retrieve a candidate set of keyframes and adaptively determine its size. Furthermore, we formulate a joint multi-view Sim(3) optimization that enforces consistent alignment across selected views, substantially improving pose estimation accuracy. The effectiveness of AIM-SLAM is demonstrated on real-world datasets, where it achieves state-of-the-art pose estimation performance and accurate dense reconstruction results. Our system supports ROS integration, with code available at https://aimslam.github.io/.

Abstract:
Efficiently decoding human movement and/or intention is essential for controlling advanced prosthetic and robotic systems. Various muscle-machine interfaces have been researched for this purpose, including electromyography- and lightmyography-based interfaces. However, the decoding effectiveness of lightmyography signals for multi-finger hand motions remains insufficiently explored. This study investigates the decoding of human multi-finger movements using different machine learning methods. Lightmyography and finger motion data were collected from six participants grasping five common objects. Data were preprocessed using the sliding window method and decoded using three machine learning algorithms: random forest, convolutional neural networks, and multi-layer perceptron. Moreover, models were trained in a grasp-specific manner, which increased decoding accuracy. Finally, statistical analysis demonstrated that the random forest model significantly outperformed the other methods, establishing it as the most suitable technique for decoding multi-finger motions from lightmyography signals.
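The sliding window preprocessing mentioned above can be sketched as follows. The abstract does not state the window length, stride, or features used, so the values and the mean-absolute-value feature here are assumptions for illustration only, as are the function names.

```python
def sliding_windows(samples, window, stride):
    """Split a multichannel recording into overlapping windows.

    samples: list of time steps, each a list of per-channel readings.
    Returns a list of windows, each holding `window` consecutive time steps.
    """
    last_start = len(samples) - window
    return [samples[i:i + window] for i in range(0, last_start + 1, stride)]


def mean_abs_features(win):
    """Mean absolute value per channel, one feature vector per window."""
    n_channels = len(win[0])
    return [sum(abs(row[c]) for row in win) / len(win) for c in range(n_channels)]


if __name__ == "__main__":
    # Hypothetical two-channel signal with ten time steps.
    signal = [[t, -t] for t in range(10)]
    windows = sliding_windows(signal, window=4, stride=2)
    features = [mean_abs_features(w) for w in windows]
    print(len(windows), features[0])
```

Each feature vector would then be paired with the finger motion label for that window before being passed to a classifier or regressor such as a random forest.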

Abstract:
World models enable robots to imagine future observations given current observations and planned actions, and have been increasingly adopted as generalized dynamics models to facilitate robot learning. Despite their promise, these models remain brittle when encountering novel visual distractors such as objects and background elements rarely seen during training. Specifically, novel distractors can corrupt action outcome predictions, causing downstream failures when robots rely on the world model imaginations for planning or action verification. In this work, we propose Reimagination with Observation Intervention (ReOI), a simple yet effective test-time strategy that enables world models to predict more reliable action outcomes in open-world scenarios where novel and unanticipated visual distractors are inevitable. Given the current robot observation, ReOI first detects visual distractors by identifying which elements of the scene degrade in physically implausible ways during world model prediction. Then, it modifies the current observation to remove these distractors and bring the observation closer to the training distribution. Finally, ReOI reimagines future outcomes with the modified observation and reintroduces the distractors post-hoc to preserve visual consistency for downstream planning and verification. We validate our approach on a suite of robotic manipulation tasks in the context of action verification, where the verifier needs to select desired action plans based on predictions from a world model. Our results show that ReOI is robust to both in-distribution and out-of-distribution visual distractors. Notably, it improves task success rates by up to 3× in the presence of novel distractors, significantly outperforming action verification that relies on world model predictions without imagination interventions.

Abstract:
This paper explores traversability estimation for robot navigation. A key bottleneck in traversability estimation lies in efficiently achieving reliable and robust predictions while accurately encoding both geometric and semantic information across diverse environments. We introduce Navigation via Mixture of Experts (NAVMOE), a hierarchical and modular approach for traversability estimation and local navigation. NAVMOE combines multiple specialized models, each of which can be either a classical model-based or a learning-based approach that predicts traversability for specific terrain types. NAVMOE dynamically weights the contributions of different models based on the input environment through a gating network. Overall, our approach offers three advantages: First, NAVMOE enables traversability estimation to adaptively leverage specialized approaches for different terrains, which enhances generalization across diverse and unseen environments. Second, our approach significantly improves efficiency at negligible cost to solution quality by introducing a training-free lazy gating mechanism, which is designed to minimize the number of activated experts during inference. Third, our approach uses a two-stage training strategy that enables training of the gating networks within the hybrid MoE method, which contains nondifferentiable modules. Extensive experiments show that NAVMOE delivers a better balance of efficiency and performance than any individual expert or the full ensemble across different domains, improving cross-domain generalization and reducing average computational cost by 81.2% via lazy gating, with less than a 2% loss in path quality.
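One way to picture gating that activates only a few experts is sketched below. The abstract does not specify how the lazy gating mechanism works, so this is purely an assumed scheme: experts are evaluated in order of gating weight until a chosen share of the probability mass is covered, and the remaining experts are never run. The names `softmax` and `lazy_mixture` and the `mass` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gating logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lazy_mixture(x, experts, gate_logits, mass=0.9):
    """Evaluate only the highest-weighted experts covering `mass` of the gate.

    experts: list of callables mapping an input to a scalar traversability score.
    Returns the renormalised weighted score and the indices of activated experts.
    """
    weights = softmax(gate_logits)
    order = sorted(range(len(experts)), key=lambda i: weights[i], reverse=True)
    active, covered = [], 0.0
    for i in order:
        active.append(i)
        covered += weights[i]
        if covered >= mass:
            break                      # remaining experts are skipped entirely
    score = sum(weights[i] * experts[i](x) for i in active) / covered
    return score, active
```

When the gate is confident, a single expert runs; when the gate is uncertain, more experts are consulted, which is one plausible way to trade computation against prediction quality.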

Abstract:
We present Triple Zero Path Planning (TZPP), a collaborative framework for heterogeneous multi-robot systems that requires zero training, zero prior knowledge, and zero simulation. TZPP employs a coordinator-explorer architecture: a humanoid robot handles task coordination, while a quadruped robot explores and identifies feasible paths using guidance from a multimodal large language model. We implement TZPP on Unitree G1 and Go2 robots and evaluate it across diverse indoor and outdoor environments, including obstacle-rich and landmark-sparse settings. Experiments show that TZPP achieves robust, human-comparable efficiency and strong adaptability to unseen scenarios. By eliminating reliance on training and simulation, TZPP offers a practical path toward real-world deployment of heterogeneous robot cooperation. Our code and video are provided at: https://github.com/triple-zeropp/Triple-zero-robot-agent

Abstract:
Optimal path planning is prone to convergence to local, rather than global, optima. This is often the case for mobile manipulators due to nonconvexities induced by obstacles, robot kinematics and constraints. This paper focuses on planning under end effector path constraints and attempts to circumvent the issue of converging to a local optimum. We propose a pipeline that first discovers multiple homotopically distinct paths, and then optimizes them to obtain multiple distinct local optima. The best out of these distinct local optima is likely to be close to the global optimum. We demonstrate that our pipeline is able to circumvent the problem of local optima and produces a final local optimum that is close to the global optimum.

Abstract:
Robotic perception of transparent objects presents unique challenges due to their refractive properties, lack of texture, and limitations of conventional RGB-D sensors in capturing reliable depth information. These challenges significantly hinder robotic manipulation capabilities in real-world settings such as household assistance, hospitality, and healthcare. To address these issues, we propose SPILL: A lightweight perception pipeline for Size, Pose, and Internal Liquid Level estimation of unknown transparent glassware using a single view. SPILL combines object detection with semantic keypoint detection, and operates without requiring object-specific 3D models or depth completion. We demonstrate its effectiveness in autonomous robotic pouring tasks. Additionally, to enhance the robustness and generalization of keypoint detection to diverse real-world scenarios, we introduce Glasses-in-the-Wild, a new dataset that captures a wide variety of glass types in realistic environments. Evaluated on a robot manipulator, SPILL achieves a 93.6% success rate across 500 autonomous pours with 20 unseen glasses in three diverse real-world scenes. We further demonstrate robustness through multiple live public events in real-world, human-centered environments. In one recorded session, the robot autonomously served 62 drinks with a 98.3% success rate. These results demonstrate that task-relevant keypoint detection enables scalable, real-world transparent object interaction, paving the way for practical applications in service and assistive robotics - without spilling a drop. Dataset and code will be released upon acceptance.

Abstract:
Ensuring survival and self-preservation is essential to design intelligent robots that adapt to dynamic and unfamiliar environments. Inspired by the dual-pathway model from neuroscience, we introduce a control architecture designed to ensure the adaptability of robotic behavior during navigation. This approach parallels the neuroscientific "Low Road" paradigm by incorporating constructs resembling the thalamus, implemented as a nonlinear filter; the amygdala, modeled as a Soft Actor-Critic (SAC) reinforcement learning agent; and the brainstem-cerebellum connection, represented by a Nonlinear Model Predictive Controller (NMPC). Our findings indicate superior adaptiveness, generalizability, and computational efficiency compared to standard NMPCs and Artificial Potential Fields in both static and dynamic environments with obstacles of varying risk levels.

Abstract:
When performing 3D manipulation tasks, robots have to execute action planning based on perceptions from multiple fixed cameras. The multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose the VERM (Virtual Eye for Efficient Robotic Manipulation) method, leveraging the knowledge in foundation models to imagine a virtual optimal view from the constructed 3D point cloud, which efficiently captures necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both simulation benchmark RLBench and real-world evaluations demonstrate the effectiveness of our method, surpassing previous state-of-the-art methods while achieving 1.89x speedup in training time and 1.54x speedup in inference speed.

Abstract:
This work presents a novel design for achieving multiple curvatures in a tendon-driven continuum robot (TDCR) system with only a single set of actuation tendons. The TDCR used in this work is assembled from multiple sub-sections made of low-melting-point alloy (LMPA), each of which has independent binary stiffness through localized thermal phase transitions. By exploiting these phase transitions, the robot can dynamically "lock" or "release" specific sub-sections, enabling multi-curvature configurations and independent curvature control with only one set of actuation tendons. This approach eliminates the need for additional segments or complex locking mechanisms, significantly reducing mechanical complexity and control challenges. Experimental validation confirms the system's ability to execute complex shapes (C/J/S-shape configurations), maintain structural rigidity in locked states, and achieve different spatial movements by changing the configuration of sub-sections. The SR-TDCR demonstrates potential for confined-space applications, merging dexterity with actuation efficiency.

Abstract:
Collision-free motion planning for redundant robot manipulators in complex environments is yet to be explored. Although recent advancements at the intersection of deep reinforcement learning (DRL) and robotics have highlighted its potential to handle versatile robotic tasks, current DRL-based collision-free motion planners for manipulators are highly costly, hindering their deployment and application. This is due to an overreliance on the minimum distance between the manipulator and obstacles, inadequate exploration and decision-making by DRL, and inefficient data acquisition and utilization. In this article, we propose URPlanner, a universal paradigm for collision-free robotic motion planning based on DRL. URPlanner offers several advantages over existing approaches: it is platform-agnostic, cost-effective in both training and deployment, and applicable to arbitrary manipulators without solving inverse kinematics. To achieve this, we first develop a parameterized task space and a universal obstacle avoidance reward that is independent of minimum distance. Second, we introduce an augmented policy exploration and evaluation algorithm that can be applied to various DRL algorithms to

Abstract:
Magnetically-driven surgical tools are a new class of millimetre-scale devices that could enable procedures such as minimally invasive neurosurgery due to their high dexterity at a small size. However, safe and effective control of these magnetic tools necessitates real-time observation of tool joint angles, which is challenging inside a surgical environment. Optical coherence tomography (OCT) is an emerging volumetric imaging technique offering 3D visualization of tissue and tools simultaneously, which we explore for joint angle estimation. While some previous studies have used OCT for estimating the pose of rigid instruments, those methods are specific to needle-like tools, and often have slow processing speed. In this work, we benchmark eight deep-learning models adapted from other 3D modalities to OCT data showing magnetic tools in a mock surgical environment. The models are tested in the presence of other objects, occlusion, noise, and the tool being partially outside of the OCT's field of view. The best performing model, VoxelNeXt, is adapted from 3D object detection in LiDAR scans, the first time a model of this kind is used on medical data. It infers tool pose with 0.6 mm position and 5° angular errors, with 40 ms inference time. We use this model to provide feedback for controlling a multi-jointed magnetic tool, demonstrating the robustness of OCT-based feedback control.

Abstract:
Vision-based tactile sensors enable high-resolution tactile perception by capturing image-based contact data. However, their utility in tactile localization is limited by their inherently small and local sensing area, as well as their dependence on distinct object surface features. We propose TacTape, a novel tactile fiducial system that enables accurate and efficient tactile localization by attaching textured tape to object surfaces. A lightweight algorithm allows real-time estimation of contact position and orientation from partially observed structured 3D textures. Experiments demonstrate that TacTape achieves sub-millimeter positional and sub-degree angular localization accuracy, and operates significantly faster than classic tactile mapping methods.

Abstract:
In this letter, we propose a novel upper limb rehabilitation framework based on dual-arm robotics for therapist-like traction training. Prioritizing patient safety, an 8-DOF kinematic model of the upper limb complex is derived to evaluate the reachable workspace of the end-of-arm and forearm during interaction with a dual-arm robot. Leveraging the characteristics of dual-arm rehabilitation, a non-redundant inverse kinematics method is proposed to constrain joint angles, thereby establishing a safety mechanism under dual constraints. Secondly, considering the training science and compliance, a potential field control strategy is introduced to enable the robot to learn the therapist's traction characteristics from a single demonstration. Combined with the master-slave control, it reproduces the therapist's assistance and allows for compliant interaction. Experimental results show that the proposed framework combines the strong adaptability and comfort of end-effector robots with the precise rehabilitation of exoskeleton robots. As dual-arm and humanoid robots become more widely adopted, the proposed scheme holds promise for delivering therapist-like safe, scientific, and compliant rehabilitation in clinical and home settings.

Abstract:
Current methods for formation flight primarily focus on maintaining formations, often neglecting the swarm's agility. Furthermore, most of these approaches fail to leverage global information from the swarm for obstacle avoidance, making them incapable of generating efficient and safe trajectories in large obstacle scenarios. To address these limitations, this letter proposes a novel swarm trajectory planning framework that utilizes a virtual core to control the swarm. We employ virtual core penalties and dynamic maximum speed allocation to strike a balance between swarm flexibility and formation keeping, allowing the drones to avoid obstacles more smoothly and safely while maintaining formation stability. For large obstacle avoidance, we design a collaborative large obstacle boundary search strategy and a global swarm planning method to enable the rapid and safe generation of drone trajectories. To validate the performance of the proposed methods, we develop a comprehensive set of experimental scenarios that include both simulations and real-world environments. The experimental results confirm the effectiveness of our approach.

Abstract:
This paper presents a novel Sliding Mode-Based Nonlinear Model Predictive Control (SM-NMPC) for controlling Unmanned Aerial Vehicles (UAVs) such as quadrotors and a 10-propeller drone (Cube-Drone). The proposed method combines Aggregated Hierarchical Sliding Mode Control (AHSMC) strategies with Nonlinear Model Predictive Control (NMPC), designed to operate on resource-constrained microcontrollers. First, an AHSMC that provides a virtual input reference is introduced to ensure the UAV's robustness, which is then leveraged by the NMPC to solve the optimization problem. A comprehensive comparison to existing approaches in terms of stability and computational efficiency demonstrates that the SM-NMPC framework excels, enabling quadrotor UAVs to accurately track reference trajectories even in the presence of a degraded motor. The proposed method also showcases the capability to implement robust optimal control on a microcontroller. Extensive experiments, both on real UAVs and their physical models in Gazebo/ROS2, are conducted to validate the effectiveness of the approach. A comparison to other state-of-the-art controllers further highlights the feasibility and superior performance of the proposed methodology. The open-source code has also been made available for further investigation.

Abstract:
In this work, we address the challenge of predicting human-applied force and velocity during collaborative object transportation over extended distances (58 m). We enhance state-of-the-art predictors by refining their input data processing, which significantly improves prediction accuracy. Furthermore, we extend the temporal prediction horizon from 1 s to 2 s without compromising performance, by introducing an extra environmental prediction module that conditions force and velocity estimations based on anticipated sensory input. This integration captures the contextual dependency of human behaviour during joint transport. Experimental evaluations, both on a dataset and in real-world settings, validate the effectiveness of our approach. Specifically, our best model achieves test-set success rates of up to 90.4% in predicting the human's exerted force and up to 93.0% in predicting the velocity of the human-robot pair during the next 2 s, and up to 87.1% and 91.3%, respectively, in real experiments.

Abstract:
Reachability analysis has become increasingly important in robotics to distinguish safe from unsafe states. Unfortunately, existing reachability and safety analysis methods often fall short, as they typically require known system dynamics or large datasets to estimate accurate system models, are computationally expensive, and assume full state information. A recent method, called MORALS, aims to address these shortcomings by using topological tools to estimate Regions of Attraction (ROA) in a low-dimensional latent space. However, MORALS still relies on full state knowledge and has not been studied when only sensor measurements are available. This paper presents Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space (V-MORALS). V-MORALS takes in a dataset of image-based trajectories of a system under a given controller, and learns a latent space for reachability analysis. Using this learned latent space, our method is able to generate well-defined Morse Graphs, from which we can compute ROAs for various systems and controllers. V-MORALS provides capabilities similar to the original MORALS architecture without relying on state knowledge, and using only high-level sensor data. Our project website is at: https://v-morals.onrender.com.

Abstract:
Reduction gearboxes play a key role in robotics actuation. Among the existing designs, cycloidal gears are gaining popularity for their efficiency, torque density, and robustness. The compact-cam architecture is a variant of the classical cycloidal drive that employs two rigidly coupled cycloidal disks to achieve high reduction ratios within a minimized radial profile. However, this design tends to suffer from low regularity, which can degrade performance in robotics applications. Building on the concept of the double disks with a phase offset to increase regularity, in this work, we present a novel Quadruple-disk CYcloidal Compact-cam (Q-CYC) reducer that applies the phase-offset principle to the compact-cam architecture. By incorporating two additional coupled disks, the proposed design enhances load distribution and motion regularity. Two open-source, 3D-printed prototypes (one implementing the conventional compact-cam transmission and one featuring the presented quadruple-disk architecture) are designed and experimentally evaluated. The analysis focuses on friction, gear play, backdrivability, and speed regularity, demonstrating that the quadruple-disk design offers significant improvements. Therefore, the results validate the effectiveness of the proposed approach in addressing known performance limitations of cycloidal compact-cam reducers, reducing gear play and improving both speed regularity and backdrivability.

Abstract:
Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that naive visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing naive and gated fusion baselines and closely matching the privileged wrist + contact force configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.

Abstract:
Driver distraction recognition (DDR) degrades under deployment-time shifts in camera/ISP pipelines and illumination. We frame this as a single-source domain generalization (SSDG) problem: training on one labeled source domain and testing on unseen devices and lighting. Motivated by this, we propose Low-Frequency Generative Augmentation (LFGA), which separates each image into a fixed high-frequency structure and a re-renderable low-frequency base. Multi-stage, feature-conditioned generators perturb only the photometric low-frequency content and recombine it with the original high-frequency structure to yield "hard-but-correct" views to teach the model photometric invariances. Training imposes decision consistency via cross-entropy and logit matching, and promotes stage-wise separation along class-agnostic factors with a feature-dissimilarity loss. Generators are training-only. On two DDR benchmarks with synthetic cross-photometric shifts and a zero-shot real cross-device video test, LFGA improves cross-domain performance over strong SSDG and DDR baselines while preserving in-domain accuracy.
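
The low-frequency/high-frequency split described in this abstract can be illustrated with a minimal sketch. It is not the LFGA implementation: the learned, feature-conditioned generators are replaced here by a hypothetical gain-and-bias perturbation, a 1D signal stands in for an image, and a clamped moving average stands in for the low-pass filter. All function names and parameters are assumptions made for illustration.

```python
def box_blur(signal, radius=2):
    """Moving average with edge clamping: a crude stand-in for a low-pass filter."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(n - 1, max(0, j))]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def split_frequencies(signal, radius=2):
    """Return (low, high), where low is the blurred base and high is the residual."""
    low = box_blur(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def augment(signal, gain=1.0, bias=0.0, radius=2):
    """Perturb only the low-frequency base, then add back the untouched residual.

    The gain/bias perturbation is a hypothetical placeholder for a learned generator.
    """
    low, high = split_frequencies(signal, radius)
    new_low = [gain * l + bias for l in low]
    return [nl + h for nl, h in zip(new_low, high)]


signal = [0.0, 1.0, 0.0, 1.0, 4.0, 4.0, 5.0, 4.0]
darker = augment(signal, gain=0.5, bias=0.1)  # photometric change, same fine structure
```

With `gain=1.0` and `bias=0.0` the signal is reconstructed unchanged, which is a quick check that the split loses no information.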

Abstract:
Robot swarms require cohesive collective behaviour to address diverse challenges, including shape formation and decision-making. Existing approaches often treat consensus in discrete and continuous decision spaces as distinct problems. We present DANCeRS, a unified, distributed algorithm leveraging Gaussian Belief Propagation (GBP) to achieve consensus in both domains. By representing a swarm as a factor graph, our method ensures scalability and robustness in dynamic environments, relying on purely peer-to-peer message passing. We demonstrate the effectiveness of our general framework through two applications where agents in a swarm must achieve consensus on global behaviour whilst relying on local communication. In the first, robots must perform path planning and collision avoidance to create shape formations. In the second, we show how the same framework can be used by a group of robots to form a consensus over a set of discrete decisions. Experimental results highlight our method's scalability and efficiency compared to recent approaches to these problems, making it a promising solution for multi-robot systems requiring distributed consensus. We encourage the reader to see the supplementary video demo.

Abstract:
Control of soft robots is considered one of the key elements in achieving their intelligence. However, it faces challenging problems such as nonlinear dynamics, highly deformable structures, and operation in unpredictable situations. Numerous methods have been proposed to overcome these challenges, but most of them focus on controlling only a small part of the soft robot's body, such as the end effector. Whole-body shape control is a problem that has not yet been fully explored, but it is critical for tasks that require whole-body path planning to navigate in confined or crowded spaces. In this study, we developed a convolutional neural network (CNN)-based approach for controlling the robot's whole-body shape. The key novelty of our approach is that it learns a purely image-driven CNN control policy with online adaptive capability. Our approach has three main components: (1) training an offline shape policy to offer basic actions, (2) building a shape model and updating it online to maintain accuracy, (3) conducting Bayesian optimization based on the basic action and shape model to obtain optimal performance. The presented approach is validated on a soft robotic arm and experimental results demonstrate that the soft arm can be controlled to achieve target shapes and adapt to different previously unknown situations. Meanwhile, our approach achieved better shape control performance than the state-of-the-art method. Overall, this work presents a feasible learning-based approach to the whole-body shape control problem and contributes to the development of soft robot intelligence from the control perspective.

Abstract:
Current End-to-End Autonomous Driving (E2E-AD) methods resort to unifying modular designs for various tasks (e.g. perception, prediction and planning). Although optimized with a fully differentiable framework in a planning-oriented manner, existing end-to-end driving systems lacking ego-centric designs still suffer from unsatisfactory performance and inferior efficiency, due to rasterized scene representation learning and redundant information transmission. In this paper, we propose an ego-centric fully sparse paradigm, named EgoFSD, for end-to-end self-driving. Specifically, EgoFSD consists of a sparse perception module, a hierarchical interaction module, and an iterative motion planner. The sparse perception module performs detection and online mapping based on sparse representation of the driving scene. The hierarchical interaction module aims to select the Closest In-Path Vehicle / Stationary (CIPV / CIPS) from coarse to fine, benefiting from an additional geometric prior. As for the iterative motion planner, both selected interactive agents and ego-vehicle are considered for joint motion prediction, where the output multi-modal ego-trajectories are optimized in an iterative fashion. In addition, position-level motion diffusion and trajectory-level planning denoising are introduced for uncertainty modeling, thereby enhancing the training stability and convergence speed. Extensive experiments on the nuScenes and Bench2Drive datasets show that EgoFSD reduces the average L2 error by 59% and the collision rate by 92% relative to UniAD while running 6.9X faster.

Abstract:
Echocardiography is a key imaging modality for cardiac assessment but remains highly operator-dependent, and access to trained sonographers is limited in underserved settings. Teleoperated robotic echocardiography has been proposed as a solution. However, clinical studies report longer examination times than manual procedures, increasing diagnostic delays and operator workload. Automating non-expert tasks, such as automatically moving the probe to an ideal starting pose, offers a pathway to reduce this burden. Prior vision- and depth-based approaches to estimate an initial probe pose are sensitive to lighting, texture, and anatomical variability. We propose a robot-mounted 2D LiDAR-based approach that reconstructs the chest surface in 3D and estimates the initial probe pose automatically. To the best of our knowledge, this is the first demonstration of robot-mounted 2D LiDAR used for 3D reconstruction of a human body surface. Through plane-based extrinsic calibration, the transformation between the LiDAR and robot base frames was estimated with an overall root mean square (RMS) residual of 1.82 mm and rotational uncertainty below 0.2°. The chest front surface, reconstructed from two linear LiDAR sweeps, was aligned with scale-augmented rigid registration to identify an initial probe pose. A mannequin-based study assessing reconstruction accuracy showed mean surface errors of 2.78 ± 0.21 mm. Human trials (N=5) evaluating the proposed approach found probe initial points typically 20-30 mm from the clinically defined initial point, while the variation across repeated trials on the same subject was less than 4 mm.

Abstract:
Control barrier functions (CBFs) are used in safety-critical control strategies, implementing a modification of a nominal control action to achieve invariance of a subset of the state space representing safe operating conditions. In this paper we perform a comparative study involving existing safety-critical CBF designs, including energy-based CBFs and Exponential CBFs. The analysis, performed both theoretically and on a benchmark obstacle avoidance task, provides insights into how these CBFs affect energy transfers and the overall performance of the closed-loop system, highlighting benefits and limitations of each approach. To validate our analysis, we conduct software simulations on a 3R planar robot and a 7-DoF robotic manipulator, complemented by experimental evaluations on a physical robotic platform.
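
The idea of minimally modifying a nominal control action to keep a safe set invariant can be sketched as follows. This is a basic first-order CBF filter with a single constraint, solved in closed form. It is not one of the energy-based or Exponential CBF designs compared in the paper, and the 1D single-integrator system, the function names, and the gain `alpha` are assumptions chosen for illustration.

```python
def cbf_safety_filter(u_nom, h, lf_h, lg_h, alpha=1.0):
    """Minimally modify u_nom so that lf_h + lg_h . u + alpha * h >= 0.

    Closed-form solution of the single-constraint quadratic program
        min ||u - u_nom||^2   s.t.   lf_h + lg_h . u + alpha * h >= 0,
    where lf_h and lg_h are the Lie derivatives of the barrier function h.
    """
    slack = lf_h + sum(g * u for g, u in zip(lg_h, u_nom)) + alpha * h
    if slack >= 0.0:
        return list(u_nom)  # nominal action already satisfies the constraint
    norm_sq = sum(g * g for g in lg_h)
    if norm_sq == 0.0:
        raise ValueError("the input cannot influence the constraint")
    scale = -slack / norm_sq  # smallest correction along lg_h
    return [u + scale * g for u, g in zip(u_nom, lg_h)]


def simulate(x0=0.0, x_obs=1.0, u_des=1.0, dt=0.01, steps=500, alpha=2.0):
    """1D single integrator x' = u driven towards an obstacle at x_obs."""
    x = x0
    for _ in range(steps):
        h = x_obs - x  # safe set: h(x) >= 0, so dh/dt = -u
        u = cbf_safety_filter([u_des], h, lf_h=0.0, lg_h=[-1.0], alpha=alpha)[0]
        x += dt * u
    return x


final_x = simulate()  # approaches the obstacle but stays on the safe side
```

In this toy setting the filter leaves the nominal command untouched far from the obstacle and scales it down to `alpha * h` once the constraint becomes active, so the state converges towards the boundary without crossing it.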

Abstract:
Data scaling has long remained a critical bottleneck in robot learning. For humanoid robots, human videos and motion data are abundant and widely available, offering a free and large-scale data source. Besides, the semantics related to the motions enable modality alignment and high-level robot control learning. However, how to effectively mine raw video, extract robot-learnable representations, and leverage them for scalable learning remains an open problem. To address this, we introduce Humanoid-Union, a large-scale dataset generated through an autonomous pipeline, comprising over 260 hours of diverse, high-quality humanoid robot motion data with semantic annotations derived from human motion videos. The dataset can be further expanded via the same pipeline. Building on this data resource, we propose SCHUR, a scalable learning framework designed to explore the impact of large-scale data on high-level control in humanoid robots. Experimental results demonstrate that SCHUR achieves high robot motion generation quality and strong text-motion alignment under data and model scaling, with a 37% reconstruction improvement under MPJPE and a 25% alignment improvement under FID compared with previous methods. Its effectiveness is further validated through deployment on a real-world humanoid robot.

Abstract:
Robot-assisted endovascular intervention offers a safe and effective solution for remote catheter manipulation, reducing radiation exposure while enabling precise navigation. Reinforcement learning (RL) has recently emerged as a promising approach for autonomous catheter steering; however, conventional methods suffer from sparse reward design and reliance on static vascular models, limiting their sample efficiency and generalization to intraoperative variations. To overcome these challenges, this paper introduces a sample-efficient RL framework with online expert correction for autonomous catheter steering in endovascular bifurcation navigation. The proposed framework integrates three key components: (1) a segmentation-based pose estimation module for accurate real-time state feedback, (2) a fuzzy controller for bifurcation-aware orientation adjustment, and (3) a structured reward generator incorporating expert priors to guide policy learning. By leveraging online expert correction, the framework reduces exploration inefficiency and enhances policy robustness in complex vascular structures. Experimental validation on a robotic platform using a transparent vascular phantom demonstrates that the proposed approach achieves convergence in 123 training episodes, a 25.9% reduction compared to the baseline Soft Actor-Critic (SAC) algorithm, while reducing average positional error to 83.8% of the baseline. These results indicate that combining sample-efficient RL with online expert correction enables reliable and accurate catheter steering, particularly in anatomically challenging bifurcation scenarios critical for endovascular navigation.

Abstract:
In robot-assisted spinal endoscopy, intraoperative imaging is frequently degraded by bleeding, irrigation fluids, bubbles, smoke, and uneven illumination, which can severely compromise surgical precision, safety, and decision-making. Accurate identification of anatomical structures is particularly critical in spinal procedures, yet acquiring paired clean and degraded images in real clinical settings is infeasible. To address this challenge, we propose DCP-Net, an unpaired endoscopic image restoration framework tailored for robotic spinal surgery. DCP-Net integrates Diffusion-Prior Contrastive Learning (DPCL) to leverage generative priors and contrastive objectives for robust latent representations, and Physics-Informed Constraints (PIC) to ensure anatomically consistent restoration. Furthermore, we introduce Diffusion-Prior Uncertainty Estimation (DPUE), providing pixel-wise confidence maps that quantify restoration reliability and guide risk-aware robotic perception. We further constructed a dataset comprising 21,845 paired/unpaired samples of intraoperative visual degradations in spinal endoscopy, primarily involving bleeding, bubbles, and other artifacts. Extensive experiments show that DCP-Net outperforms existing methods in both quantitative metrics and perceptual quality, significantly improving visual clarity and supporting various robotic navigation tasks. Among these tasks, accurate bleeding point detection plays a particularly critical role in ensuring safe and precise navigation in clinical practice.

Abstract:
High-precision liquid droplet manipulation is widely used in life science, biomedical engineering, and industry. However, a robotic approach that supports direct in-petri dish manipulation of liquid droplets without additives (such as magnetic beads) or other setup requirements (such as hydrophobic or conductive surfaces) is still lacking. From the robotics perspective, this challenge concerns designing an adequate robot end-effector capable of firmly holding and effectively moving a droplet on hydrophilic surfaces. In this paper, we propose an automated robotic system for direct in-petri dish liquid droplet manipulation based on an ultrasonic phased transducer array (UPTA) and a microscope, which can interact with users to selectively grasp a droplet, follow user-designated trajectories to transport it, and position it with high precision. The core working mechanism of the proposed system is to precisely generate an inclined single focal point, acting as a noncontact end-effector, inside the droplet under the guidance of the microscope, which can induce sufficient hydrodynamic actuation forces on the peripheral contact line of the droplet to keep the droplet moving along with the ultrasonic end-effector. Since it is additive-free, our system is inherently compatible with medical, chemical, and industrial protocols. Details regarding the system design and implementation, the ultrasound focusing strategy, and the visual servo control scheme are elaborated in this paper. Experiments validated the effectiveness of the proposed system.

Abstract:
While model-based controllers have demonstrated remarkable performance in autonomous drone racing, their performance is often constrained by the reliance on pre-computed reference trajectories. Conventional approaches, such as trajectory tracking, demand a dynamically feasible, full-state reference, whereas contouring control relaxes this requirement to a geometric path but still necessitates a reference. Recent advancements in reinforcement learning (RL) have revealed that many model-based controllers optimize surrogate objectives, such as trajectory tracking, rather than the primary racing goal of directly maximizing progress through gates. Inspired by these findings, this work introduces a reference-free method for time-optimal racing by incorporating this gate progress objective, derived from RL reward shaping, directly into the Model Predictive Path Integral (MPPI) formulation, which only depends on waypoint positions. The sampling-based nature of MPPI makes it uniquely capable of optimizing the discontinuous and non-differentiable objective in real-time. We also establish an empirical testbed that leverages MPPI to systematically and fairly compare three distinct objective functions with a consistent dynamics model and parameter set: classical trajectory tracking, contouring control, and the proposed gate progress objective. We compare the performance of these three objectives when solved via both MPPI and a traditional gradient-based solver. Our results demonstrate that the proposed reference-free approach achieves competitive racing performance, rivaling or exceeding reference-based methods.
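
A reference-free gate progress objective inside a sampling-based MPPI loop can be sketched as below. This is only an illustration of the general idea under strong simplifications: a 2D single-integrator point stands in for the drone, and the gate positions, gate radius, noise level, temperature `lam`, and all function names are hypothetical. The cost is the negative of the distance reduced towards the current gate, with a discontinuous switch to the next gate once the current one is reached.

```python
import math
import random

GATES = [(2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]  # hypothetical waypoint positions
GATE_RADIUS = 0.3  # a gate counts as passed inside this distance (assumption)
DT, V_MAX = 0.1, 1.0


def clip(v):
    return max(-V_MAX, min(V_MAX, v))


def rollout(pos, gate_idx, controls):
    """Simulate velocity commands; return (gate progress, final pos, gate index)."""
    progress = 0.0
    for ux, uy in controls:
        if gate_idx >= len(GATES):
            break
        gate = GATES[gate_idx]
        d_prev = math.dist(pos, gate)
        pos = (pos[0] + DT * clip(ux), pos[1] + DT * clip(uy))
        progress += d_prev - math.dist(pos, gate)  # reward: distance reduced
        if math.dist(pos, gate) < GATE_RADIUS:
            gate_idx += 1  # discontinuous switch to the next gate
    return progress, pos, gate_idx


def mppi_step(pos, gate_idx, nominal, rng, samples=200, sigma=0.5, lam=0.05):
    """One MPPI update of the nominal control sequence (cost = -progress)."""
    noises, costs = [], []
    for _ in range(samples):
        eps = [(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in nominal]
        ctrl = [(u[0] + e[0], u[1] + e[1]) for u, e in zip(nominal, eps)]
        noises.append(eps)
        costs.append(-rollout(pos, gate_idx, ctrl)[0])
    best = min(costs)
    weights = [math.exp(-(c - best) / lam) for c in costs]  # softmin weighting
    total = sum(weights)
    updated = []
    for t, (ux, uy) in enumerate(nominal):
        dx = sum(w * n[t][0] for w, n in zip(weights, noises)) / total
        dy = sum(w * n[t][1] for w, n in zip(weights, noises)) / total
        updated.append((clip(ux + dx), clip(uy + dy)))
    return updated


def race(horizon=10, max_steps=500, seed=0):
    """Run receding-horizon MPPI; return (gates passed, steps used)."""
    rng = random.Random(seed)
    pos, gate_idx = (0.0, 0.0), 0
    nominal = [(0.0, 0.0)] * horizon
    for step in range(max_steps):
        if gate_idx >= len(GATES):
            return gate_idx, step
        nominal = mppi_step(pos, gate_idx, nominal, rng)
        _, pos, gate_idx = rollout(pos, gate_idx, nominal[:1])  # apply first command
        nominal = nominal[1:] + [nominal[-1]]  # warm start: shift the plan
    return gate_idx, max_steps
```

Because the update only needs cost evaluations of sampled rollouts, the gate switch inside `rollout` requires no gradient, which is the property that makes a sampling-based solver a natural fit for this kind of objective.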

Abstract:
For effective deployment in real-world environments, humanoid robots must autonomously navigate a diverse range of complex terrains with abrupt transitions. While the vanilla mixture of experts (MoE) framework is theoretically capable of modeling diverse terrain features, in practice, the gating network exhibits nearly uniform expert activations across different terrains, weakening the expert specialization and limiting the model's expressive power. To address this limitation, we introduce CMoE, a novel single-stage reinforcement learning framework that integrates contrastive learning to refine expert activation distributions. By imposing contrastive constraints, CMoE maximizes the consistency of expert activations within the same terrain while minimizing their similarity across different terrains, thereby encouraging experts to specialize in distinct terrain types. We validated our approach on the Unitree G1 humanoid robot through a series of challenging experiments. Results demonstrate that CMoE enables the robot to traverse continuous steps up to 20 cm high and gaps up to 80 cm wide, while achieving robust and natural gait across diverse mixed terrains, surpassing the limits of existing methods. To support further research and foster community development, we will release our code publicly.

Abstract:
Lower-limb exoskeletons have the potential to enhance mobility and reduce the metabolic cost of walking, while conventional control strategies often lack adaptability and require labor-intensive tuning. Recent advances in reinforcement learning (RL) provide new opportunities for generating efficient and personalized assistance. In this study, we propose a predictive simulation framework that integrates a reflex-based musculoskeletal walking model with a hip exoskeleton controller trained using Proximal Policy Optimization (PPO) with a Long Short-Term Memory (LSTM) actor network. The reflex-based model reproduces realistic gait kinematics without relying on experimental motion data, while the LSTM-PPO controller learns to map kinematic states directly to assistive torques. Domain randomization was applied during training to enhance robustness and facilitate sim-to-real transfer. The learned controller was deployed onto a physical hip exoskeleton and evaluated in human subject experiments. Results showed that the LSTM-PPO controller reduced the metabolic cost of walking by an average of 9.1%. These findings highlight the potential of predictive simulation and deep RL for developing intelligent, experiment-free exoskeleton controllers that improve walking efficiency and robustness in real-world conditions.

Abstract:
Designing controllers that are both safe and performant is inherently challenging. This co-optimization can be formulated as a constrained optimal control problem, where the cost function represents the performance criterion and safety is specified as a constraint. While sampling-based methods, such as Model Predictive Path Integral (MPPI) control, have shown great promise in tackling complex optimal control problems, they often struggle to enforce safety constraints. To address this limitation, we propose DualGuard-MPPI, a novel framework for solving safety-constrained optimal control problems. Our approach integrates Hamilton-Jacobi reachability analysis within the MPPI sampling process to ensure that all generated samples are provably safe for the system. On the one hand, this integration allows DualGuard-MPPI to enforce strict safety constraints; at the same time, it facilitates a more effective exploration of the environment with the same number of samples, reducing the effective sampling variance and leading to better performance optimization. Through several simulations and hardware experiments, we demonstrate that the proposed approach achieves much higher performance compared to existing MPPI methods, without compromising safety.

Abstract:
3D reconstruction is a fundamental task in robotics that has gained attention due to its major impact in a wide variety of practical settings, including agriculture, underwater, and urban environments. While this task can be carried out using a large number of arbitrarily taken 2D images, their processing may become laborious, time-consuming, and in some instances may not provide the necessary information about the object of interest. An efficient alternative is the so-called view planning (VP), which aims to optimally place a certain number of cameras in positions that maximize the visual information. Nonetheless, in most real-world settings, existing environmental noise can significantly affect the performance of 3D reconstruction. To that end, this work advocates a novel geometric-based reconstruction quality function for VP, that accounts for the existing noise of the environment, without requiring its closed-form expression. With no analytic expression of the objective function, this work puts forth an adaptive Bayesian optimization algorithm for accurate 3D reconstruction in the presence of noise. Numerical tests on simulated and real noisy agricultural environments showcase the merits of the proposed VP approach for efficient 3D reconstruction with even a small number of available cameras.

Abstract:
Gas pipelines damaged by aging or earthquakes need a robotic system that can quickly inspect 50-mm-diameter service lines, consisting of horizontal and vertical pipes connected by elbow joints, tees, or sockets, from the inside. However, conventional pipe inspection robots do not target 50-mm pipes and various pipe types or lack sufficient speed and a reverse function. Thus, this study develops an in-pipe inspection robot, SPIRA, capable of traveling through 50-mm pipelines while meeting the above requirements. SPIRA has three wheels inclined at 30 degrees to the horizontal, arranged around a cylinder. The cylinder is rotated by a motor around the robot's axis, and the wheels move in a spiral motion while pushing the pipe wall, enabling the robot to move stably and quickly. To overcome socket steps of up to 3 mm and diameter changes in elbow joints, the cylinder and each wheel are connected by a leg with a spring pantograph mechanism. SPIRA has two traveling units linked by a bending joint with a servomotor. When the front and rear parts rotate clockwise (counterclockwise), SPIRA moves forward (backward). When the two units rotate in opposite directions, only the central part of SPIRA rotates on the spot, changing the bending direction so that SPIRA can select any travel direction in a tee. We evaluated the travel performance of SPIRA in pipelines comprising horizontal and vertical pipes, elbow joints, tees, and sockets, and found that it could travel smoothly through them, which is difficult for conventional robot systems to achieve.

Abstract:
This study tackles robotic picking of multi-part deformable objects--common in warehouses yet underexplored in the literature--such as cable-attached appliances and pouch drinks, which comprise both rigid and deformable components. Their deformability poses a challenge to model-based 6D pose estimators, such as FoundationPose, that assume rigid bodies. To address this, we present PartPose, which estimates the 6D pose of the multi-part deformable objects by focusing on the rigid components. PartPose uses Bayesian optimization to select an appropriate region of interest (ROI) and then estimates its pose with a render-and-compare pipeline. We evaluate pose-estimation and picking success rates on nine multi-part deformable objects, counting a pose estimate as successful if the translational error is <30 mm and the rotational error is <0.3 radians. PartPose significantly outperforms a FoundationPose baseline, achieving success rates of 98.2% (translational), 96.4% (rotational), and 87.2% (picking), versus 47.9%, 35.9%, and 22.8%, respectively. Moreover, PartPose generalizes category-level semantic knowledge to new instances within the same category without performance degradation when those instances have semantically similar components. This capability is crucial for large logistics centers that handle diverse and novel objects.
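
The success criterion above can be made concrete with a small sketch. The abstract does not say how the rotational error is measured, so the geodesic angle between the estimated and ground-truth rotation matrices is assumed here. All function names are hypothetical and are not taken from PartPose.

```python
import math

TRANSLATION_THRESHOLD_MM = 30.0  # threshold stated in the abstract
ROTATION_THRESHOLD_RAD = 0.3     # threshold stated in the abstract


def translation_error_mm(t_est, t_gt):
    """Euclidean distance between two 3D positions given in millimetres."""
    return math.dist(t_est, t_gt)


def rotation_error_rad(R_est, R_gt):
    """Geodesic angle between two 3x3 rotation matrices (assumed metric)."""
    # trace(R_est^T R_gt) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def pose_success(t_est, R_est, t_gt, R_gt):
    """Return (translation_ok, rotation_ok) under the stated thresholds."""
    return (
        translation_error_mm(t_est, t_gt) < TRANSLATION_THRESHOLD_MM,
        rotation_error_rad(R_est, R_gt) < ROTATION_THRESHOLD_RAD,
    )


def rot_z(theta):
    """Rotation matrix for a rotation of theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = rot_z(0.0)
# A 10 mm, 0.2 rad error passes both checks; a 40 mm, 0.5 rad error fails both.
print(pose_success([10.0, 0.0, 0.0], rot_z(0.2), [0.0, 0.0, 0.0], identity))
print(pose_success([40.0, 0.0, 0.0], rot_z(0.5), [0.0, 0.0, 0.0], identity))
```

Returning the two checks separately mirrors the abstract, which reports translational and rotational success rates as separate figures.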

Abstract:
Hand-eye calibration is a fundamental task in robotics, requiring high precision to ensure accurate manipulation. This is especially crucial for recent markerless methods, which depend on precise pose estimation for effective end-effector calibration. In this paper, we propose a novel approach that improves calibration performance by adjusting the end-effector's pose to reduce prediction error. Our method utilizes a reward structure derived from trained pose estimation networks, enabling a Soft Actor-Critic-Discrete agent to learn in a simulated environment how to enhance calibration performance through action selection. Our experiments show that calibration results achieved with our method outperform those from initial poses alone in both markerless and marker-based methods. Real-world experiments further validate the efficacy of our approach in actual robotic systems. These results demonstrate that our proposed method effectively enhances the performance of pose estimation-based hand-eye calibration.

Abstract:
Exploration of extraterrestrial surfaces, such as the lunar surface, can prove treacherous for humans and robots alike, and requires highly specialized mobility platforms to ensure the success of a mission and the safety of any operators. However, these specialized machines may limit the overall scope of a mission by limiting performance outside a particular environment. Thus, for maximum capabilities, a team of distinct but complementary specialized robots and vehicles may be used to expand mission capabilities in lunar environments. In this paper, we introduce a concept of operations for exploring a lunar crater through collaboration between a wheeled rover, represented by the RAD Exploration Vehicle (REV), and a non-traditional spherical robot, represented by RoboBall II. These robots are used as an analog for mission-capable robots such as NASA's Chariot rover and the larger RoboBall III. Design of these robots, along with collaborative features and intended operational environments, is discussed. A controller for RoboBall to attempt controlled descent on slopes is presented. Further, a ballistic sample return module for collection and ex situ analysis of a sample from the bottom of a lunar crater, along with potential navigational mechanisms to facilitate efficient recovery, is presented. Finally, a mission analog using RoboBall III and the ballistic sample return conducted in a former quarry is demonstrated.

Abstract:
Multimodal fusion has important research value in environmental perception for autonomous driving. Among fusion approaches, BEVFusion has become one of the mainstream frameworks for LiDAR-camera fusion by unifying multimodal features in the bird's-eye view (BEV) space. However, its performance is limited by inefficient cross-modal interaction and information loss during BEV projection, especially for dynamic objects and edge cases. To address these limitations, we propose AttBEV, an advanced fusion architecture that introduces a Convolutional Block Attention Module (CBAM) at the feature fusion layer: a lightweight attention mechanism that improves the model's ability to capture key information through dynamic feature calibration of channel and spatial dimensions. Extensive experiments on the nuScenes dataset demonstrate that AttBEV achieves superior performance compared to BEVFusion on most evaluation metrics. NDS reaches 0.6795, which is 2.63 percentage points higher than BEVFusion's 0.6532, and mAP reaches 0.6426, which is 1.79 percentage points higher than BEVFusion's 0.6247. In general, AttBEV outperforms existing methods in both model accuracy and generalization ability and significantly improves the performance of 3D object detection in autonomous driving scenarios.

Abstract:
We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In addition, they primarily leverage vision and language foundation models, focusing less on action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A2, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance for each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions. Videos and code are available at https://xukechun.github.io/papers/A2.

Abstract:
This article presents ROGIBOT, an autonomous robot designed for high-speed unloading inside standard containers. Previous unloading robots have typically been limited to handling only box-shaped packages and a small number of items at a time. These limitations hinder their applicability in general logistics environments, which involve a wide variety of package types and require high throughput. To address these challenges, we propose a novel dual-arm robot equipped with multi-functional end effectors, including a vacuum array, a two-finger gripper, and an on-hand sliding surface. The components are designed to handle the most common types of logistics packages, such as boxes, plastic pouches, and bundle sacks. In particular, the on-hand sliding surface can draw packages efficiently using the frictional force generated by a rotating elastic conveyor belt. This enables the robot to unload multiple packages consecutively at high speed by sweeping the end effectors across the pile. In addition to the hardware, we introduce a package recognition method and a finite-state-machine-based task planner that detects logistics packages and selects actions to maximize operational efficiency. In our evaluation, ROGIBOT was tested in a mock container and achieved a throughput of 2,030 pieces per hour, surpassing both state-of-the-art robots and typical human performance.

Abstract:
Precise control of soft pneumatic actuators is impeded by significant nonlinearities, particularly large internal volume variations during actuation---a factor often overlooked in conventional modeling. This paper proposes an adaptive robust control (ARC) framework designed for high-performance, energy-efficient control of soft actuators with non-negligible volume dynamics. The framework integrates a Modified Prandtl-Ishlinskii (MPI) model for hysteresis compensation with a real-time volume estimator using an internal Time-of-Flight (ToF) sensor. The ARC law then systematically handles uncertainties from both valve parameter variations and the volume estimation process. Experimental validation, through direct comparison with a conventional fixed-volume model, demonstrates that this volume-aware approach achieves robust trajectory tracking with significantly reduced control effort and energy consumption. This work establishes that explicitly modeling internal volume dynamics is crucial for developing high-performance control systems for a broad class of soft pneumatic actuators.
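
As background for the hysteresis compensation mentioned above, the following is a minimal sketch of the classical Prandtl-Ishlinskii model, which expresses hysteresis as a weighted sum of play (backlash) operators. It is not the paper's modified variant. The thresholds, weights, and input sequence are hypothetical values chosen only for illustration.

```python
def play_operator(inputs, r, z0=0.0):
    """Play (backlash) operator with threshold r applied to an input sequence."""
    outputs = []
    z = z0
    for u in inputs:
        # The state holds its value while |u - z| <= r and is dragged along
        # by the input once the input leaves that band.
        z = max(u - r, min(u + r, z))
        outputs.append(z)
    return outputs


def prandtl_ishlinskii(inputs, thresholds, weights):
    """Classical PI hysteresis output: weighted sum of play operators."""
    per_operator = [play_operator(inputs, r) for r in thresholds]
    return [
        sum(w * z[k] for w, z in zip(weights, per_operator))
        for k in range(len(inputs))
    ]


# Hypothetical thresholds and weights, chosen only for illustration.
thresholds = [0.0, 1.0]
weights = [0.5, 0.5]
command = [0.0, 2.0, 0.0]  # rise, then return to the starting value
print(prandtl_ishlinskii(command, thresholds, weights))  # [0.0, 1.5, 0.5]
```

The output does not return to zero when the input does, which is the memory effect a hysteresis compensator has to invert.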

Abstract:
Control research on humanoid robots requires environments where control strategies can be designed, tuned, debugged, validated, and transferred to real hardware with minimal friction. For commercial platforms such as the Unitree G1, this process is often fragmented across separate tools for modeling, simulation, communication, visualization, and deployment, slowing down controller development and experimental iteration. This work presents a MATLAB/Simulink-based Sim2Real framework for the Unitree G1 that integrates MuJoCo and ROS 2 into a unified workflow for control-oriented research. The framework is organized around a modular Variant Subsystem that switches between a MuJoCo simulation backend and a ROS 2 real-robot backend while preserving compatible interfaces, enabling reuse of high-level control logic, monitoring, and system instrumentation across both domains. This is especially valuable for robotics control, where rapid prototyping, closed-loop debugging, structured block based design, and repeatable validation are critical for moving from controller concept to hardware testing. As a representative example, a standard ankle motion command task was validated both in simulation and on the real robot. The proposed framework establishes a practical model-based environment for implementing and evaluating control strategies on the Unitree G1, while providing an extensible basis for future estimation, stabilization, perception, and humanoid control modules.

Abstract:
Flow matching has emerged as a competitive framework for learning high-quality generative policies in robotics; however, we find that generalisation arises and saturates early along the flow trajectory, in accordance with recent findings in the literature. We further observe that increasing the number of Euler integration steps during inference counter-intuitively and universally degrades policy performance. We attribute this to (i) additional, uniformly spaced integration steps oversampling the late-time region, thereby constraining actions towards the training trajectories and reducing generalisation; and (ii) the learned velocity field becoming non-Lipschitz as integration time approaches 1, causing instability. To address these issues, we propose a novel policy that utilises non-uniform time scheduling (e.g., U-shaped) during training, which emphasises both early and late temporal stages to regularise policy training, and a dense-jump integration schedule at inference, which replaces the multi-step integration beyond a jump point with a single integration step to avoid the unstable region around 1. Essentially, our policy is an efficient one-step learner that still pushes forward performance through multi-step integration, yielding up to 23.7% performance gains over state-of-the-art baselines across diverse robotic tasks. Code is anonymously open-sourced at https://github.com/DenseJumpFM/DenseJump_FlowMatching
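
The inference-time idea can be sketched as follows, under stated assumptions: the schedule places uniform Euler steps on [0, t_jump] and then takes one step from t_jump to 1. The function names, the jump point of 0.8, and the toy velocity field are illustrative assumptions, not the authors' implementation.

```python
def dense_jump_schedule(n_dense, t_jump):
    """Uniform steps on [0, t_jump], then a single jump from t_jump to 1."""
    dense = [t_jump * k / n_dense for k in range(n_dense + 1)]
    return dense + [1.0]


def euler_integrate(velocity, x0, times):
    """Explicit Euler integration of dx/dt = velocity(x, t) along `times`."""
    x = x0
    for t, t_next in zip(times, times[1:]):
        x = x + (t_next - t) * velocity(x, t)
    return x


# Toy velocity field (an assumption for illustration): the conditional field
# of a straight-line flow towards a single target action. Its sensitivity to
# x, 1 / (1 - t), grows without bound as t approaches 1.
TARGET = 2.0


def toy_velocity(x, t):
    return (TARGET - x) / (1.0 - t)


times = dense_jump_schedule(n_dense=4, t_jump=0.8)
action = euler_integrate(toy_velocity, x0=-1.0, times=times)
print(times, action)
```

Because the field is only evaluated at times up to the jump point, the integrator never queries it in the region where it is least well behaved.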

Abstract:
Exoskeleton robots promise to enhance safety by supporting workers' back strength during heavy lifting tasks, thereby improving work efficiency and productivity. However, the components of these robots, such as exoskeletal structures, actuators, and batteries, often increase their size and weight, which can reduce wearability and mobility. To tackle this issue, we propose a lightweight, passive wearable suit designed to assist back muscles during lifting tasks. The proposed system features a single elastic belt connected to multiple pulleys, which are attached to the back and lower limb sleeves. These pulleys are attached to the upper and lower limbs, and their relative distances change depending on body movements such as lifting or walking, thereby producing an effect similar to that of the moving pulley system. This innovative design allows the suit to deliver substantial support while efficiently distributing anchoring pressure across the wearer's skin during squatting and stooping positions. Additionally, the movement of belts through the pulleys minimizes the restrictions on gait motion compared to traditional designs. By adjusting the length of the belt, assist mode can be easily turned on and off, and flexibly applied to various body sizes. The supporting force is characterized by modeling and experimental tests. We evaluated the immediate effect of the prototype passively supporting back muscles during lifting tasks and reducing gait restriction during walking tasks.

Abstract:
Precise and agile flight maneuvers are essential for quadrotor applications, yet traditional control methods are limited by their reliance on flat trajectories or computationally intensive optimization. Reinforcement learning (RL)-based policies offer a promising alternative by directly mapping observations to actions, reducing dependency on system knowledge and actuation constraints. However, the sim-to-real gap remains a significant challenge, often causing instability in real-world deployments. In this work, we identify five key factors for learning robust RL-based control policies capable of zero-shot real-world deployment: (1) integrating velocity and rotation matrix into actor inputs, (2) incorporating time vector into critic inputs, (3) regularizing action differences for smoothness, (4) applying system identification with selective randomization, and (5) using large batch sizes during training. Based on these insights, we develop SimpleFlight, a PPO-based framework that integrates these techniques. Extensive experiments on the Crazyflie quadrotor demonstrate that SimpleFlight reduces trajectory tracking error by over 50% compared to state-of-the-art RL baselines. It excels in both smooth polynomial and challenging infeasible zigzag trajectories, particularly on small thrust-to-weight quadrotors, where baseline methods often fail. To enhance reproducibility and further research, we integrate SimpleFlight into the GPU-based Omnidrones simulator and provide open-source code and model checkpoints. For more details, visit our project website at https://sites.google.com/view/simpleflight/.

Abstract:
Soft robots have gained widespread attention due to their lightweight nature and inherent safety. Among them, soft growing robots (SGRs) are inspired by the growth mechanism of vines, achieving movement through tip eversion. However, their load-bearing capacity remains a significant challenge due to material limitations. The stiffness modulation approach based on layer jamming is constrained in high-curvature tip regions, preventing it from fully exhibiting its potential in unstructured environments. In this letter, motivated by enhancing the load-bearing capacity of SGRs and optimizing their tip motion performance, we propose a novel mixed-layer soft growing robot (MLSGR) and introduce an innovative modification to the conventional layer jamming fabrication method. Furthermore, we establish a more accurate kinematics model and, for the first time, propose a statics model to characterize tip behavior. Experimental results demonstrate that, compared to previous work, MLSGR achieves more than twice the load capacity, a 9% reduction in energy consumption and mechanical resistance for tip growth, a 17% improvement in tip retraction capability, and a 41.2% enhancement in kinematic model prediction accuracy (MAPE).

Abstract:
6D pose estimation of textureless objects is valuable for industrial robotic applications, yet remains challenging due to the frequent loss of depth information. Current multi-view methods either rely on depth data or insufficiently exploit multi-view geometric cues, limiting their performance. In this paper, we propose DKPMV, a pipeline that achieves dense keypoint-level fusion using only multi-view RGB images as input. We design a three-stage progressive pose optimization strategy that leverages dense multi-view keypoint geometry information. To enable effective dense keypoint fusion, we enhance the keypoint network with attentional aggregation and symmetry-aware training, improving prediction accuracy and resolving ambiguities on symmetric objects. Extensive experiments on the ROBI dataset demonstrate that DKPMV outperforms state-of-the-art multi-view RGB and RGB-D approaches. The code will be available at https://github.com/chenjiahongbq/DKPMV.

Abstract:
Lower-limb exoskeleton robots play a significant role in both rehabilitation and assisted walking, where accurate prediction of lower-limb joint angles is crucial for achieving natural gait. However, due to inter-subject variability and differences across locomotion modes, achieving cross-task generalization in joint angle prediction remains a major challenge. This work proposes a novel framework for multi-joint angle prediction in the lower limb, which includes a non-redundant muscle synergy feature extraction algorithm and a Generalizable Joint Angle Prediction Network (GenJAPNet) that generalizes across speeds and subjects. The feature extraction algorithm employs Non-negative Matrix Factorization (NMF) to extract an activation coefficient matrix from Surface Electromyography (sEMG) signals, followed by further dimensionality reduction using Uniform Manifold Approximation and Projection (UMAP) to obtain more discriminative and non-redundant features. GenJAPNet leverages pre-trained shared features and few-shot fine-tuning to rapidly adapt to new tasks. Through feature extraction algorithm comparison experiments, cross-speed and cross-subject experiments, and exoskeleton-assisted walking physical experiments, the effectiveness and generalizability of this method are validated, demonstrating its potential for enhancing the performance of lower-limb exoskeleton rehabilitation and assistive applications.

Abstract:
The demand for assistive robots for passenger transport, such as intelligent wheelchairs, is increasing rapidly due to demographic changes. To allow passengers to navigate in crowded environments, such as shopping malls and hospitals, these systems must navigate in a socially accepted manner that ensures the comfort of both passengers and surrounding pedestrians. Although deep reinforcement learning (DRL) has shown promising results for social navigation, existing planners often learn overly passive behaviors, not engaging in the mutual adaptation characteristic of human interaction. In this paper, we introduce a novel DRL-based local planner that learns navigation behaviors by integrating the Social Force Model (SFM) directly into its reward function, allowing more cooperative interactions for mobile robots and intelligent wheelchairs. This approach encourages the agent to learn more forward-looking and mutual navigation policies by rewarding actions that align with the dynamics of pedestrians. To ensure generalization and straightforward deployment, our method utilizes the standard Navigation 2 local costmap augmented with pedestrian detections as an observation. The experiments demonstrate that our agent achieves a higher success rate in crowded scenarios with fewer space intrusions, outperforming the state-of-the-art DRL planner based on velocity obstacles by up to 11%.
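
The abstract does not give the reward formulation, so the following is only a sketch of how a Social Force Model term could enter a reward. It uses the common exponential repulsion form, with force magnitude A·exp((r − d)/B) directed away from the pedestrian. The parameter values, the function names, and the choice to penalize the summed force magnitude are all assumptions made for illustration.

```python
import math


def repulsive_force(robot_xy, ped_xy, A=2.0, B=0.3, radii_sum=0.6):
    """Exponential social-force repulsion on the robot from one pedestrian.

    A (strength), B (range, metres) and radii_sum (metres) are hypothetical.
    """
    dx = robot_xy[0] - ped_xy[0]
    dy = robot_xy[1] - ped_xy[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        raise ValueError("robot and pedestrian positions coincide")
    magnitude = A * math.exp((radii_sum - dist) / B)
    # Unit vector pointing from the pedestrian towards the robot.
    return (magnitude * dx / dist, magnitude * dy / dist)


def social_penalty(robot_xy, pedestrians, scale=0.1):
    """Hypothetical reward term: negative magnitude of the summed force."""
    fx = fy = 0.0
    for ped in pedestrians:
        f = repulsive_force(robot_xy, ped)
        fx += f[0]
        fy += f[1]
    return -scale * math.hypot(fx, fy)


near = social_penalty((0.0, 0.0), [(0.5, 0.0)])
far = social_penalty((0.0, 0.0), [(3.0, 0.0)])
print(near, far)  # the closer pedestrian yields the larger penalty
```

A term of this kind decays smoothly with distance, so the agent receives a graded signal well before a hard collision or intrusion threshold is crossed.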

Abstract:
A class of planar bipedal robots with unique mechanical properties has been proposed, where all links are balanced around the hip joint, preventing natural swinging motion due to gravity. A common property of their equations of motion is that the inertia matrix is a constant matrix, there are no nonlinear velocity terms, and the gravity term contains simple nonlinear terms. By performing a Taylor expansion of the gravity term and making a linear approximation, it is easy to derive a linearized model, and calculations for future states or walkability determination can be performed instantaneously without the need for numerical integration. This paper extends the method to a planar biped robot model with knees. First, we derive the equations of motion, constraint conditions, and inelastic collisions for a planar 6-DOF biped robot, design its control system, and numerically generate a stable bipedal gait on a horizontal plane. Next, we reduce the equations of motion to a 3-DOF model, and derive a linearized model by approximating the gravity term as linear around the expansion point for the thigh frame angle. Through numerical simulations, we demonstrate that calculations for future states and walkability determination can be completed in negligible time. By applying control inputs to the obtained model, performing state-space realization, and then discretizing it, instantaneous walkability determination through iterative calculation becomes possible. Through detailed gait analysis, we discuss how the knee joint flexion angle and the expansion point affect the accuracy of the linear approximation, and the issues that arise when descending a small step.
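
The linearization step can be illustrated with a short sketch. The abstract says only that the gravity term contains "simple nonlinear terms", so a sine of the thigh frame angle is assumed here, and the expansion point value is hypothetical.

```python
import math


def linearized_sin(theta, theta0):
    """First-order Taylor expansion of sin(theta) about theta0."""
    return math.sin(theta0) + math.cos(theta0) * (theta - theta0)


def approximation_error(theta, theta0):
    """Absolute error of the linear approximation at theta."""
    return abs(math.sin(theta) - linearized_sin(theta, theta0))


theta0 = 0.3  # hypothetical expansion point in radians
for offset in (0.0, 0.1, 0.3):
    print(offset, approximation_error(theta0 + offset, theta0))
```

The error is bounded by half the squared offset from the expansion point, which is one way to see why the choice of expansion point matters for the accuracy of the linear model.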

Abstract:
The spatial information inherent in 3D point clouds is crucial for robotic manipulation. However, existing 3D pre-training methods face a fundamental trade-off: Masked Autoencoding (MAE) excels at capturing spatial-geometric features but lacks semantics, whereas contrastive learning, while able to distill semantics from 2D foundation models, is ill-suited for the fine-grained details required for manipulation tasks. To address these challenges, we propose CLAR, a novel 3D pre-training framework that synergizes global understanding with fine-grained local alignment. Our framework unifies MAE with global cross-modal contrastive learning to integrate robust spatial awareness with rich semantic understanding. To enhance its focus on fine-grained details, at the local level, we introduce an adaptive alignment mechanism that leverages deformable attention to force precise correspondences between local 3D geometry and 2D visual features, thereby overcoming the limitations of conventional global alignment in manipulation tasks. Extensive experiments in simulation and the real world demonstrate that CLAR achieves state-of-the-art performance, significantly outperforming existing methods in visuomotor policy learning.

Abstract:
Autonomous underwater robotics faces significant challenges, particularly in the reliable recovery of Autonomous Underwater Vehicles (AUVs) after mission completion. To address this, small AUVs can dock onto a moving mothership for safe transport to recovery sites, reducing operational risks. This paper presents an explicit analytical derivation of the LSL (Left-Straight-Left) and RSR (Right-Straight-Right) Dubins paths for intercepting a uniformly moving target, a critical problem for robust rendezvous in dynamic marine and underwater environments. The proposed approach leverages the classical Dubins path model to generate time-optimal, real-time, curvature-constrained paths suitable for 2D AUVs. Experimental validation on an Unmanned Surface Vehicle (USV) demonstrates the effectiveness of the developed motion planning strategy.
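
For orientation, the sketch below computes the length of the classical LSL Dubins path to a static goal pose. It does not reproduce the paper's derivation for a moving target. The function name and the pose convention (x, y, heading in radians, counter-clockwise positive) are assumptions.

```python
import math

TWO_PI = 2.0 * math.pi


def lsl_path_length(start, goal, radius):
    """Length of the LSL Dubins path between two poses (x, y, heading).

    The goal is static here; the moving-target case is not handled.
    """
    x0, y0, th0 = start
    x1, y1, th1 = goal
    # Centres of the left-turn circles tangent to the start and goal poses.
    c0 = (x0 - radius * math.sin(th0), y0 + radius * math.cos(th0))
    c1 = (x1 - radius * math.sin(th1), y1 + radius * math.cos(th1))
    dx, dy = c1[0] - c0[0], c1[1] - c0[1]
    straight = math.hypot(dx, dy)
    # For two left turns of equal radius, the straight segment is parallel
    # to the line joining the circle centres.
    phi = math.atan2(dy, dx)
    first_arc = (phi - th0) % TWO_PI
    second_arc = (th1 - phi) % TWO_PI
    return radius * (first_arc + second_arc) + straight


print(lsl_path_length((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), 1.0))
print(lsl_path_length((0.0, 0.0, 0.0), (4.0, 4.0, math.pi / 2), 1.0))
```

With a moving target, the goal pose depends on the arrival time, so the path length and the interception time have to be solved for together. That coupled problem is what the abstract describes solving analytically.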

Abstract:
Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are temporally distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.

Abstract:
Recent advancements in artificial intelligence have broadened the spectrum of tasks that robots can effectively tackle. However, the seamless execution of prolonged action sequences continues to pose a considerable challenge, attributed to limitations in the abilities of today's robots to react to unforeseen situations or failures. In response, we introduce Reaction Templates (RTs), a formal approach for integrating reactivity into task and motion planning. Operating concurrently with the primary execution logic, RTs enable a clear differentiation between planned actions and the necessary recovery strategies for handling unexpected events. This design promotes scalability by establishing reusable building blocks and customizable parameters, thereby enhancing flexibility in application. We provide a thorough introduction to the RT concept, elucidating its principles, mechanisms, and the rationale behind its design decisions. The resulting benefits of the approach are demonstrated through experimental validation with the humanoid robot Rollin Justin.

Abstract:
This paper presents a novel approach for representing proprioceptive time-series data from quadruped robots as structured two-dimensional images, enabling the use of convolutional neural networks for learning locomotion-related tasks. The proposed method encodes temporal dynamics from multiple proprioceptive signals, such as joint positions, IMU readings, and foot velocities, while preserving the robot's morphological structure in the spatial arrangement of the image. This transformation captures inter-signal correlations and gait-dependent patterns, providing a richer feature space than direct time-series processing. We apply this concept to the problem of contact estimation, a key capability for stable and adaptive locomotion on diverse terrains. Experimental evaluations on both real-world datasets and simulated environments show that our image-based representation consistently enhances prediction accuracy and generalization over conventional sequence-based models, underscoring the potential of cross-modal encoding strategies for robotic state learning. Our method achieves superior performance on the contact dataset, improving contact state accuracy from 87.7% to 94.5% over the recently proposed MI-HGNN method, using a 15 times shorter window size.

Abstract:
Advancements in perception, planning, and control enable the development of wearable robots capable of proactively assisting users in avoiding potentially negative outcomes. However, the introduction of robotic assistance in general is often associated with a loss in the sense of agency, a factor traditionally associated with overall device acceptance. Recent work provides a different perspective, showing that contextual proactive assistance is well-received for teleoperation or shared workspace tasks. Still, no works have investigated the impact of proactive assistance for wearable grasping devices, where physical interactions have increased potential for disrupting the user's experience. In this study, we analyze the impact of proactive assistance in a hand exoskeleton with an abstracted grasping task of varying difficulty. We show that in general, the presence of assistance does not significantly reduce experience or the sense of agency. In fact, in a difficult task, subjects strongly prefer proactive assistance, likely as a result of its provided utility. When the task is easily completed without assistance, subjects indicate no strong preference for assisted conditions. Our results challenge the notion of a direct trade-off between robotic assistance and agency, suggesting that well-designed assistance can improve performance and user preference without compromising their sense of control.

Abstract:
Deformable object manipulation is pivotal to numerous real-world robotic applications. A promising paradigm in this field is the shape servoing task, focusing on controlling deformable objects into desired goal shapes. However, prior works typically rely on impractical goal shape acquisition methods, such as laborious domain-knowledge engineering or manual manipulation. Crucially, existing methods fail in multi-modal goal settings, where multiple distinct goal shapes can all lead to successful task completion, a common scenario in many robotic applications. In this paper, we address this problem by developing DiffDef, a novel neural network that leverages a denoising diffusion model to learn a distribution over multiple valid goal shapes, rather than predicting a single deterministic outcome. DiffDef enables the generation of diverse goal shapes, thereby avoiding the mode-averaging artifacts inherent in deterministic models used by previous approaches. We demonstrate our method's effectiveness on several robotic tasks inspired by both manufacturing and surgical applications, both in simulation and on two physical robotic platforms: the da Vinci Research Kit (dVRK) robot and a bimanual KUKA-based robotic system.

Abstract:
Modern robotic systems increasingly employ nonlinear coupled joints, which present significant challenges in control. Unlike traditional serial chain configurations, where simplicity was the primary concern, parallel mechanisms such as those found in humanoid ankle joints add another layer of complexity. In this work, we propose an actuation controller for nonlinear coupled joints based on the Model Predictive Path Integral (MPPI) control framework: a sampling-based model predictive control framework that incorporates nonlinearity and coupling effects simultaneously. A highly nonlinear actuator-joint mapping, expressed through a lightweight neural network, enables intuitive controller design by exposing the actuator space control to the joint space command. Our method also allows joint limit constraints to be imposed, enabling safe operation on a real-robot platform. To experimentally validate our method, we conducted joint position control of a 2-DOF humanoid ankle joint, demonstrating accurate, real-time control and constraint-respecting behavior.
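
The core MPPI update can be sketched in a few lines: sampled control perturbations are averaged with softmax weights derived from their rollout costs. In the paper the costs would come from rollouts through the learned actuator-joint mapping. Here they are passed in directly, and the function names and temperature value are assumptions.

```python
import math


def mppi_weights(costs, temperature):
    """Softmax weights exp(-cost / temperature), normalised to sum to one."""
    best = min(costs)  # subtracting the minimum keeps the exponentials stable
    unnormalised = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalised)
    return [w / total for w in unnormalised]


def mppi_update(nominal, perturbations, costs, temperature=1.0):
    """Shift the nominal control sequence by the cost-weighted perturbations."""
    weights = mppi_weights(costs, temperature)
    horizon = len(nominal)
    return [
        nominal[k] + sum(w * eps[k] for w, eps in zip(weights, perturbations))
        for k in range(horizon)
    ]


nominal = [0.0, 0.0]                       # two-step control horizon
perturbations = [[1.0, 1.0], [-1.0, -1.0]]  # two sampled perturbations
# Equal costs cancel the symmetric perturbations; a much lower cost for the
# first sample pulls the update towards that sample.
print(mppi_update(nominal, perturbations, costs=[5.0, 5.0]))
print(mppi_update(nominal, perturbations, costs=[0.0, 1000.0]))
```

Constraints such as joint limits are commonly handled in this scheme by adding a large cost to rollouts that violate them, so that those samples receive negligible weight.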

Abstract:
Robust 6D pose estimation of novel textured objects under challenging illumination remains a significant challenge, often requiring a trade-off between accurate initial pose estimation and efficient real-time tracking. We present a unified framework explicitly designed for efficient execution on edge devices, which synergizes a robust initial estimation module with a fast motion-based tracker. The key to our approach is a shared, lighting-invariant color-pair feature representation that forms a consistent foundation for both stages. For initial estimation, this feature facilitates robust registration between the live RGB-D view and the object's 3D mesh. For tracking, the same feature logic validates temporal correspondences, enabling a lightweight model to reliably regress the object's motion. Extensive experiments on benchmark datasets demonstrate that our integrated approach is both effective and robust, providing competitive pose estimation accuracy while maintaining high-fidelity tracking even through abrupt pose changes. Code: https://github.com/smartslab/Color-Pair-Guided-Zero-Shot-6D-Pose

Abstract:
The fully actuated system approach (FASA) provides a promising control framework for robots with redundant actuation, offering simplified controller design and increased design freedom. However, its application to legged robots remains challenging due to hybrid actuation from intermittent ground contact and redundant inputs. To address this, we propose Virtual Force-based FASA (VF-FASA), which introduces virtual forces as intermediaries to construct the full-actuation conditions required by FASA. FASA generates virtual control laws based on a simplified torso dynamics model, and a matrix-weighted pseudoinverse optimization is employed to map these virtual inputs into actual torso joint torques and foot contact forces. This method achieves coordinated control of both the floating base and redundant torso, effectively leveraging joint redundancy for improved whole-body motion. Simulation results on a redundant-torso quadruped robot demonstrate robust trajectory tracking and effective whole-body coordination under dynamic locomotion. The framework expands FASA to legged systems, providing an effective approach for controlling quadruped robots.
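
The matrix-weighted pseudoinverse step can be sketched generically as follows. The abstract does not give the actual matrices, so this assumes the textbook formulation: minimize x^T W x subject to A x = b, with closed-form solution x = W^{-1} A^T (A W^{-1} A^T)^{-1} b. The map `A`, weights `W`, and target `b` below are hypothetical placeholders, not the robot's dynamics.

```python
import numpy as np

def weighted_pinv_solve(A, b, W):
    """Minimum weighted-norm solution of A x = b:
    x = W^-1 A^T (A W^-1 A^T)^-1 b, for W symmetric positive definite."""
    W_inv_At = np.linalg.solve(W, A.T)
    return W_inv_At @ np.linalg.solve(A @ W_inv_At, b)

# Hypothetical example: 3 redundant actual inputs realise 2 virtual inputs.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])

x_uniform = weighted_pinv_solve(A, b, np.eye(3))
# Penalising the shared middle input shifts effort to the other two.
x_penalised = weighted_pinv_solve(A, b, np.diag([1.0, 100.0, 1.0]))
```

Both solutions satisfy A x = b exactly. The weight matrix only decides how the redundant effort is distributed, which is the design freedom that the weighting provides.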

Abstract:
Since robots are often disregarded in public interactions, many studies have examined how nonverbal cues and dialogue strategies encourage users to initiate engagement. However, the impact of robot movement remains insufficiently investigated. This study examined the psychological effects of movement behavior on willingness to engage in dialogue in a scenario where a mobile guide robot leads people to a stationary robot. A field experiment in a shopping mall showed that guidance by a mobile robot significantly increased dialogue duration, whereas no correlation was found between moving distance and willingness to engage. These results suggest that physical commitment induced by guided movement may enhance user motivation, and that interaction designs leveraging movement behavior may be important for advancing the social implementation of interactive robots.

Abstract:
NASA's forthcoming Lunar Gateway space station, which will be uncrewed most of the time, will need to operate with an unprecedented level of autonomy. One key challenge is enabling the Canadarm3, the Gateway's external robotic system, to detect hazards in its environment using its onboard inspection cameras. This task is complicated by the extreme and variable lighting conditions in space. In this paper, we introduce the visual anomaly detection and localization task for the space domain and establish a benchmark based on a synthetic dataset called ALLO (Anomaly Localization in Lunar Orbit). We show that state-of-the-art visual anomaly detection methods often fail in the space domain, motivating the need for new approaches. To address this, we propose MRAD (Model Reference Anomaly Detection), a statistical algorithm that leverages the known pose of the Canadarm3 and a CAD model of the Gateway to generate reference images of the expected scene appearance. Anomalies are then identified as deviations from this model-generated reference. On the ALLO dataset, MRAD surpasses state-of-the-art anomaly detection algorithms, achieving an AP score of 62.9% at the pixel level and an AUROC score of 75.0% at the image level. Given the low tolerance for risk in space operations and the lack of domain-specific data, we emphasize the need for novel, robust, and accurate anomaly detection methods to handle the challenging visual conditions found in lunar orbit and beyond.

Abstract:
Developing a robotic hand that integrates high fingertip force, rapid response, and multi-degree-of-freedom (DoF) motion, similar to the human hand, remains a challenge in the field of robotic hands. This study presents the Soft-OmniFunctional Robotic Gripper (SOFRo Gripper), designed to achieve all aforementioned characteristics. The finger module of the SOFRo Gripper incorporates synergistically arranged chambers together with a multi-node tendon routing strategy that distributes actuation forces, enabling both flexion and ab/adduction motions while enhancing fingertip force. Furthermore, to maximize fingertip force, the Force-Enhanced Pleated (FEP) mechanism was applied to the chambers, increasing force by 32.41% compared to conventional chamber designs. The proposed SOFRo Gripper achieves a high fingertip force of 68.76 N and dexterous motion capabilities, enabling a maximum lifting force of 400.0 N and in-hand manipulation. To validate its versatility, extensive experiments were conducted, demonstrating the hand's capability to perform a wide range of tasks. As a result, the SOFRo Gripper successfully performed grasping tasks involving various objects, as well as high-force tasks (e.g., lifting heavy objects, closing a valve), delicate tasks (e.g., grasping tofu, inserting a light bulb), and high-speed tasks (e.g., spinning a top, catching a ball). The system demonstrates high force capability and performs a wide range of tasks.

Abstract:
Vision-based indoor navigation systems have been proposed previously for service robots. However, in real-world scenarios, many of these approaches remain vulnerable to visually challenging environments such as white walls. In-home service robots, which are mass-produced, require affordable sensors and processors. Therefore, this paper presents a lightweight and resilient plugin mapping method called Waliner, using an RGB-D sensor and an embedded processor equipped with a neural processing unit (NPU). Waliner can be easily implemented in existing algorithms and enhances the accuracy and robustness of 2D/3D mapping in visually challenging environments with minimal computational overhead by leveraging a) structural building components, such as walls; b) the Manhattan world assumption; and c) an extended Kalman filter-based pose estimation and map management technique to maintain reliable mapping performance under varying lighting and featureless conditions. As verified in various real-world in-home scenes, the proposed method yields over a 5% improvement in mapping consistency as measured by the map similarity index (MSI) while using minimal resources.

Abstract:
Robot teleoperation plays a crucial role in collecting data for large-scale imitation learning. Inferring the operator's hand pose is crucial for vision-based teleoperation, and current solutions rely on either additional neural network training or extra hardware to infer the operator's wrist pose. To our knowledge, there is no open-source, general teleoperation toolkit that can be easily deployed to retarget both hand and wrist poses from a single RGB camera. In this paper, we propose OAT (Optimization-based hAnd pose retargeting and wrisT pose estimation), a streamlined approach to retarget human hand and wrist pose to the robot. We leverage the off-the-shelf MediaPipe framework to estimate the operator's hand pose and employ an optimization-based method to infer the operator's wrist pose within the camera frame by 2D/3D hand joint matching. This integrated pipeline facilitates teleoperation from virtually any location using any device equipped with an RGB camera, offering a highly accessible and easily implementable solution. Furthermore, a hand-based camera calibration optimization is proposed to improve the accuracy of wrist pose estimation. In addition to minimal hardware requirements and deployment convenience, our system also demonstrates superior real-time performance compared to state-of-the-art vision-based teleoperation methods.

Abstract:
We present a complete framework for fast motion planning of non-holonomic autonomous mobile robots in highly complex but structured environments. Conventional grid-based planners struggle with scalability, while many kinematically-feasible planners impose a significant computational burden due to their search space complexity. To overcome these limitations, our approach introduces a deterministic free-space decomposition that creates a compact graph of overlapping rectangular corridors. This method enables a significant reduction in the search space, without sacrificing path resolution. The framework then performs online motion planning by finding a sequence of rectangles and generating a near-time-optimal, kinematically-feasible trajectory using an analytical planner. The result is a highly efficient solution for large-scale navigation. We validate our framework through extensive simulations and on a physical robot. The implementation will be made publicly available as open-source software.

Abstract:
Efficient Gas Source Localization (GSL) in real-world settings is crucial, especially in emergency scenarios. Mobile robots equipped with low-cost, in-situ gas sensors offer a safer alternative to human inspection in hazardous environments. Probabilistic algorithms enhance GSL efficiency with scattered gas measurements by comparing gas concentration measurements gathered by robots to physical dispersion models. However, accurately deriving gas concentrations from data acquired with low-cost sensors is challenging due to the nonlinear sensor response, environmental dependencies (e.g., humidity, temperature, and other gas influences), and robot motion. Mitigating these disturbance factors requires frequent sensor calibration in controlled environments, which is often impractical for real-world deployments. To overcome these issues, we propose a novel feature extraction algorithm that leverages the relative ranking of gas measurements within the dynamically accumulated dataset. By comparing the rank differences between gathered and modeled values, we estimate the probabilistic distribution of source locations across the entire environment. We validate our approach in high-fidelity simulations and physical experiments, demonstrating consistent localization accuracy with uncalibrated gas sensors. Compared to existing methods, our technique eliminates the need for gas sensor calibration, making it well-suited for real-world applications.
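
The rank-comparison idea can be illustrated with a small sketch. It is an assumption-laden toy, not the proposed algorithm: the isotropic `toy_plume` dispersion model, the squared rank-difference score, the `temperature` parameter, and the simulated uncalibrated sensor response are all hypothetical. It only shows why ranks help: any monotone sensor nonlinearity leaves the ranking of readings unchanged.

```python
import numpy as np

def ranks(values):
    """Rank of each entry (0 = smallest); assumes no exact ties."""
    return np.argsort(np.argsort(values))

def rank_posterior(sample_xy, readings, candidates, model, temperature=50.0):
    """Score each candidate source by how well the ranking of modeled
    concentrations matches the ranking of raw sensor readings."""
    r_meas = ranks(readings)
    scores = np.array(
        [np.sum((ranks(model(sample_xy, c)) - r_meas) ** 2) for c in candidates],
        dtype=float,
    )
    post = np.exp(-(scores - scores.min()) / temperature)
    return post / post.sum()

def toy_plume(sample_xy, source, spread=2.0):
    """Hypothetical dispersion model: concentration decays with distance."""
    d2 = np.sum((sample_xy - source) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * spread ** 2))

rng = np.random.default_rng(1)
sample_xy = rng.uniform(0.0, 10.0, size=(30, 2))      # robot sampling positions
candidates = np.array(
    [[x, y] for x in range(1, 10, 2) for y in range(1, 10, 2)], dtype=float
)
true_source = np.array([7.0, 3.0])
# Uncalibrated sensor: unknown gain, offset, and nonlinearity (all monotone).
readings = 3.0 * np.sqrt(toy_plume(sample_xy, true_source)) + 0.2
posterior = rank_posterior(sample_xy, readings, candidates, toy_plume)
best = candidates[np.argmax(posterior)]
```

In this noise-free toy the candidate at the true source reproduces the measured ranking exactly, so it receives the highest probability even though the readings were never converted to concentrations.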

Abstract:
Magnetic actuation is a powerful, non-contact method for controlling milli-scale robots. However, existing mobile magnetic field sources face a difficult trade-off. Single-coil end effectors are simple but underactuated, forcing complex and inefficient robot motions to steer objects. Conversely, multi-coil systems improve dexterity but introduce significant mechanical and control complexity. To cope with this challenge, we present a compact, fixed-configuration triple-coil electromagnetic end effector mounted on a 7-DOF robotic arm. Our innovation lies in a hierarchical control strategy that decouples global and local actuation. A Control Lyapunov Function-based Quadratic Programming (QP-CLF) controller guides the robotic arm for large-scale repositioning, extending the workspace and minimizing required currents. Simultaneously, modulating the currents through the three coils provides fine, high-bandwidth electrical control over the local magnetic field and gradient. We validated this approach by steering 1, 2, and 3 mm magnetic spheres along complex spiral trajectories inside fluid-filled phantoms (water and oil). Our system was teleoperated under operator vision and demonstrated highly repeatable path-following performance, showing that this synergistic robot-electromagnet control provides a compelling balance of dexterity, compactness, and simplicity for advanced magnetic manipulation tasks.

Abstract:
Autonomous control of double-Ackermann-steering robots is essential in agricultural applications, where robots must execute precise and complex maneuvers within a limited space. Classical methods, such as the Timed Elastic Band (TEB) planner, can address this problem, but they rely on parameter tuning, making them highly sensitive to changes in robot configuration or environment and impractical to deploy without constant recalibration. At the same time, end-to-end deep reinforcement learning (DRL) methods often fail due to unsuitable reward functions for non-holonomic constraints, resulting in sub-optimal policies and poor generalization. To address these challenges, this paper presents ManeuverNet, a DRL framework tailored for double-Ackermann systems, combining Soft Actor-Critic with CrossQ. Furthermore, ManeuverNet introduces four specifically designed reward functions to support maneuver learning. Unlike prior work, ManeuverNet does not depend on expert data or handcrafted guidance. We extensively evaluate ManeuverNet against both state-of-the-art DRL baselines and the TEB planner. Experimental results demonstrate that our framework substantially improves maneuverability and success rates, achieving more than a 40% gain over DRL baselines. Moreover, ManeuverNet effectively mitigates the strong parameter sensitivity observed in the TEB planner. In real-world trials, ManeuverNet achieved up to a 90% increase in maneuvering trajectory efficiency, highlighting its robustness and practical applicability.

Abstract:
Autonomous driving involves multiple, often conflicting objectives such as safety, efficiency, and comfort. In reinforcement learning (RL), these objectives are typically combined through weighted summation, which collapses their relative priorities and often yields policies that violate safety-critical constraints. To overcome this limitation, we introduce the Preordered Multi-Objective MDP (Pr-MOMDP), which augments standard MOMDPs with a preorder over reward components. This structure enables reasoning about actions with respect to a hierarchy of objectives rather than a scalar signal. To make this structure actionable, we extend distributional RL with a novel pairwise comparison metric, Quantile Dominance (QD), that evaluates action return distributions without reducing them into a single statistic. Building on QD, we propose an algorithm for extracting optimal subsets, the subset of actions that remain non-dominated under each objective, which allows precedence information to shape both decision-making and training targets. Our framework is instantiated with Implicit Quantile Networks (IQN), establishing a concrete implementation while preserving compatibility with a broad class of distributional RL methods. Experiments in CARLA show improved success rates, fewer collisions and off-road events, and statistically more robust policies than IQN and ensemble-IQN baselines. By ensuring that policies respect the reward preorder, our work advances safer, more reliable autonomous driving systems.

Abstract:
Search and rescue (SAR) robots are required to quickly traverse terrain and perform high-force rescue tasks, necessitating both terrain adaptability and controlled high-force output. Few platforms exist today for SAR, and fewer still have the ability to cover both tasks of terrain adaptability and high-force output when performing extraction. While legged robots offer significant ability to traverse uneven terrain, they typically are unable to incorporate mechanisms that provide variable high-force outputs, unlike traditional wheel-based drive trains. This work introduces a novel concept for a dynamically extensible and retractable robot leg. Leveraging a dynamically extensible and retractable five-bar linkage design, it allows for mechanically switching between height-advantaged and force-advantaged configurations via a geometric transformation. A testbed evaluated leg performance across linkage geometries and operating modes, with empirical and analytical analyses conducted on stride length, force output, and stability. The results demonstrate that the morphing leg offers a promising path toward SAR robots that can both navigate terrain quickly and perform rescue tasks effectively.

Abstract:
While robotics research continues to propose strategies for collision avoidance in human-robot interaction, the reality of constrained environments and future humanoid systems makes contact inevitable. To mitigate injury risks, energy-constraining control approaches are commonly used, often relying on safety thresholds derived from blunt impact data in EN ISO 10218-2:2025. However, this dataset does not extend to edged or pointed collisions. Without scalable, clinically grounded datasets covering diverse contact scenarios, safety validation remains limited. Previous studies have laid the groundwork by assessing surrogate-based velocity and mass limits across various geometries, focusing on perpendicular impacts. This study expands those datasets by including shearing contact scenarios in unconstrained collisions, revealing that collision angle significantly affects injury outcomes. Notably, unconstrained shearing contacts result in fewer injuries than perpendicular ones. By reevaluating all prior porcine surrogate data, we establish energy thresholds across geometries and contact types, forming the first energy-based Injury Protection Database. This enables the development of meaningful energy-limiting controllers that ensure safety across a wide range of realistic collision events.

Abstract:
Real-time tracking of previously unseen, highly dynamic objects in contact-rich scenes, such as during dexterous in-hand manipulation, remains a major challenge. Pure vision-based approaches often fail under heavy occlusions due to frequent contact interactions and motion blur caused by abrupt impacts. We propose TwinTrack, a physics-aware perception system that enables robust, real-time 6-DoF pose tracking of unknown dynamic objects in contact-rich scenes by leveraging contact physics cues. At its core, TwinTrack integrates Real2Sim and Sim2Real. Real2Sim combines vision and contact physics to jointly estimate object geometry and physical properties: an initial reconstruction is obtained from vision, then refined by learning a geometry residual and simultaneously estimating physical parameters (e.g., mass, inertia, and friction) based on contact dynamics consistency. Sim2Real achieves robust pose estimation by adaptively fusing a visual tracker with predictions from the updated contact dynamics. TwinTrack is implemented on a GPU-accelerated, customized MJX engine to guarantee real-time performance. We evaluate our method on two contact-rich scenarios: object falling with environmental contacts and multi-fingered in-hand manipulation. Results show that, compared to baselines, TwinTrack delivers significantly more robust, accurate, and real-time tracking in these challenging settings, with tracking speeds above 20 Hz.

Abstract:
In computer-assisted orthopedic surgery (CAOS), accurately registering sparse and partial intraoperative point sets with a complete preoperative model remains highly challenging due to limited overlap, extreme sparsity, and point localisation noise. In this paper, we propose a novel end-to-end completion-registration framework to accurately register partial and sparse point sets in CAOS. First, we develop a three-branch network that separately encodes intraoperative pose and geometry, while extracting rotation-invariant geometric priors from the preoperative model in a canonical space. This structure-aware design provides strong and beneficial cues for completing missing regions using sparse and partial data. Second, to address the sensitivity of the completion to random input poses, the completion is specifically conducted in a canonical frame and a learned SE(3) transform maps the output back to the observed intraoperative space. Third, we introduce a probabilistic registration module based on a bidirectional hybrid mixture model that aligns the completed intraoperative and preoperative point sets in distribution space by jointly optimizing the source-to-target and target-to-source objectives, addressing density mismatch and geometric inconsistencies that may arise from completion. Finally, we present the individual loss formulations for the supervised and unsupervised learning paradigms, enabling robust end-to-end optimization of the entire pipeline. We systematically validate our approach on 1,757 femur, 1,301 hip, and 397 tibia models, as well as real-world phantom experiments. Our method achieves state-of-the-art performance under low overlap (15-30%), sparse observations (64-128 points), and large initial misalignments (rotations in [-180°, 180°] and translations in [-100, 100] mm), demonstrating strong robustness and generalization.

Abstract:
Accurate motion tracking and haptics are pivotal to building platforms for immersive Virtual Reality, dexterous robotic hand teleoperation, or embodied AI data collection. Existing technologies fail to provide accurate finger motion tracking and multidimensional force feedback simultaneously, complicating robotic hand control. This work develops the APEX-Glove: the world's first dorsal-mounted wearable hand exoskeleton yielding both accurate finger motion tracking and active kinesthetic 3D force feedback. Data-driven modeling of the exoskeleton and its Dynamixel XL330 actuators compensates gravity, Coriolis, and friction forces to improve transparency and comfort. Biomechanically-informed analytical inverse kinematics estimates human finger joint angles at 300 Hz with an average Root Mean Squared Error of 18.5° when compared to industrial-grade datagloves (MANUS Quantum Metagloves). Stationary testing finds that the APEX-Glove can generate up to 0.8 N, 0.7 N, and 1.4 N of force feedback in the x, y, and z directions, on average. Motion retargeting to humanoid robot hands is also detailed, with hardware experimentation demonstrating haptic hand teleoperation. Lastly, we open-source the APEX-Glove's cost-effective (<700 USD) design to disseminate its motion capture and force feedback capabilities to the community.

Abstract:
Autonomous aerial target tracking in unstructured and GPS-denied environments remains a fundamental challenge in robotics. Many existing methods rely on motion capture systems, pre-mapped scenes, or feature-based localization to ensure safety and control, limiting their deployment in real-world conditions. We introduce NOVA, a fully onboard, object-centric framework that enables robust target tracking and collision-aware navigation using only a stereo camera and an IMU. Rather than constructing a global map or relying on absolute localization, NOVA formulates perception, estimation, and control entirely in the target's reference frame. A tightly integrated stack combines a lightweight object detector with stereo depth completion, followed by histogram-based filtering to infer robust target distances under occlusion and noise. These measurements feed a visual-inertial state estimator that recovers the full 6-DoF pose of the robot relative to the target. A nonlinear model predictive controller (NMPC) plans dynamically feasible trajectories in the target frame. To ensure safety, high-order control barrier functions (CBFs) are constructed online from a compact set of high-risk collision points extracted from depth, enabling real-time obstacle avoidance without maps or dense representations. We validate NOVA across challenging real-world scenarios, including urban mazes, forest trails, and repeated transitions through buildings with intermittent GPS loss and severe lighting change

Abstract:
Loop closure detection in dynamic SLAM faces critical challenges when dynamic objects dominate camera views, degrading frame-to-frame methods reliant on static landmarks. We propose A-SPAM, an asynchronous framework that constructs spatiotemporal semantic graphs via semantic padding (entity tracking + rigid structure analysis) and validates loops via semantic matching (topology-feature hybrid correlation). Evaluated on the TUM and BONN datasets, A-SPAM achieves a recall of at least 76.8% at 100% precision in dynamic environments, while maintaining a mean translational error of less than 0.07 m across dynamic sequences under degraded odometry conditions. The proposed framework corrects erroneous trajectories and enhances robustness against odometry failures in dynamic environments.

Abstract:
This paper presents a data-driven control optimization framework for flexible joint robots (FJR) based on frequency response function (FRF) data, enabling automated controller synthesis without explicit model identification. Unlike conventional model-based approaches that rely on accurate parameter estimation, the proposed method directly utilizes measured FRF data and formulates the controller design as a convex optimization problem. The controller maximizes control bandwidth while ensuring stability across a wide range of configurations. Experimental validation on an FJR demonstrates superior tracking accuracy, vibration suppression, and robustness compared to model-based methods. Furthermore, a high-speed drumming task demonstrates the ability of the controller to handle repeated impacts and inertia variations, highlighting the potential of FRF-based control for the fast and precise operation of flexible robotic systems.

Abstract:
Taking human-robot collaborative assembly as an example, the methods based on contact forces can improve the assembly efficiency of industrial robots with large components in industrial manufacturing. However, due to the large size, high payload, assembly accuracy and dynamic changes in grip position, accurately estimating the contact forces between the payload and the operator becomes challenging when handling these large components. In this paper, a two-stage method is proposed for payload dynamic parameter identification. The parameter identification equation in the sensor coordinate system is initially established. Furthermore, the identification model of recursive restricted total least squares (RRTLS) based on total least squares (TLS) is constructed to achieve low-consumption online identification. According to the assembly requirements and payload characteristics, the posture coordinate system is designed for safety, including the feasible workspace for the robot. Subsequently, the static identification postures and dynamic excitation trajectory are planned to obtain static values and dynamic inertial parameters. In the end, a high-payload human-robot collaborative assembly system is built to validate the proposed method. Experimental results show that compared with the existing methods, the proposed approach can effectively identify and compensate the payload, leading to more accurate external force sensing.
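
For readers unfamiliar with total least squares (TLS), on which the RRTLS identification model is built, the classical batch solution can be sketched as below. This is the standard SVD-based TLS estimator, not the recursive restricted variant proposed in the paper. The regressor `A`, observation `b`, and parameter values are hypothetical stand-ins for the payload identification equation.

```python
import numpy as np

def total_least_squares(A, b):
    """Batch TLS estimate for A x ≈ b when both A and b are noisy.
    The solution comes from the right singular vector of [A | b]
    associated with the smallest singular value."""
    n = A.shape[1]
    _, _, vt = np.linalg.svd(np.column_stack([A, b]))
    v = vt[-1]                 # singular values are sorted in descending order
    return -v[:n] / v[n]       # rescale so the last component equals -1

# Hypothetical identification problem with three unknown parameters.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
x_true = np.array([2.0, -1.0, 0.5])
b = A @ x_true
x_hat = total_least_squares(A, b)
```

With noise-free data the estimate matches `x_true` up to numerical precision. With noise in both the regressor and the measurements, TLS accounts for errors in `A`, which ordinary least squares ignores.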

Abstract:
Regulating grasping force to reduce slippage during dynamic object interaction remains a fundamental challenge in robotic manipulation, especially when objects are manipulated by multiple rolling contacts, have unknown properties (such as mass or surface conditions), and when external sensing is unreliable. In contrast, humans can quickly regulate grasping force by touch, even without visual cues. Inspired by this ability, we aim to enable robotic hands to rapidly explore objects and learn tactile-driven grasping force control under motion and limited sensing. We propose a physics-informed energy abstraction that models the object as a virtual energy container. The inconsistency between the fingers' applied power and the object's retained energy provides a physically grounded signal for inferring slip-aware stability. Building on this abstraction, we employ model-based learning and planning to efficiently model energy dynamics from tactile sensing and perform real-time grasping force optimization. Experiments in both simulation and hardware demonstrate that our method can learn grasping force control from scratch within minutes, effectively reduce slippage, and extend grasp duration across diverse motion-object pairs, all without relying on external sensing or prior object knowledge.

Abstract:
The flying-wing aircraft control problem is a major concern. In this paper, a new control strategy is introduced. First, a feedforward neural network (FNN) model is introduced. Then, a second-order sliding mode control is applied, with the parameters generated by Deep Deterministic Policy Gradient (DDPG) reinforcement learning. To study the disturbance rejection performance, wind disturbance is applied to the aircraft, using a deep neural network as a disturbance observer for different types of winds. Finally, three simulations (Simulink, software-in-the-loop, and hardware-in-the-loop) are applied to show the effectiveness of the proposed strategy. The simulation results show that the proposed method demonstrates good robustness in various conditions.

Abstract:
Task and Motion Planning (TAMP) frameworks for bimanual robots are limited by the combinatorial explosion at the task planning level, which can negatively affect human-robot interaction. This work introduces BAG-Learn Planning, an efficient learning-based task planning approach that combines a Bio-Inspired Action Context-Free Grammar (BAG) with a Long Short-Term Memory (LSTM) network to infer symbolic task plans and achieve bimanual manipulation. The proposed approach replaces costly symbolic search with efficient inference by formulating task planning as sequence prediction over grammar-compliant symbolic representations. Experimental comparisons with the classical Fast Downward task planner across three activities demonstrate significant reductions in task planning time, with millisecond-scale planning achieved for both seen and unseen goals. Additional results show robustness to increasing numbers of objects and symbolic locations, thus mitigating combinatorial explosion. BAG-Learn Planning is integrated with a Rapidly Exploring Random Tree (RRT) motion planner to form a complete TAMP framework. The latter is deployed on a physical bimanual robotic platform to achieve three household activities: pouring, opening, and passing.

Abstract:
This paper presents GLaMP, a grounded language model-based multi-agent framework for long-horizon robotic task planning in industrial environments. A key challenge in such tasks lies in the gap between high-level language reasoning and low-level perceptual grounding, which often leads to error accumulation during execution. GLaMP introduces a closed-loop architecture where a vision-language model extracts hierarchical task structures from manuals, a perception module grounds multimodal observations into symbolic predicates, and a large language model generates executable behavior trees. Through bidirectional feedback between perception and planning, the system continuously verifies and updates symbolic states, improving robustness in long-horizon execution. Preliminary experiments on representative industrial tasks demonstrate improved task success rates and reliability compared to existing approaches, highlighting the effectiveness of closed-loop grounding for robotic task planning.

Abstract:
This paper proposes a manipulator-effort-aware model predictive control (MPC) framework for coordinating body motion of a multi-legged underwater walking robot during moored mine clearance under ocean-current disturbances. The method treats the norm of manipulator joint torques as an effort-related input and uses it in an upper-layer MPC to adapt the robot's body approach motion and posture, while lower-level whole-body control and impedance control handle body stabilization and rope grasping. Simulation studies in a ROS1 Noetic-Gazebo environment with a UUV-Simulator-based model show that, under increasing unidirectional current, conventional decoupled controllers cause manipulator torques to grow and approach saturation, whereas the proposed framework keeps torques within safe limits by generating adaptive body-motion compensation. These results indicate improved mechanical stability and reduced manipulator burden during rope-grasping interactions, though validation is currently limited to a simplified unidirectional flow without full locomotion and is sensitive to model accuracy, motivating future experiments with more complex currents, dynamic walking scenarios, and refined hydrodynamic modeling.

Abstract:
Control Barrier Functions (CBFs) are a powerful tool for ensuring the safety of autonomous systems, yet applying them to nonholonomic robots in cluttered, dynamic environments remains an open challenge. State-of-the-art methods often rely on collision-cone or velocity-obstacle constraints which, by only considering the angle of the relative velocity, are inherently conservative and can render the CBF-based quadratic program infeasible, particularly in dense scenarios. To address this issue, we propose a Dynamic Parabolic Control Barrier Function (DPCBF) that defines the safe set using a parabolic boundary. The parabola's vertex and curvature dynamically adapt based on both the distance to an obstacle and the magnitude of the relative velocity, creating a less restrictive safety constraint. We prove that the proposed DPCBF is valid for a kinematic bicycle model subject to input constraints. Extensive comparative simulations demonstrate that our DPCBF-based controller significantly enhances navigation success rates and QP feasibility compared to baseline methods. Our approach successfully navigates through dense environments with up to 100 dynamic obstacles, scenarios where collision cone-based methods fail due to infeasibility.
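
To make the notion of a CBF-based quadratic program concrete, the following is a minimal sketch of a generic CBF safety filter. It is not the DPCBF from the abstract: it assumes a single-integrator robot (x_dot = u), a circular safe set h(x) = ||x - x_o||^2 - r^2, a linear class-K gain `alpha`, and no input constraints, all of which are illustrative assumptions. With a single affine constraint, the QP reduces to a closed-form projection onto a half-space.

```python
from typing import Tuple

Vec = Tuple[float, float]


def barrier(x: Vec, obstacle: Vec, radius: float) -> float:
    """Circular barrier h(x) = ||x - x_o||^2 - r^2; h >= 0 means safe."""
    dx, dy = x[0] - obstacle[0], x[1] - obstacle[1]
    return dx * dx + dy * dy - radius * radius


def cbf_filter(x: Vec, u_nom: Vec, obstacle: Vec, radius: float,
               alpha: float = 1.0) -> Vec:
    """Solve min ||u - u_nom||^2  s.t.  grad_h(x) . u + alpha * h(x) >= 0.

    Assumes single-integrator dynamics x_dot = u (hypothetical model),
    so h_dot = grad_h . u and the constraint is affine in u.
    """
    a = (2.0 * (x[0] - obstacle[0]), 2.0 * (x[1] - obstacle[1]))  # grad h
    b = alpha * barrier(x, obstacle, radius)
    slack = a[0] * u_nom[0] + a[1] * u_nom[1] + b
    if slack >= 0.0:
        return u_nom  # nominal input already satisfies the CBF condition
    norm_sq = a[0] ** 2 + a[1] ** 2
    if norm_sq == 0.0:
        return u_nom  # degenerate case: robot exactly at the obstacle centre
    lam = -slack / norm_sq  # smallest correction along grad h
    return (u_nom[0] + lam * a[0], u_nom[1] + lam * a[1])


if __name__ == "__main__":
    # Robot at (2, 0) driving straight at a unit-radius obstacle at the origin.
    print(cbf_filter((2.0, 0.0), (-1.0, 0.0), (0.0, 0.0), 1.0))
```

With several obstacles or input bounds the problem no longer has this closed form and can become infeasible, which is the failure mode the abstract attributes to conservative constraints.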

Abstract:
Modular soft robot arms (MSRAs) are composed of multiple modules connected in a sequence, and they can bend at different angles in various directions. This capability allows MSRAs to perform more intricate tasks than single-module robots. However, the modular structure also induces challenges in accurate planning and control. Nonlinearity and hysteresis complicate the physical model, while the modular structure and increased DOFs further lead to cumulative errors along the sequence. To address these challenges, we propose a versatile configuration space planning and control strategy for MSRAs, named S2C2A (State to Configuration to Action). Our approach formulates an optimization problem, S2C (State to Configuration planning), which integrates various loss functions and a forward model based on biLSTM to generate configuration trajectories based on target states. A configuration controller C2A (Configuration to Action control) based on biLSTM is implemented to follow the planned configuration trajectories, leveraging only inaccurate internal sensing feedback. We validate our strategy using a cable-driven MSRA, demonstrating its ability to perform diverse offline tasks such as position and orientation control and obstacle avoidance. Furthermore, our strategy endows MSRA with online interaction capability with targets and obstacles. Future work focuses on addressing MSRA challenges, such as more accurate physical models.

Abstract:
In the rapidly evolving field of vision-language navigation (VLN), ensuring safety for physical agents remains an open challenge. For a human-in-the-loop language-operated drone to navigate safely, it must understand natural language commands, perceive the environment, and simultaneously avoid hazards in real time. Control Barrier Functions (CBFs) are formal methods that enforce safe operating conditions. Model Predictive Control (MPC) is an optimization framework that plans a sequence of future actions over a prediction horizon, ensuring smooth trajectory tracking while obeying constraints. In this work, we consider a VLN-operated drone platform and enhance its safety by formulating a novel scene-aware CBF that leverages ego-centric observations from a camera which has both Red-Green-Blue as well as Depth (RGB-D) channels. A CBF-less baseline system uses a Vision-Language Encoder with cross-modal attention to convert commands into an ordered sequence of landmarks. An object detection model identifies and verifies these landmarks in the captured images to generate a planned path. To further enhance safety, an Adaptive Safety Margin Algorithm (ASMA) is proposed. ASMA tracks moving objects and performs scene-aware CBF evaluation on-the-fly, which serves as an additional constraint within the MPC framework. By continuously identifying potentially risky observations, the system performs prediction in real time about unsafe conditions and proactively adjusts its control actions to maintain safe navigation throughout the trajectory. Deployed on a Parrot Bebop2 quadrotor in the Gazebo environment using the Robot Operating System (ROS), ASMA achieves a 64%–67% increase in success rates with only a slight increase (1.4%–5.8%) in trajectory lengths com

Abstract:
Recent progress in 3D place recognition has delivered strong results in urban and indoor scenarios, but orchards remain largely unexplored. In these environments, unreliable or absent GNSS signals necessitate LiDAR-based place recognition for robust long-term localization, yet challenges such as ill-defined geometry, semi-transparent foliage, and severe inter-/intra-row overlaps cause high structural ambiguity. To address these challenges, we propose SCDCE-3D, a novel framework that integrates soft-weighted covariance representation with dual-branch channel enhancement. The soft-weighted covariance module adaptively down-weights noisy or overlapping points using a sigmoid-based weighting strategy, enabling robust second-order statistical representation that suppresses cross-row interference. In parallel, a dual-branch backbone extracts complementary global and local features, which drive a dynamic channel enhancement mechanism to emphasize discriminative feature channels while suppressing redundancy. Furthermore, multi-level triplet learning is applied not only to the final descriptor but also to intermediate statistical features, reinforcing robustness against structural ambiguity. Experiments on orchard-based LiDAR datasets demonstrate that SCDCE-3D significantly outperforms state-of-the-art methods in both recall and robustness, offering a reliable solution for long-term 3D place recognition in agricultural robotics. Code is available at https://github.com/typist2001/SCDCE-3D.

Abstract:
Recent advances in teleoperation have enabled sophisticated manipulation of dexterous robotic hands, with most systems concentrating on guiding finger positions to achieve desired grasp configurations. However, while accurate finger positioning is essential, it often overlooks the equally critical task of grasp force modulation, which is vital for handling objects of diverse hardness, texture, and shape. This limitation poses a significant challenge for users, especially individuals with upper-limb disabilities who lack natural tactile feedback and rely on indirect cues to infer appropriate force levels. To address this gap, we propose a novel teleoperation framework that integrates EMG-based force control, AR-based pose tracking, and visuo-tactile sensing to enable precise and intuitive force adjustment. A wearable haptic vest delivers real-time tactile feedback, allowing users to dynamically refine grasp force during manipulation. User studies confirm that our dual-loop control system substantially improves grasp stability and task success, underscoring its potential for assistive robotic applications.

Abstract:
This paper introduces a novel shape-sensing approach for Concentric Tube Steerable Drilling Robots (CT-SDRs) based on Optical Frequency Domain Reflectometry (OFDR). Unlike traditional FBG-based methods, OFDR enables continuous strain measurement along the entire fiber length with enhanced spatial resolution. In the proposed method, a Shape Sensing Assembly (SSA) is first fabricated by integrating a single OFDR fiber with a flat NiTi wire. The calibrated SSA is then routed through and housed within the internal channel of a flexible drilling instrument, which is guided by the pre-shaped NiTi tube of the CT-SDR. In this configuration, the drilling instrument serves as a protective sheath for the SSA during drilling, eliminating the need for integration or adhesion to the instrument surface that is typical of conventional optical sensor approaches. The performance of the proposed SSA, integrated within the cannulated CT-SDR, was thoroughly evaluated under free-bending conditions and during drilling along multiple J-shaped trajectories in synthetic Sawbones phantoms. Results demonstrate accurate and reliable shape-sensing capability, confirming the feasibility and robustness of this integration strategy.

Abstract:
Effective participation in multiparty scenarios requires robots to move beyond individuals toward understanding group-level social dynamics, which are inherently complex due to the interplay of nonverbal cues, internal states, and interaction context. Existing approaches often rely on end-to-end deterministic models, while recent state-of-the-art methods such as large Vision-Language Models (VLMs) address this issue to some extent but remain limited by their size and computational cost for real-time applications. Moreover, both approaches are constrained by the scarcity of multiparty interaction data and of annotations describing how individual nonverbal cues and emotional states contribute to social dynamics, that is, collective outcomes such as group engagement. We hypothesize that explicitly modeling individual-level states is essential for accurate group-level understanding. To this end, we present Social-Qwen, a two-stage framework that first analyzes each participant's nonverbal cues and emotions, then infers group-level engagement using instruction-tuned representations. To mitigate the lack of individual annotations in group datasets, we employ knowledge distillation to transfer supervision signals. Experiments on the OUC-CGE dataset show that Social-Qwen significantly outperforms prior end-to-end baselines and achieves state-of-the-art performance in group engagement analysis, demonstrating the promise of instruction tuning for scalable social intelligence in robots. We further evaluate robustness by testing generalization to (1) an in-house dataset spanning multiple social activities and (2) the estimation of other social dynamics such as group harmony. Results suggest consistent performance, highlighting Social-Qwen as a promising approach toward real-time social intelligence for intelligent agents.

Abstract:
In robotic-assisted minimally invasive surgery, an assistant surgeon stands at the bedside to insert and manipulate instruments while the primary surgeon operates the robot. Augmented reality (AR) head-mounted displays (HMDs) may improve the assistant's spatial awareness, but require tracking of surgical tools (both robotic and hand-held) for accurate overlay. In this work, we propose a markerless method to estimate the 6-DoF trocar pose for the assistant port, which can convey the insertion trajectory of any handheld instrument to the assistant surgeon. The method is based on a deep U-Net architecture with cross-attention and Atrous Spatial Pyramid Pooling (ASPP) to predict 2D keypoints on the trocar, which are then used by a Perspective-n-Point (PnP) method to estimate the trocar's pose. From the predicted trocar pose, we can also directly find the 4-DoF shaft-line of the handheld instrument using a multi-view method; this enables correction for misalignment of the trocar and instrument shaft. The trocar tracking runs in real-time (66 Hz) and can be integrated into an AR-assisted workflow. Experimental results with a phantom show an accuracy of ~5.5 mm and angle error of ~1.9 degrees, which is sufficient to guide instrument insertion into the endoscope field of view.

Abstract:
Vision Language Navigation (VLN) typically requires agents to navigate to specified objects or remote regions in unknown scenes by obeying linguistic commands. Such tasks require organizing historical visual observations for linguistic grounding, which is critical for long-sequence navigational decisions. However, current agents suffer from overly detailed scene representation and ambiguous vision-language alignment, which weaken their comprehension of navigation-friendly high-level scene priors and easily lead to behaviors that violate linguistic commands. To tackle these issues, we propose a navigation policy by recursively summarizing along-the-way visual perceptions, which are adaptively aligned with commands to enhance linguistic grounding. In particular, by structurally modeling historical trajectories as compact neural grids, several Recursive Visual Imagination (RVI) techniques are proposed to motivate agents to focus on the regularity of visual transitions and semantic scene layouts, instead of dealing with misleading geometric details. Then, an Adaptive Linguistic Grounding (ALG) technique is proposed to align the learned situational memories with different linguistic components purposefully. Such fine-grained semantic matching facilitates the accurate anticipation of navigation actions and progress. Our navigation policy outperforms the state-of-the-art methods on the challenging VLN-CE and ObjectNav tasks, showing the superiority of our RVI and ALG techniques for VLN.

Abstract:
As a robot makes and breaks contact with environment surfaces, the equations of motion are switched. Task planning and real-time control become challenging as the system traverses multiple regions and switches the governing dynamics. This paper presents a modeling and real-time control methodology for such switched dynamical systems based on Koopman operator theory. Potentially, Koopman operators allow us to subsume segmented dynamics within a unified, globally linear model amenable to control analysis and synthesis. However, the original Koopman operators are not applicable to non-autonomous systems with exogenous input. A new method for converting robot dynamics to a Koopman-compatible model using actuator pre-filtering is presented and applied to the modeling and control of robots interacting with the environment. Specifically, an underactuated cart-pole robot bouncing against multiple walls is modeled as a Control-Coherent Koopman model and a Koopman LQR controller is designed for the wall-bouncing robot. Simulation experiments demonstrate the effectiveness of the method and investigate the effect of the actuator pre-filter parameter on control performance.

Abstract:
Control Barrier Functions (CBFs) have proven to be an effective tool for performing safe control synthesis for nonlinear systems. However, guaranteeing safety in the presence of disturbances and input constraints for high relative degree systems is a difficult problem. In this work, we propose the Robust Policy CBF (RPCBF), a practical approach for constructing robust CBF approximations online via the estimation of a value function. We establish conditions under which the approximation qualifies as a valid CBF and demonstrate the effectiveness of the RPCBF-safety filter in simulation on a variety of high relative degree input-constrained systems. Finally, we demonstrate the benefits of our method in compensating for model errors on a hardware quadcopter platform by treating the model errors as disturbances. Website including code: www.oswinso.xyz/rpcbf/

Affiliations: DFKI GmbH Robotics Innovation Center; Università Degli Studi Di Padova; University of Padova; Korea University; Technical University of Darmstadt; TU Darmstadt; Università Di Padova; Mitsubishi Electric Research Laboratories; Robotics Innovation Center, DFKI GmbH; German Research Center for Artificial Intelligence; Technische Universität Darmstadt; University of Bremen; Chalmers University of Technology

Abstract:
In robotics, many different approaches, ranging from classical planning over optimal control to reinforcement learning (RL), are developed and borrowed from other fields to achieve reliable control in diverse tasks. To get a clear understanding of their individual strengths and weaknesses and their applicability in real-world robotic scenarios, it is important to benchmark and compare their performance not only in simulation but also on real hardware. The 2nd AI Olympics with RealAIGym competition was held at the IROS 2024 conference to contribute to this cause and evaluate different controllers according to their ability to solve a dynamic control problem on an underactuated double pendulum system (Fig. 1) with chaotic dynamics. This paper describes the four different RL methods submitted by the participating teams, presents their performance in the swing-up task on a real double pendulum, measured against various criteria, and discusses their transferability from simulation to real hardware and their robustness to external disturbances.

Abstract:
The aspiration to replicate the capabilities of the human hand has driven innovations in the design of soft robotic hands. Despite these advancements, many existing designs of soft hands still lack effective in-hand vision and the ability for each finger to achieve active multi-degree-of-freedom motion. This article proposes a cable-driven soft robotic hand that can achieve dexterous grasping and manipulation, vision-guided grasping, vision-based slip detection and compensation, as well as visually servoed in-hand manipulation. The hand has five soft fingers, each capable of independent flexion/extension motion and bidirectional ad/abduction motion. A red-green-blue-depth (RGB-D) camera is integrated into the palm of the soft hand to enable in-hand vision capability. Modeling of the soft hand is established to analyze its kinematics, statics, and manipulability. A series of experiments are conducted to demonstrate its dexterous grasping and manipulation capabilities on a variety of objects. Using 3-D point cloud data from the in-palm camera, an effective vision-guided grasping strategy is developed to grasp objects on a table. The in-hand vision also enables slip detection and compensation during grasping to maintain the grasp stability. Furthermore, a hierarchical, visually servoed controller is developed to perform closed-loop in-hand object manipulation. With its high dexterity and visual feedback capabilities, the soft hand will find important applications such as household object manipulation and food picking/sorting, and may also be used as a prosthetic hand or an auxiliary hand for humans.

Abstract:
This paper presents the design and experimental validation of a magnetic end-effector optimized for the robust manipulation of a tethered magnetic endoscope in six degrees of freedom (6-DoF). The symmetrical end-effector integrates two pairs of permanent magnets that generate a stable 2D attraction zone with a diameter exceeding 60 mm. This feature enables stable gravity compensation and precise control of centimeter-scale magnetic endoscopes. Experimental results demonstrate 6-DoF manipulation, stability and robustness against external disturbances, and successful navigation through obstacles in a confined environment. The stable gravity compensation reduces friction and pressure on the endoscope's environment during navigation, which represents a key advantage for advancing minimally invasive medical procedures in general and colonoscopy in particular, where minimizing pressure on the colon is critical. Future work will focus on enhancing the system through active control of the tether's length.

Abstract:
This letter presents a distributed trajectory planning method for multi-agent aerial tracking. The proposed method uses a Dynamic Buffered Voronoi Cell (DBVC) and a Dynamic Inter-Visibility Cell (DIVC) to formulate the distributed trajectory generation. Specifically, the DBVC and the DIVC are time-variant spaces that prevent mutual collisions and occlusions among agents, while enabling them to maintain suitable distances from the moving target. We combine the DBVC and the DIVC with an efficient Bernstein polynomial motion primitive-based tracking trajectory generation method, which has been refined into a less conservative approach than in our previous work. The proposed algorithm can compute each agent's trajectory within several milliseconds on an Intel i7 desktop. We validate the tracking performance in challenging scenarios, including environments with dozens of obstacles.
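
As background for the cell-based constraints above, here is a minimal sketch of a membership test for a standard static buffered Voronoi cell. It is not the time-variant DBVC or DIVC from the abstract: the function name, the safety radius `r_s`, and the static-neighbor assumption are all illustrative. A point lies in agent i's cell if, for every neighbor j, it sits at least `r_s` on agent i's side of the perpendicular bisector between the two agents.

```python
import math
from typing import Iterable, Sequence


def in_buffered_voronoi_cell(p: Sequence[float],
                             p_i: Sequence[float],
                             neighbors: Iterable[Sequence[float]],
                             r_s: float) -> bool:
    """Return True if point p lies in agent i's buffered Voronoi cell.

    For each neighbor j, the signed distance of p to the bisector of
    (p_i, p_j), measured along the direction from i to j, must be
    at most -r_s (hypothetical safety radius).
    """
    for p_j in neighbors:
        direction = [b - a for a, b in zip(p_i, p_j)]
        dist = math.sqrt(sum(c * c for c in direction))
        if dist == 0.0:
            return False  # coincident agents: no valid cell
        midpoint = [(a + b) / 2.0 for a, b in zip(p_i, p_j)]
        signed = sum((pc - mc) * dc
                     for pc, mc, dc in zip(p, midpoint, direction)) / dist
        if signed + r_s > 0.0:
            return False  # too close to (or beyond) the bisector
    return True


if __name__ == "__main__":
    agent, other = (0.0, 0.0), (4.0, 0.0)
    print(in_buffered_voronoi_cell((1.0, 0.0), agent, [other], 0.5))
    print(in_buffered_voronoi_cell((1.8, 0.0), agent, [other], 0.5))
```

If every agent keeps its trajectory inside its own cell, any two agents stay at least 2 * r_s apart, which is why such cells suit distributed planning without explicit negotiation.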

Abstract:
In the realm of object pose estimation, scenarios involving both dynamic objects and moving cameras are prevalent. However, the scarcity of corresponding real-world datasets significantly hinders the development and evaluation of robust pose estimation models. This is largely attributed to the inherent challenges in accurately annotating object poses in dynamic scenes captured by moving cameras. To bridge this gap, this paper presents a novel dataset DynOPETs and a dedicated data acquisition and annotation pipeline tailored for object pose estimation and tracking in such unconstrained environments. Our efficient annotation method innovatively integrates pose estimation and pose tracking techniques to generate pseudo-labels, which are subsequently refined through pose graph optimization. The resulting dataset offers accurate pose annotations for dynamic objects observed from moving cameras. To validate the effectiveness and value of our dataset, we perform comprehensive evaluations using 18 state-of-the-art methods, demonstrating its potential to accelerate research in this challenging domain. The dataset will be made publicly available to facilitate further exploration and advancement in the field.

Abstract:
In extraterrestrial planetary environments, computing, energy, and environmental constraints require robotic agents to complete tasks unsupervised. For specialized extraterrestrial robotic drilling agents, there is no broadly applicable solution to detect drilling faults as they happen, before the fault escalates to hardware failure. We build upon previous work with time-series subspace analysis methods to estimate drilling faults using drill avionics telemetry. This work introduces a subsurface anomaly detection method for planetary drilling robots and further evaluates the robustness of our time-series subspace analysis method. We implemented this novel fault and anomaly detection method on an extraterrestrial drilling robot and evaluated it first in a controlled lab environment with composite materials and then in a Mars planetary analog site in the Canadian High Arctic.

Abstract:
The endurance and energy efficiency of drones remain critical challenges in their design and operation. To extend mission duration, numerous studies have explored perching mechanisms that enable drones to conserve energy by temporarily suspending flight. This paper presents a new perching drone that utilizes an active flexible perching mechanism inspired by the rapid predation mechanism of the Venus flytrap, achieving perching in less than 100 ms. The proposed system is designed for high-speed adaptability to the perching targets. The overall drone design is outlined, followed by the development and validation of the biomimetic perching structure. To enhance the system stability, a cascaded extended high-gain observer (EHGO)-based control method is developed, which can estimate and compensate for the external disturbance in real time. The experimental results demonstrate the adaptability of the perching structure and the superiority of the cascaded EHGO in resisting wind and perching disturbances.

Abstract:
Recent works have shown that foundational safe control methods, such as Hamilton-Jacobi (HJ) reachability analysis, can be applied in the latent space of world models. While this enables the synthesis of latent safety filters for hard-to-model vision-based tasks, they assume that the safety constraint is known a priori and remains fixed during deployment, limiting the safety filter's adaptability across scenarios. To address this, we propose constraint-parameterized latent safety filters that can adapt to user-specified safety constraints at runtime. Our key idea is to define safety constraints by conditioning on an encoding of an image that represents a constraint, using a latent-space similarity measure. The notion of similarity to failure is aligned in a principled way through conformal calibration, which controls how closely the system may approach the constraint representation. The parameterized safety filter is trained entirely within the world model's imagination, treating any image seen by the model as a potential test-time constraint, thereby enabling runtime adaptation to arbitrary safety constraints. In simulation and hardware experiments on vision-based control tasks with a Franka manipulator, we show that our method adapts at runtime by conditioning on the encoding of user-specified constraint images, without sacrificing performance. Video results can be found at https://any-safe.github.io.

Abstract:
This work proposes a machine learning approach for the Three-Point Dubins Problem (3PDP) based on classification and regression. The 3PDP is a path planning problem with Dubins curves through three waypoints. It requires finding the heading at the intermediate point and the form of the two Dubins paths joining the three points. Classification is used to select the correct path type (out of 18) to avoid the trial-and-error enumeration of all cases; regression is employed to obtain a good initial guess for finding the heading angle. Our results are used to improve and speed up existing methods in terms of efficiency and accuracy.

Abstract:
Recently, active vision has reemerged as an important concept for manipulation, since visual occlusion occurs more frequently when main cameras are mounted on the robot heads. We reflect on the visual occlusion issue and identify its essence as the absence of information useful for task completion. Inspired by this, we formulate the more fundamental problem of Exploratory and Focused Manipulation (EFM). The proposed problem is about actively collecting information to complete challenging manipulation tasks that require exploration or focus. As an initial attempt to address this problem, we establish the EFM-10 benchmark that consists of 4 categories of tasks that align with our definition (10 tasks in total). We further propose a Bimanual Active Perception (BAP) strategy, which leverages one arm to provide active vision and another arm to provide force sensing while manipulating. Based on this idea, we collect a dataset named BAPData for the tasks in EFM-10. With the dataset, we successfully verify the effectiveness of the BAP strategy in an imitation learning manner. We hope that the EFM-10 benchmark along with the BAP strategy can become a cornerstone that facilitates future research in this direction. Project website: EFManipulation.github.io.

Abstract:
In Human-Robot Interaction research, assessing how humans understand the robots they interact with is crucial, particularly when studying the impact of explainability and transparency. Some studies evaluate objective understanding by analysing the accuracy of users' mental models, while others rely on perceived, self-reported levels of subjective understanding. We hypothesise that the two dimensions of understanding may diverge, which makes them complementary methods to assess the effects of explainability on users. In our study, we track the weekly progression of the users' understanding of an autonomous robot operating in a healthcare centre over five weeks. Our results reveal a notable mismatch between objective and subjective understanding. In areas where participants lacked sufficient information, the perception of understanding, i.e. subjective understanding, rose with increased contact with the system while their actual understanding, i.e. objective understanding, did not. We attribute these results to inaccurate mental models that persist due to limited feedback from the system. Future research should clarify how both objective and subjective dimensions of understanding can be influenced by explainability measures, and how these two dimensions of understanding affect other desiderata such as trust or usability.

Abstract:
Current passive or semi-active shoulder exoskeletons for overhead work provide fixed assistive torque for all participants and tasks, which lacks adaptability. In addition, due to the need to store energy at low elevation angles, they may increase physical demand on the user when assistance is not required. This study presents a novel semi-active shoulder exoskeleton that can provide a free mode (i.e., no assistance) and personalized assistive torque to assist overhead work. The exoskeleton includes a motorized torque generator and a hybrid control strategy. The motorized torque generator, equipped with a servo motor and an encoder, is characterized by its ability to electrically adjust the peak assistive torque angle and peak torque. In addition, we propose a hybrid control strategy with free and assistive modes. The free mode allows the exoskeleton to not interfere with movements that do not require assistance. The assistive mode provides personalized torque with three levels based on the height and weight of the user. Experimental results validated the exoskeleton's mechanical performance (e.g., high backdrivability) and its assistive effectiveness. The results showed that the exoskeleton could reduce shoulder muscle activation by up to 55.03% and demonstrated a significant difference compared to fixed assistance.

Abstract:
Robotic compliance control is critical for delicate tasks such as electronic connector assembly, where precise force regulation and adaptability are paramount. However, traditional methods often struggle with modeling inaccuracies and sensor noise. Inspired by human adaptability in complex assembly operations, we present RoboMT, a novel framework that integrates a Mamba algorithm with a Transformer architecture to achieve human-like compliance control. By leveraging a bilateral teleoperation platform, we collect extensive real-time force/torque and motion data to form a comprehensive dataset for training. Furthermore, RoboMT incorporates an Adaptive Action Chunk module and a Temporal Fusion module to ensure smooth and robust action prediction. Experimental results across four electronic assembly tasks show that RoboMT achieves superior success rates (62–98%) over baselines (29–98%), while maintaining stable force regulation around 2.5 N, closely resembling human performance. During task transitions, RoboMT quickly stabilizes at 5 N with minimal overshoot, avoiding the large force spikes (over 24 N) seen in baselines. Additionally, RoboMT maintains an average inference speed of 55 ms per batch, balancing real-time responsiveness and control robustness. Overall, RoboMT presents a compelling pathway toward error-minimized, human-level compliance control, and generalization for real-world robotic assembly, setting a new benchmark for precision, adaptability, and robustness in robotic assembly.

Abstract:
The iterative closest point (ICP) registration algorithm has been a preferred method for light detection and ranging (LiDAR)-based robot localization for nearly a decade. However, even in modern simultaneous localization and mapping (SLAM) solutions, ICP can degrade and become unreliable in geometrically ill-conditioned environments. In response, this work investigates and compares new and existing degeneracy mitigation methods for robust LiDAR-based localization and analyzes the efficacy of these approaches in degenerate environments for the first time in the literature at this scale. Specifically, this work investigates i) the effect of using active or passive degeneracy mitigation methods for the problem of ill-conditioned ICP in LiDAR degenerate environments and ii) the evaluation of truncated singular value decomposition (TSVD), inequality constraints (Ineq. Con.), and linear/nonlinear Tikhonov regularization for the application of degenerate point cloud registration for the first time. Furthermore, a sensitivity analysis for the least-squares minimization step of the ICP problem is carried out to better understand how each method affects the optimization and what to expect from each method. The results of the analysis are validated through multiple real-world robotic field and simulated experiments. The analysis demonstrates that active optimization degeneracy mitigation is necessary and advantageous in the absence of reliable external estimate assistance for LiDAR-SLAM.
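
For orientation, two of the regularization schemes named above have standard textbook forms for a generic linear least-squares step. The symbols below (system matrix A, right-hand side b, truncation threshold \tau, regularization weight \lambda) are illustrative assumptions and not the paper's notation; the nonlinear Tikhonov and inequality-constrained variants are not shown.

```latex
\begin{aligned}
&\min_{x}\ \|A x - b\|_2^2, \qquad A = \sum_{i=1}^{n} \sigma_i\, u_i v_i^{\top} \\
&x_{\mathrm{TSVD}} = \sum_{i:\ \sigma_i \ge \tau} \frac{u_i^{\top} b}{\sigma_i}\, v_i \\
&x_{\mathrm{Tik}} = \left(A^{\top} A + \lambda I\right)^{-1} A^{\top} b
  = \sum_{i=1}^{n} \frac{\sigma_i}{\sigma_i^{2} + \lambda}\, \left(u_i^{\top} b\right) v_i
\end{aligned}
```

TSVD discards the directions with small singular values entirely, whereas Tikhonov regularization damps them smoothly; both keep the update bounded along directions that the geometry constrains poorly.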

Abstract:
Because of the non-stationary nature of electroencephalogram (EEG) signals, traditional non-invasive brain-computer interfaces (BCIs) usually only produce discrete commands, limiting their ability to control external devices continuously. This study proposes a novel BCI control strategy mapping multiple discrete commands to continuous motion, enabling real-time manipulation of a drone in four degrees of freedom (DOF). Our strategy used the fast steady-state visual evoked potential (SSVEP) encoding and decoding method to convert user intentions into the drone's flight status in near real-time. Simultaneously, the drone's live video was embedded into the SSVEP stimuli, providing users with a first-person perspective control experience. In drone control experiments, participants successfully maneuvered the drone through complex path-following tasks in simulated and physical scenarios. The mean flight trajectory bias ratio was measured as 0.81, with a mean flight smoothness of -3.31 (measured by spectral arc length) and mean Fitts's throughput of 9.18 bits/min. Notably, the brain-to-hand ratio (BHR) for all metrics approached 1, indicating that our non-invasive control system achieved comparable performance to manual control systems. These results suggest the effectiveness of our proposed BCI control strategy that maps discrete commands to continuous motion and extends the capabilities of non-invasive BCIs in continuous control scenarios.

Abstract:
Soft robots can revolutionize several applications with high demands on dexterity and safety. When operating these systems, real-time estimation and control require fast and accurate models. However, prediction with first-principles (FP) models is slow, and learned black-box models have poor generalizability. Physics-informed machine learning offers excellent advantages here, but it is currently limited to simple, often simulated systems without considering changes after training. We propose physics-informed neural networks (PINNs) for articulated soft robots (ASRs) with a focus on data efficiency. The amount of expensive real-world training data is reduced to a minimum - one dataset in one system domain. Two hours of data in different domains are used for a comparison against two gold-standard approaches: In contrast to a recurrent neural network, the PINN provides a high generalizability. The prediction speed of an accurate FP model is exceeded with the PINN by up to a factor of 467 at slightly reduced accuracy. This enables nonlinear model predictive control (MPC) of a pneumatic ASR. Accurate position tracking with the MPC running at 47 Hz is achieved in six dynamic experiments.

Abstract:
In the era of Large Language Models (LLMs), embodied artificial intelligence presents transformative opportunities for robotic manipulation tasks. Ultrasound imaging, a widely used and cost-effective medical diagnostic procedure, faces challenges due to the global shortage of professional sonographers. To address this issue, we propose USPilot, an embodied robotic assistant ultrasound system powered by an LLM-based framework to enable autonomous ultrasound acquisition. USPilot is designed to function as a virtual sonographer, capable of responding to patients' ultrasound-related queries and performing ultrasound scans based on user intent. By fine-tuning the LLM, USPilot demonstrates a deep understanding of ultrasound-specific questions and tasks. Furthermore, USPilot incorporates an LLM-enhanced Graph Neural Network (GNN) to manage ultrasound robotic APIs and serve as a task planner. Experimental results show that the LLM-enhanced GNN achieves unprecedented accuracy in task planning on public datasets. Additionally, the system demonstrates significant potential in autonomously understanding and executing ultrasound procedures. These advancements bring us closer to achieving autonomous and potentially unmanned robotic ultrasound systems, addressing critical resource gaps in medical imaging.

Abstract:
Evolutionary pressures have pushed humans to become efficient walkers, but inefficient divers. People consume more energy to travel the same distance underwater than on land. In diverse overground locomotion, emerging exoskeletons have reduced the metabolic cost of humans. Can we also improve the energy economy in underwater locomotion via exoskeletons? Here, we propose an underwater exoskeleton to assist scuba diving using flutter kick, by applying assistive knee extension torque during the strike phase of the diving kick cycle. When divers wore the powered exoskeleton, the average net air cost across six experienced divers was reduced by 22.7±10.0%, and the peak quadriceps activation was decreased by 20.9±7.5%, compared with normal diving without the exoskeleton. The average gastrocnemius activation also decreased by 20.6±5.3%, suggesting that the divers sufficiently utilized the exoskeleton assistance. These results indicate that applying exoskeleton assistance is conducive to improving the endurance of human underwater diving and enhancing our ability to explore the underwater world. Our study extends the application boundary of wearable robots, and provides a reference for the

Abstract:
The rimless wheel, one of the simplest walking models, has been widely studied as a theoretical framework for bipedal locomotion. This study introduces a dual rimless wheel (DRW) connected by elastic elements for maintaining the body shape and investigates its passive locomotion capability through numerical simulations. Simulation results reveal that as the stiffness of the elastic elements increases, the walking behavior approaches that of a rigid rimless wheel, resulting in higher forward velocity. Conversely, lower stiffness enhances body flexibility and enables the generation of low-speed gaits with remarkably small energy loss. These findings suggest that the DRW may be advantageous in environments where collisions have strong impacts, such as compliant terrains. Furthermore, through comparative simulations with several other models, including the rigid rimless wheel, we demonstrate that the low-stiffness DRW model can generate clearly slower passive locomotion while maintaining a feasible walking region. On the other hand, a basic prototype experiment indicates that the low-stiffness DRW model achieves more stable and faster walking than the high-stiffness DRW model on low-angle slopes. While the results do not imply that the DRW is universally optimal, they provide new insights into generating soft and stable gaits and underline the usefulness of tensegrity mechanisms.

Abstract:
This paper proposes a BIT-based collision-free path planning method for remotely operated Maritime Autonomous Surface Ships (MASS) under target ship position uncertainty. In a remote operating environment, the Remote Operating Center (ROC) receives target ship position data from long-range sensors such as AIS and radar. This data inherently contains uncertainty due to measurement errors and communication delays, making it essential to account for such uncertainty during path planning. As MASS must comply with the International Regulations for Preventing Collisions at Sea (COLREGs) during navigation, path planning must also reflect these requirements. The proposed method models target ship position uncertainty as a two-dimensional Gaussian distribution and incorporates it into edge evaluation as a collision risk cost, while applying penalty costs to edges that enter COLREGs non-compliant regions. A turning radius constraint is also incorporated into the edge selection process to ensure navigational feasibility. The method is validated through head-on and crossing encounter simulations on an Electronic Nautical Chart (ENC)-based grid map of Ulsan Port, South Korea. The results show that higher levels of position uncertainty lead to more conservative avoidance paths, resulting in greater Distance to Closest Point of Approach (DCPA).
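
The idea of turning a two-dimensional Gaussian position model into an edge cost can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's cost function: the Gaussian is taken to be axis-aligned (no x-y correlation), the cost is taken to be the peak density sampled along the edge, and the names `gaussian_pdf_2d` and `edge_risk_cost` are hypothetical.

```python
import math


def gaussian_pdf_2d(point, mean, sigma_x, sigma_y):
    """Density of an axis-aligned 2D Gaussian (zero x-y correlation is an assumption)."""
    dx = (point[0] - mean[0]) / sigma_x
    dy = (point[1] - mean[1]) / sigma_y
    return math.exp(-0.5 * (dx * dx + dy * dy)) / (2.0 * math.pi * sigma_x * sigma_y)


def edge_risk_cost(start, end, target_mean, sigma_x, sigma_y, n_samples=20):
    """Hypothetical edge cost: peak target-ship density at evenly spaced points on the edge."""
    peak = 0.0
    for i in range(n_samples + 1):
        t = i / n_samples
        p = (start[0] + t * (end[0] - start[0]),
             start[1] + t * (end[1] - start[1]))
        peak = max(peak, gaussian_pdf_2d(p, target_mean, sigma_x, sigma_y))
    return peak


# An edge passing through the estimated target position is penalized more
# heavily than a parallel edge offset by five standard deviations.
near = edge_risk_cost((-10.0, 0.0), (10.0, 0.0), (0.0, 0.0), 1.0, 1.0)
far = edge_risk_cost((-10.0, 5.0), (10.0, 5.0), (0.0, 0.0), 1.0, 1.0)
print(near > far)
```

In a sampling-based planner, a cost of this kind would be added to the geometric edge length during edge evaluation, alongside any rule-based penalties.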

Abstract:
In soft robotics, actuators using both positive and negative pressures are notable for their high payload-to-weight ratios and wide operating ranges, but they require separate power sources. A single-pump system generating dual pressures presents a promising solution, though addressing pressure fluctuations due to coupled dynamics remains a challenge. In this work, we propose a reinforcement learning (RL)-based controller capable of tracking both pressures over a wide range. To facilitate RL training, we built a simulator that models not only airflow dynamics but also the pump's kinematics and the electromagnetic behavior of pneumatic components. Our controller employs Model-Predicted Observation (MPObs) to predict future input effects and mitigate nonlinearities, and uses a Conditioning for Action Policy Smoothness (CAPS)-based action smoothing to reduce abrupt input changes. Experimental results show that the proposed RL controller achieves root-mean-square errors (RMSEs) of 0.6935 kPa (positive) and 0.2646 kPa (negative), outperforming the Disturbance Observer (DOB)-based approach. Ablation studies confirm the synergistic effect of MPObs and CAPS, underscoring their importance in control. Furthermore, robustness tests with external loads from 0 to 20 kg demonstrate a maximum RMSE of 0.7906 kPa (positive) and 0.1186 kPa (negative), indicating strong robustness. This study verifies that our proposed RL-based controller overcomes the nonlinear challenges of pneumatic power sources and highlights its potential for future stand-alone systems in field applications.

Abstract:
Policy learning often encounters difficulties in long-horizon tasks. Subgoal-conditioned policies address long-horizon problems by decomposing them into manageable segments, but they usually struggle with identifying informative subgoals. To address this limitation, we propose PDS (planning with dual subgoal), an architecture that learns short-horizon and low-variance subgoals in embedding space, ensuring that planning is both reachable and consistent. We begin by analyzing the impact of horizon and consistency on the performance of subgoal-conditioned policies. We evaluate the performance of commonly used subgoal definitions (time-based, visual-based, and language-based) in tasks with different lengths. Subsequently, we demonstrate that our approach, which predicts and conditions on dual subgoals, improves success rates and enhances stability across diverse tasks in simulation and the real world.

Abstract:
The Robot Operating System (ROS) is widely adopted in the robotics community, powering applications from self-driving vehicles to industrial automation. ROS 2 utilizes the Data Distribution Service (DDS) middleware for decentralized communication, making it inherently susceptible to reconnaissance and exploitation attacks. Previous research has examined the security implications of DDS implementations but has not systematically distinguished ROS 2 nodes from standalone DDS deployments, a critical distinction that significantly influences the execution and outcome of cyberattacks. This paper presents the first systematic fingerprinting framework designed specifically for ROS 2, demonstrating how DDS-based metadata leakage can facilitate precise identification and targeted exploitation of robotic systems. Through controlled experiments and an Internet-wide scan of DDS deployments, we identify extensive metadata exposure across actively supported ROS 2 implementations. Despite existing security solutions such as Secure ROS 2 (SROS2), deployments using default configurations remain vulnerable, highlighting the need for enhanced metadata obfuscation, stricter network access policies, and deployment of real-time anomaly detection mechanisms to strengthen the security posture of ROS 2 systems.

Abstract:
Achieving human-like dexterous manipulation is essential for general-purpose robots but remains a challenge. Recent advances in Vision-Language-Action (VLA) models offer the potential to learn flexible skills from demonstration data. However, training effective VLAs requires a large amount of high-quality data, which is difficult to obtain: fully manual teleoperation cognitively overloads human operators, while automated planning produces unnatural motions and lacks data diversity. We present a Shared Autonomy framework: a human operator teleoperates the arm for global motion, while an autonomous DexGrasp-VLA policy, as an AI Copilot, generates force-adaptive actions for a five-finger hand with tactile feedback -- drastically reducing human effort and enabling efficient collection of high-quality demonstrations. Using these data, we train an end-to-end VLA policy with a novel Arm-Hand Feature Enhancement module -- shared representations are combined with separate arm and hand latent features, representing the distinct dynamics of macro and micro movements, leading to more robust and natural coordination of arm-hand motions. Our Corrective Teleoperation can further refine the policy with failure-recovery demonstrations via human intervention. Experiments show our approach efficiently generates high-quality data and learns policies with a high success rate and natural behaviors. The trained arm-hand VLA policy generalizes effectively to both seen and unseen objects, with a success rate of around 90% across more than 50 diverse objects.

Abstract:
This paper introduces a suspension-integrated tilting mechanism for narrow-track mobility robot platforms, offering a novel means of achieving commanded body tilt without departing from conventional suspension layouts. The architecture provides a two-degree-of-freedom roll path with inherent passive self-centering, thereby reconciling mechanical simplicity with enhanced motion capability. A linearized dynamic model forms the foundation for a dual-layer control scheme, consisting of a reference generator and a high-bandwidth tracking loop. The proposed approach is rigorously evaluated through hardware-in-the-loop simulations that combine a real-time driving simulator with a physical hardware bench. Relative to a non-tilting baseline, the platform demonstrates substantial reductions in perceived lateral acceleration and lateral load transfer, signifying improved ride comfort and rollover stability. These findings establish suspension-integrated tilting with model-based control as a compelling pathway toward safe and stable next-generation mobility robot platforms.

Abstract:
Robots are increasingly being deployed in public spaces such as shopping malls, sidewalks, and hospitals, where safe and socially aware navigation depends on anticipating how pedestrians respond to their presence. However, existing datasets rarely capture the full spectrum of robot-induced reactions, e.g., avoidance, neutrality, attraction, which limits progress in modeling these interactions. In this paper, we present the Pedestrian-Robot Interaction (PeRoI) dataset that captures pedestrian motions categorized into attraction, neutrality, and repulsion across two outdoor sites under three controlled conditions: no robot present, with a stationary robot, and with a moving robot. This design explicitly reveals how pedestrian behavior varies across robot contexts, and we provide qualitative and quantitative comparisons to established state-of-the-art datasets. Building on these data, we propose the Neural Robot Social Force Model (NeuRoSFM), an extension of the Social Force Model that integrates neural networks to augment inter-human dynamics with learned components and explicit robot-induced forces to better predict pedestrian motion in the vicinity of robots. We evaluate NeuRoSFM by generating trajectories on multiple real-world datasets. The results demonstrate improved modeling of pedestrian-robot interactions, leading to better prediction accuracy, and highlight the value of our dataset and method for advancing socially aware navigation strategies in human-centered environments.

Abstract:
Inspired by the design of Vision-based Tactile Sensors (VTSs) and soft resistive strain sensors, in this paper, we introduce MINT: a vision-based soft sensor for Mutual Integration of Normal interaction force and Texture perception. MINT is a hybrid vision-based tactile sensor that simultaneously integrates normal force measurement with high-resolution texture perception. This unique sensor utilizes a soft resistive strain sensor between the Gel Layer and Mirror Layer of a typical VTS. By combining electrical and visual sensing modalities, MINT overcomes the limitations of existing resistive sensors and VTSs, offering a robust, efficient, and scalable solution for direct measurement of force and texture capture. To evaluate MINT's functionality, we first propose a unique design and fabrication procedure. Next, we conduct a series of experiments, evaluating its force and texture sensing capabilities through interactions with various rigid objects.

Abstract:
Automatically generating robot applications from natural language promises to lower the barrier to automation, but remains difficult in domains that demand reliability and transparency, such as industrial assembly or collaborative manipulation. End-to-end policies and large language model (LLM)-based planners can map instructions to robot behaviors, but they often lack interpretability and provide limited assurance of correctness. We present a framework that composes applications from modular, application-independent atomic skills expressed as Behavior Trees (BTs). BTs are constructed and validated against an ontology-level dual graph to enforce control-flow and data-flow consistency before execution, ensuring transparency and structural correctness. Application-level parameters are optimized offline in simulation using Monte Carlo Tree Search guided by LLM-derived priors. Rather than serving as a runtime optimizer, this process systematically explores interdependent parameters, producing a dataset of reliable parameterizations that can support future gating mechanisms for online adaptation. The framework is validated in a physical robotic setup, demonstrating transparent and consistent offline generation of deployable applications, and laying the foundation for adaptive, real-time systems.

Abstract:
Robotic systems present unique safety challenges due to their complex integration of computational and physical processes and direct interaction with humans and environments. Traditional approaches to robot safety planning either rely on conventional methods, which struggle with the complexity of modern robotic systems, or on pure machine learning techniques, which lack formal safety guarantees. While recent advances in Large Language Models (LLMs) offer promising capabilities, pre-trained LLMs alone lack the specific domain expertise required for effective robotic safety planning. This paper introduces SafeNet, a novel neural-symbolic network architecture that enhances LLMs' safety planning capabilities through formal method-guided fine-tuning for robotic applications. Our approach integrates formal logical knowledge and reward machines into pre-trained LLMs by carefully designed fine-tuning, creating a neural-symbolic approach that combines the flexibility of neural networks with the precision of formal methods for robot trajectory generation and task planning. Experimental results demonstrate significant improvements in safe trajectory generation for robotic systems, with planning success rates increasing from 1.17% to 91.60% for the block manipulation task and from 7.23% to 90.63% for the robotic path planning task.

Abstract:
We propose a visual-servoing and obstacle-avoidance controller for a wheeled mobile robot (WMR) with a two-axis gimbal camera that operates without mapping, using only vision and lightweight forward sensing. A task-allocation MPC with online terminal-cost iteration is introduced. Specifically, task projection in the image-feature space mitigates underactuation and coupling-induced local optima; Virtual Imaging Constraint Guidance (VICG) yields a visibility-preserving heading reference that steers the trajectory around obstacles; and an Approximate Dynamic Programming (ADP) module learns a context-aware terminal cost online, providing long-horizon guidance for mid-horizon prediction. Relying solely on image feedback plus lightweight ranging, the method coordinates the WMR and gimbal to accomplish obstacle avoidance and visual-servo tracking jointly. Hardware experiments validate the feasibility and effectiveness of the proposed approach.

Abstract:
To increase the safety and reliability of autonomous driving systems in complex traffic environments, this paper proposes a novel 3D multiobject tracking (MOT) method that integrates center-plane adaptive multisensor fusion, motion compensation, and multilevel data association. Unlike traditional methods, our approach employs a center-plane adaptive fusion strategy to align LiDAR and visual data precisely, mitigating errors in the target width caused by pose variations, and improving tracking accuracy. To address vehicle motion-induced association errors in dynamic scenarios, we incorporate IMU and GPS data for high-frequency vehicle pose estimation and compensation, ensuring stable and robust target association. Additionally, a rotational geometric distance intersection-over-union (RGDIoU) cost function is introduced, combined with multilevel spatial indexing, to optimize the data association efficiency and accuracy. The experimental results on benchmark datasets, including KITTI and nuScenes, demonstrate that our method achieves state-of-the-art (SOTA) performance across multiple tracking metrics, including HOTA and sAMOTA, while maintaining real-time performance at 90 FPS. Specifically, our method improves sAMOTA tracking accuracy by 13% over the best existing methods and achieves a HOTA score of 50.24%, surpassing all compared methods.

Abstract:
Recent advances in imitation learning and vision-language models highlight the need for high-fidelity tactile perception, with 6-DoF tactile object pose estimation providing a crucial foundation for precise robotic manipulation. We introduce InvariantCloud, a 6-DoF pose estimation framework that leverages the global invariance of surface marker constellations on vision-based tactile sensors. In contrast to recent approaches, our one-shot globally invariant point cloud registration suppresses cumulative drift and overcomes long-standing limitations in accurately estimating yaw (Z-axis) rotation. Experimental verifications show that InvariantCloud achieves superior yaw tracking accuracy and re-localization repeatability compared to existing benchmarks, demonstrating its precision and robustness in long-sequence manipulation tasks.

Abstract:
Relay robots are crucial for extending communication when a client robot performs long-range missions. However, existing network quality prediction models and relay planning methods often struggle with real-time operation due to their high computational cost and poor adaptability to frequently changing missions. To address this, we propose a real-time communication relay system featuring two key contributions. First, a low-complexity network quality prediction model using Kalman filter-based Gaussian process regression achieves efficient online inference with constant-time updates (~0.02s). Second, a hierarchical relay planning strategy, employing a Monte Carlo tree search-based sequential planner, generates communication-aware trajectories satisfying network constraints at discrete steps. Real-world experiments validate our system's effectiveness, demonstrating near-continuous network availability (99.1% channel reliability) and boosting the packet delivery ratio from a baseline of 44.7% to 73.7%. Our integrated approach offers a practical and robust solution for dynamic indoor missions.

Abstract:
Mobile robot platforms are increasingly being used to automate information-gathering tasks such as environmental monitoring. Efficient target tracking in dynamic environments is critical for applications such as search and rescue and pollutant cleanups. In this letter, we study active mapping of floating targets that drift due to environmental disturbances such as wind and currents. This is a challenging problem as it involves predicting both spatial and temporal variations in the map due to changing conditions. We introduce an integrated framework combining dynamic occupancy grid mapping and an informative planning approach to actively map and track freely drifting targets with an autonomous surface vehicle. A key component of our adaptive planning approach is a spatiotemporal prediction network that predicts target position distributions over time. We further propose a planning objective for target tracking that leverages these predictions. Simulation experiments show that this planning objective improves target tracking performance compared to existing methods that consider only entropy reduction as the planning objective. Finally, we validate our approach in field tests, showcasing its ability to track targets in real-world monitoring scenarios.

Abstract:
Percutaneous Left Atrial Appendage Closure (LAAC) is a minimally invasive procedure to prevent thromboembolic events in atrial fibrillation patients. The procedure's success relies on precise navigation and occluder deployment, which is challenged by sheath movement in the dynamic cardiac environment, procedural complexity, and prolonged radiation exposure. This study introduces a robotic-assisted navigation system for the LAAC procedure, integrating a dedicated steerable sheath, custom planning algorithms, and an intuitive teleoperation interface. The path-planning framework generates collision-free routes based on patient-specific anatomy, adjusting for deviations in real-time. The teleoperation interface comprises a digital twin of the patient's anatomy with real-time visual feedback to the user for precise and intuitive navigation. Bench-top validation demonstrated that navigation guidance reduced target position error by 2.03% with the planner and 2.85% with the replanner, compared to free navigation without planning assistance. Planning and replanning strategies also reduced collisions with cardiac structures, highlighting the platform's potential to improve procedural precision and safety.

Abstract:
"Frail" elderly often experience walking impairments that limit independence and sustained physical activity. Although various assistive devices exist, many rely on single-mode control, limiting adaptability, responsiveness to gait variability, and voluntary motion. To address these limitations, we developed a wearable ankle-assist device with real-time gait phase recognition and multi-mode control. Sensor fusion of inertial and plantar-pressure data enables robust five-phase segmentation, with optimal weights tuned by Particle Swarm Optimization. Based on the detected gait phase, the controller dynamically switches between speed, torque, and free modes, adapting to cadence variations. Treadmill experiments showed that mixed control increased walking distance (251 m to 282 m, p < 0.05) and reduced heart rate change (20% to 10%, p < 0.01). Gait analysis confirmed comfort and less resistance. These findings demonstrate that phase-aware adaptive assistance balances propulsion and natural motion, supporting mobility and reducing strain. This framework provides a practical basis for wearable ankle-assist systems in elderly rehabilitation and daily use.

Abstract:
Robotic technologies are increasingly used in dentistry for their precision in delicate procedures. While most dental robots focus on implant surgery, automating root canal treatment (RCT) remains challenging due to the need to guide a thin, flexible endodontic file through a narrow, curved root canal without causing ledging or file fracture. Patient movements, particularly those that induce additional file bending during insertion, further complicate robot-assisted procedures. This study presents an autonomous approach for root canal cleaning and shaping by combining force admittance and position tracking. A novel Patient Tracking Module, which connects the patient's dental brace to the robot end-effector via string potentiometers, is developed to estimate real-time robot-patient pose. Additionally, a file flexibility model is proposed to predict and compensate for file deflection during insertion. A hybrid position/force control strategy, which integrates these estimations, autonomously guides file manipulation, minimizes misalignment, and therefore reduces the risk of file fracture. Experimental validation demonstrates the system's feasibility and potential for clinical application in precision endodontic procedures.

Abstract:
We propose MMD-OPT: a sample-efficient approach for minimizing the risk of collision under arbitrary prediction distributions of dynamic obstacles. MMD-OPT is based on embedding distributions in a Reproducing Kernel Hilbert Space (RKHS) and the associated Maximum Mean Discrepancy (MMD). We show how these two concepts can be used to define a sample-efficient surrogate for collision risk estimation. We perform extensive simulations to validate the effectiveness of MMD-OPT on both synthetic and real-world datasets. Importantly, we show that trajectory optimization with our MMD-based collision risk surrogate leads to safer trajectories in low-sample regimes than popular alternatives based on Conditional Value at Risk (CVaR).
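
For readers unfamiliar with the MMD quantity mentioned above, the sketch below computes the standard biased sample estimate of squared MMD between two point sets with a Gaussian (RBF) kernel. It illustrates only the generic statistic, not the MMD-OPT collision-risk surrogate. The kernel choice, the `bandwidth` parameter, and the function names are assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as equal-length tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(samples_p, samples_q, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(samples_p, samples_p)
            + mean_kernel(samples_q, samples_q)
            - 2.0 * mean_kernel(samples_p, samples_q))


# Two small, hypothetical 2D sample sets: one near the origin, one shifted away.
near_origin = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1)]
shifted = [(3.0, 3.0), (3.1, 2.9), (2.9, 3.1)]
print(mmd_squared(near_origin, near_origin), mmd_squared(near_origin, shifted))
```

The estimate is zero when both sample sets coincide and grows as the two empirical distributions separate, which is what makes it usable as a distance between sample-based distributions.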

Abstract:
Object manipulation is a fundamental challenge in robotics, where systems must balance trade-offs among manipulation capabilities, system complexity, and throughput. Distributed manipulator systems (DMS) use the coordinated motion of actuator arrays to perform complex object manipulation tasks, seeing widespread exploration within the literature and in industry. However, existing DMS designs typically rely on high actuator densities and impose constraints on object-to-actuator scale ratios, limiting their adaptability. We present a novel DMS design utilizing an array of 3-DoF, origami-inspired robotic tiles interconnected by a compliant surface layer. Unlike conventional DMS, our approach enables manipulation not only at the actuator end effectors but also across a flexible surface connecting all actuators, creating a continuous, controllable manipulation surface. We analyse the combined workspace of such a system, derive simple motion primitives, and demonstrate its capabilities to translate simple geometric objects across an array of tiles. By leveraging the inter-tile connective material, our approach significantly reduces actuator density, increasing the area over which an object can be manipulated by approximately 1.84× without an increase in the number of actuators. This design offers a lower cost and complexity alternative to traditional high-density arrays and introduces new opportunities for manipulation strategies that leverage the flexibility of the interconnected surface.

Abstract:
Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and shape-agnostic local motion estimations that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused for predicting drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometry variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show our method produces drift-free global translation under a long recording time and reduces foot-skating effects.

Abstract:
A novel, compact, backdrivable 6-degree-of-freedom (DOF) hybrid parallel robot with a large axisymmetric workspace is proposed, referred to as CoPaRo, short for Compact Parallel Robot. The architecture achieves a high workspace-to-footprint ratio comparable to that of serial robots. The proposed robot is well suited for physical human-robot interaction (pHRI) due to its low inertia, backdrivability, and large workspace. A complete kinematic analysis is provided, including forward and inverse kinematics and velocity equations. All singularity conditions of the proposed architecture are identified, and the complete usable workspace is presented, accounting for singularities, mechanical interferences, and numerical stability. A CAD model and computer animations of the robot are provided to illustrate its motion, highlighting both the compact footprint and the large workspace. The actuators are positioned close to the base and transmit motion to distal joints via pulleys to reduce the robot's inertia. Direct-drive or quasi-direct-drive actuators can be used to enable backdrivability.

Abstract:
Reliable multi-object tracking (MOT) is essential for autonomous systems but remains challenging due to ambiguous object characteristics such as birth, death, and motion models, as well as detector errors including false detections and missed objects. Random finite set (RFS) theory provides a rigorous mathematical foundation that enables the formulation of fundamental uncertainties in object estimation under the Bayesian framework. We propose MOTCues, a MOT algorithm built on the RFS-based Poisson multi-Bernoulli filter, which integrates informative components derived from point cloud cues into the estimator as a tailored formulation. The object birth intensity function is modeled with a Gaussian mixture distribution for effective initialization of new-born objects, while object shape information is captured by constructing bounding box-centric descriptors to enhance hypothesis management. Evaluations on the KITTI dataset and the nuScenes benchmark demonstrate that integrating point cloud cues improves tracking performance by reducing ID switches, achieving superior results compared to baseline model-based trackers in real-world object tracking scenarios.
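
As a rough illustration of the Gaussian mixture birth intensity mentioned above, the sketch below evaluates such an intensity at a query point. It is not the MOTCues implementation: the names `gm_intensity` and `expected_births`, the diagonal covariances, and the example numbers are assumptions made purely for illustration.

```python
import math


def gm_intensity(x, components):
    """Evaluate a Gaussian mixture intensity at point x.

    Each component is (weight, mean, variances) with a diagonal covariance
    (an assumption of this sketch, chosen to avoid matrix algebra).
    """
    total = 0.0
    for weight, mean, variances in components:
        density = 1.0
        for xi, mi, vi in zip(x, mean, variances):
            density *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
        total += weight * density
    return total


def expected_births(components):
    """The intensity integrates to the expected number of new-born objects,
    which for a Gaussian mixture is simply the sum of the weights."""
    return sum(weight for weight, _, _ in components)


# Hypothetical birth model: two likely entry regions in a 2D ground plane.
birth_components = [
    (0.10, [0.0, 0.0], [1.0, 1.0]),
    (0.05, [20.0, 5.0], [4.0, 4.0]),
]
```

Because the weights need not sum to one, the mixture encodes both where new objects tend to appear and how many are expected per time step.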

Abstract:
Dexterous hand teleoperation is becoming increasingly common, yet existing methods rarely provide both efficiency and convenience. The core challenge is to achieve motion retargeting from the human hand to a dexterous hand. To address this, we introduce TransDexNet, an end-to-end vision-based motion retargeting architecture for dexterous hands. Equipped with a Vision Transformer backbone, it takes a single RGB image of a human hand and directly regresses the joint angles of a dexterous hand without any intermediate pose estimation. The architecture employs dual branches bridged by an alignment layer to close the gaps in degrees of freedom (DoFs), geometry, and kinematics between the human and dexterous hands, enabling domain-invariant latent features. To train TransDexNet, we built a dataset named TransDexData, consisting of 91,000 RGB images of human hands paired with the corresponding dexterous hand RGB images and joint angles. In evaluation, the proposed network achieves an average joint angle error of 0.076 rad. Both simulation and real-world experiments demonstrate accurate and efficient performance.

Abstract:
Grasping objects in cluttered environments remains a fundamental yet challenging problem in robotic manipulation. While prior works have explored learning-based synergies between pushing and grasping for two-fingered grippers, few have leveraged the high degrees of freedom (DoF) in dexterous hands to perform efficient singulation for grasping in cluttered settings. In this work, we introduce DexSinGrasp, a unified policy for dexterous object singulation and grasping. DexSinGrasp enables high-dexterity object singulation to facilitate grasping, significantly improving efficiency and effectiveness in cluttered environments. We incorporate clutter arrangement curriculum learning to enhance success rates and generalization across diverse clutter conditions, while policy distillation enables a deployable vision-based grasping strategy. To evaluate our approach, we introduce a set of cluttered grasping tasks with varying object arrangements and occlusion levels. Experimental results show that our method outperforms baselines in both efficiency and grasping success rate, particularly in dense clutter. Codes, appendix, and videos are available on our website https://nus-lins-lab.github.io/dexsingweb/.

Abstract:
Ensuring safety in robotic manipulation is increasingly critical as robots become integrated into human-shared environments for complex physical interaction tasks. This paper presents an energy-aware control framework that combines active responses with passive compliance for safety-critical robotic manipulation. Specifically, Control Barrier Functions (CBFs) are employed for active collision avoidance with detected obstacles, which are then integrated with fallback safety actions to resolve potential violation of CBF constraints. Complementing this active safety paradigm, a passive safety paradigm is implemented to mitigate post-collision impacts by monitoring energy variance and limiting power exchanges. Furthermore, an energy tank is incorporated to enforce passivity of the robot, which is crucial to address potential instability issues in variable impedance control. To make the tank adaptive to varying energy requirements arising from dynamic environments and unpredictable events, we propose a novel, task-agnostic tank recharging condition without compromising the system's passivity guarantee. The effectiveness of the proposed control framework is validated through experiments on a KUKA iiwa 14 robot.
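
To make the active-safety idea concrete, here is a minimal sketch of a Control Barrier Function safety filter for a single-integrator point robot avoiding one circular obstacle. It is an illustrative assumption, not the paper's controller: the function name `cbf_filter`, the barrier h(x) = ||x - c||^2 - r^2, the gain `alpha`, and the closed-form projection for a single constraint are all chosen for this example, and the fallback actions and energy tank described above are not modeled.

```python
def cbf_filter(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that dh/dt >= -alpha * h(x).

    Dynamics assumed: x_dot = u (single integrator).
    Barrier assumed:  h(x) = ||x - center||^2 - radius^2 (h >= 0 is safe).
    The constraint is a . u >= b with a = 2 (x - center), b = -alpha * h.
    """
    diff = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in diff) - radius ** 2
    a = [2.0 * d for d in diff]
    b = -alpha * h
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) - b
    if slack >= 0.0:
        return list(u_nom)  # nominal command already satisfies the constraint
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        return [0.0] * len(u_nom)  # degenerate case: robot at obstacle center
    # Closed-form solution of min ||u - u_nom||^2 subject to a . u >= b.
    scale = -slack / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]
```

With several obstacles or input limits the projection no longer has this simple closed form and a quadratic program is typically solved instead, which is also where constraint violations can arise and a fallback action becomes relevant.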

Abstract:
For robots with low rigidity, determining the robot's state based solely on kinematics is challenging. This is particularly crucial for a robot whose entire body is in contact with the environment, as accurate state estimation is essential for environmental interaction. We propose a method for simultaneous articulated robot posture estimation and environmental mapping by integrating data from proximity sensors distributed over the whole body. Our method extends the discrete-time model, typically used for state estimation, to the spatial direction of the articulated structure. The simulations demonstrate that this approach significantly reduces estimation errors.

Abstract:
Coral aquaculture for reef restoration requires accurate and continuous spawn counting for resource distribution and larval health monitoring, but current methods are labor-intensive and represent a critical bottleneck in the coral production pipeline. We propose the Coral Spawn and Larvae Imaging Camera System (CSLICS), which uses low cost modular cameras and object detectors trained using human-in-the-loop labeling approaches for automated spawn counting in larval rearing tanks. This paper details the system engineering, dataset collection, and computer vision techniques to detect, classify and count coral spawn. Experimental results from mass spawning events demonstrate an F1 score of 82.4% for surface spawn detection at different embryogenesis stages, 65.3% F1 score for sub-surface spawn detection, and a saving of 5,720 hours of labor per spawning event compared to manual sampling methods at the same frequency. Comparison of manual counts with CSLICS monitoring during a mass coral spawning event on the Great Barrier Reef demonstrates CSLICS' accurate measurement of fertilization success and sub-surface spawn counts. These findings enhance the coral aquaculture process and enable upscaling of coral reef restoration efforts to address climate change threats facing ecosystems like the Great Barrier Reef.

Abstract:
The rapid evolution of retail robotics is set to transform in-store operations through advanced automation, spanning vision-based inventory tracking, order picking, packing, and restocking. Yet fine-grained product identification remains a bottleneck: assortments change, packaging evolves, and shelves host thousands of near-duplicates, requiring perception systems that can adapt quickly with minimal setup. This paper targets that gap with two contributions. First, we present a semi-automated, robot-assisted acquisition pipeline that records 3D scene ground truth via iterative placement, projecting it into each image, yielding dense, low-cost annotations at scale. Second, we extend IPA-3D1K with challenging real shelf scenes containing 130 near-duplicate SKUs. While scenes are not paired one-to-one, the same product set appears across synthetic and real images, enabling controlled, object-level sim/real analyses under occlusion, rearrangement, and lighting variation. Using frozen DINOv3 features, our baseline recognition pipeline allows index updates in minutes. We evaluate training-free or fast approaches (kNN and a lightweight classifier head) to assess the capabilities and limitations of this representation in fine-grained retail identification. Experiments show that on the FineGrainedOCR dataset the lightweight head improves over kNN by ~11 percentage points, narrowing the gap to fully trained models to 1.9-5.3 pp. On IPA-3D1K (1,000 SKUs), synthetic-scene retrieval is strong (Top-1 90%, Top-2 95%), while exact disambiguation among near-duplicates remains challenging. We find that confidence thresholds enable targeted triage during inference, and a neighborhood-based risk signal predicts confusion during training, indicating where specialized modules are most beneficial.
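
The kNN-with-triage idea can be sketched as follows. This is a toy under stated assumptions, not the paper's pipeline: the two-dimensional vectors stand in for frozen backbone features, and `knn_predict`, `min_conf`, and the vote-fraction confidence are hypothetical names and choices made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def knn_predict(index, query, k=3, min_conf=0.6):
    """Return (label or None, confidence) from a majority vote over the k
    most similar index entries. None means "send to triage"."""
    if not index:
        raise ValueError("index is empty")
    nearest = sorted(index, key=lambda item: cosine(item[0], query), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    confidence = count / len(nearest)
    return (label if confidence >= min_conf else None), confidence


# Hypothetical index of (feature, SKU label) pairs. Adding a product is just
# an append, which is what makes index updates fast without retraining.
index = [
    ([1.0, 0.0], "cola"),
    ([0.9, 0.2], "cola"),
    ([0.0, 1.0], "cola-diet"),
    ([0.2, 0.9], "cola-diet"),
]
```

Raising `min_conf` trades coverage for precision: more queries are deferred, but the ones that are answered rest on a stronger neighborhood agreement.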

Abstract:
We present TANGO (Tensor ANd Graph Optimization), a novel motion planning framework that integrates tensor-based compression with structured graph optimization to enable efficient and scalable trajectory generation. While optimization-based planners such as the Graph of Convex Sets (GCS) offer powerful tools for generating smooth, optimal trajectories, they typically rely on a predefined convex characterization of the high-dimensional configuration space, a requirement that is often intractable for general robotic tasks. TANGO builds further by using Tensor Train decomposition to approximate the feasible configuration space in a compressed form, enabling rapid discovery and estimation of task-relevant regions. These regions are then embedded into a GCS-like structure, allowing for geometry-aware motion planning that respects both system constraints and environmental complexity. By coupling tensor-based compression with structured graph reasoning, TANGO enables efficient, geometry-aware motion planning and lays the groundwork for more expressive and scalable representations of configuration space in future robotic systems. Rigorous simulation studies on planar and real robots reinforce our claims of effective compression and higher quality trajectories.

Abstract:
Accurate state estimation is critical for legged and aerial robots operating in dynamic, uncertain environments. A key challenge lies in specifying process and measurement noise covariances, which are typically unknown or manually tuned. In this work, we introduce a bi-level optimization framework that jointly calibrates covariance matrices and kinematic parameters in an estimator-in-the-loop manner. The upper level treats noise covariances and model parameters as optimization variables, while the lower level executes a full-information estimator. Differentiating through the estimator allows direct optimization of trajectory-level objectives, resulting in accurate and consistent state estimates. We validate our approach on quadrupedal and humanoid robots, demonstrating significantly improved estimation accuracy and uncertainty calibration compared to hand-tuned baselines. Our method unifies state estimation, sensor, and kinematics calibration into a principled, data-driven framework applicable across diverse robotic platforms.

Abstract:
Fast contact detection is crucial for safe human-robot collaboration. Observers based on proprioceptive information can be used for contact detection but have first-order error dynamics, which results in delays. Sensor fusion based on inertial measurement units (IMUs) consisting of accelerometers and gyroscopes is advantageous for reducing delays. The acceleration estimation enables the direct calculation of external forces. For serial robots, the installation of multiple accelerometers and gyroscopes is required for dynamics modeling since the joint coordinates are the minimal coordinates. Alternatively, parallel robots (PRs) offer the potential to use only one IMU on the end-effector platform, which already presents the minimal coordinates of the PR. This work introduces a sensor-fusion method for contact detection using encoders and only one low-cost, consumer-grade IMU for a PR. The end-effector accelerations are estimated by an extended Kalman filter and incorporated into the dynamics to calculate external forces. In real-world experiments with a planar PR, we demonstrate that this approach reduces the detection duration by up to 50% compared to a momentum observer and enables collision and clamping detection within 3-39 ms.
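
The delay argument can be illustrated on a 1-DoF point mass. The sketch below is an assumption-laden toy, not the paper's parallel-robot model or its extended Kalman filter: `momentum_observer_residuals` and `force_from_acceleration` are hypothetical names, the dynamics are assumed to be mass * accel = tau + f_ext, and the acceleration is taken as perfectly known.

```python
def momentum_observer_residuals(f_ext, steps, gain=50.0, dt=0.001, mass=2.0, tau=1.0):
    """Simulate a generalized-momentum observer for a constant external force.

    The residual r obeys r_dot = gain * (f_ext - r): a first-order lag with
    time constant 1 / gain, so it only gradually approaches f_ext.
    mass is kept for symmetry with force_from_acceleration; the momentum
    balance p_dot = tau + f_ext does not depend on it.
    """
    p = 0.0         # true momentum (mass * velocity), from encoders in practice
    integral = 0.0  # observer's running integral of (tau + r)
    r = 0.0
    residuals = []
    for _ in range(steps):
        p += dt * (tau + f_ext)
        integral += dt * (tau + r)
        r = gain * (p - integral)
        residuals.append(r)
    return residuals


def force_from_acceleration(accel, tau, mass=2.0):
    """With an acceleration estimate, the external force follows directly
    from the dynamics, with no observer lag."""
    return mass * accel - tau
```

A higher observer gain shortens the lag but amplifies measurement noise, which is the trade-off that motivates using an acceleration estimate instead.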

Abstract:
Vision-Language-Action (VLA) models trained via imitation learning have achieved impressive results on robotic manipulation, yet their performance degrades significantly on complex, multi-step tasks. We evaluate NVIDIA GR00T N1.6, a state-of-the-art cross-embodiment VLA model (~1.09B parameters), on the SimplerEnv Fractal benchmark to systematically identify where imitation learning falls short. We conduct closed-loop evaluation across six manipulation tasks of increasing complexity using a Google Robot embodiment, with 200 episodes per task. Our results reveal a stark performance gap: simple tasks such as picking a can achieve 90.0% success rate, while complex sequential tasks such as placing an object in a closed drawer achieve only 4.5%. Average episode time further confirms this: simple tasks complete in under 3 seconds, while complex tasks approach the maximum step timeout, indicating the policy fails to make meaningful progress. We identify three failure modes driving this degradation: absence of recovery behaviors, compounding distribution shift over long horizons, and inability to optimize trajectory quality beyond mimicking demonstrations. Based on these findings, we propose Human Preference Optimization (HPO) as a post-training strategy leveraging human trajectory rankings and reinforcement learning to refine VLA policies beyond what demonstration data alone can teach.

Abstract:
Robotic fish have attracted growing attention in recent years owing to their biomimetic design and potential applications in environmental monitoring and biological surveys. Among robotic fish employing the Body-Caudal Fin (BCF) locomotion pattern, motor-driven actuation is widely adopted. Some approaches utilize multiple servo motors to achieve precise body curvature control, while others employ a brushless motor to drive the tail via wire or rod, enabling higher oscillation and swimming speeds. However, the former approaches typically result in limited swimming speed, whereas the latter suffer from poor maneuverability, with few capable of smooth turning. To address this trade-off, we develop a wire-driven robotic fish equipped with a 2-degree-of-freedom (DoF) crank-slider mechanism that decouples propulsion from steering, enabling both high swimming speed and agile maneuvering. In this paper, we first present the design of the robotic fish, including the elastic skeleton, waterproof structure, and the actuation mechanism that realizes the decoupling. We then establish the actuation modeling and body dynamics to analyze the locomotion behavior. Furthermore, we propose a combined feedforward-feedback control strategy to achieve independent regulation of propulsion and steering. Finally, we validate the feasibility of the design, modeling, and control through a series of prototype experiments, demonstrating swimming, turning, and directional control.

Abstract:
As autonomous mobile robots begin to populate public spaces, it is becoming increasingly important for robots to accurately distinguish pedestrians and navigate safely to avoid collisions. Texting while walking is a common but hazardous behavior among pedestrians that poses significant challenges for robot navigation systems. While several studies have addressed the detection of text walkers, many have overlooked the impact of occlusions, a very common phenomenon where parts of pedestrians are obscured from the sensors' view. This study proposes a machine learning method that distinguishes text walkers from other pedestrians in video data. The proposed method processes each video frame to extract body keypoints, encodes the keypoints into a latent space, and classifies pedestrian activities into three categories: normal walking, texting while walking, and other activities. A variational autoencoder is incorporated to enhance the system's robustness under various occlusion scenarios. Performance tests in real-world environments identified potential areas for improvement, particularly in distinguishing pedestrian activities with similar body postures. However, ablation studies demonstrated that the proposed system performs reliably across different occlusion scenarios.

Abstract:
Intuitive human-robot interfaces are essential to increase usability and personalization in wearable robotic assistive technologies. However, most current systems rely on pre-programmed or sensor-driven strategies that offer limited active user control online. To address this limitation, we present a voice-driven control framework for a soft hip exosuit, enabling on-demand modulation of assistance and resistance via short spoken commands. The system combines a fully embedded transformer-based automatic speech recognition model (Whisper) with a gait-phase estimator to synchronize actuation with the user's motion. Users can switch between assistive and resistive modes and select discrete gain levels (low, medium, high). Experiments with six healthy participants demonstrate high recognition accuracy (95-100%) and low latency. Metabolic measurements show that assistive commands reduced walking energy cost by 20.9±4.8% (LOW) and 9.7±5.5% (MEDIUM) relative to baseline, while resistive commands increased cost by 13.1±3.5% (MEDIUM) and 14.9±5.1% (HIGH). These results highlight the feasibility of intuitive, voice-driven modulation in wearable robotics.

Abstract:
Bio-inspired swimming vehicles are increasingly being developed to understand the locomotion strategies of aquatic animals and to expand the performance envelope of engineered systems. However, the increasing complexity of these multi-segmented vehicles makes it challenging to understand and optimize their performance. Accurate numerical models of these systems can provide a pathway forward, but they depend critically on reliable estimation of hydrodynamic coefficients. Traditional approaches to estimating these coefficients, such as tow-tank testing, can be costly and often impractical. In this work, a numerical model of a bio-robotic sea lion was developed and validated, in which hydrodynamic coefficients critical for estimating fluid forces were first obtained through computational fluid dynamics (CFD) simulations and analytical methods such as strip theory. These coefficients were then refined using a genetic algorithm to improve agreement with experimental trials of the robot. This hybrid framework bridges the gap between simulation and reality, enabling accurate force estimation across different body segments. Validation experiments showed a close alignment between the numerical model and the physical robot's performance in position and orientation during various trials. The validated model could enable large-scale parametric studies to evaluate the effectiveness of different control surfaces, optimize gaits, and explore control strategies without extensive prototyping of the bio-robotic platform. Beyond design and analysis, the model can also provide a high-fidelity environment for the application of reinforcement learning, supporting the development of adaptive controllers and advancing bio-inspired robots toward autonomous operation.

Abstract:
Polar coordinates are widely used in segmentation tasks for range sensors such as LiDAR and radar, owing to their ability to naturally align with point cloud sparsity and distribution. However, their use in detection is limited by feature distortion. Existing polar-based detection works focused on undistorting features from the polar coordinates back to canonical Cartesian representations, but their results remain unsuccessful. In this work, we propose fully polar coordinate object detection, performing training and evaluation entirely in polar coordinates without relying on Cartesian metrics. To achieve this, we design a constraint-based polar bounding box representation that enables the direct conversion of Cartesian bounding boxes via a constrained minimum bounding rectangle (MBR). Using the state-of-the-art polar-based detector as our baseline, we conduct experiments on the Boreas dataset. The results demonstrate that our approach improves the LiDAR detection AP30 metric by 2.88%, and yields a 2.17% gain over Cartesian-based detection methods. On more challenging scanning radar detection experiments, our method achieves a 13.11% improvement in AP30 compared to Cartesian-based detection methods. These findings validate the feasibility of fully polar coordinate object detection and demonstrate its robustness and generalizability across multiple range sensor modalities.
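
One simple way to picture a polar bounding rectangle is to enclose a rotated Cartesian box in range and azimuth bounds. The sketch below does exactly that and nothing more: it is not the paper's constrained MBR, whose constraints are not specified here, and `polar_mbr` with its helpers is a hypothetical construction that assumes the sensor origin lies outside the box.

```python
import math


def box_corners(cx, cy, length, width, yaw):
    """Corners of a rotated rectangle centered at (cx, cy)."""
    c, s = math.cos(yaw), math.sin(yaw)
    half = [(length / 2, width / 2), (-length / 2, width / 2),
            (-length / 2, -width / 2), (length / 2, -width / 2)]
    return [(cx + c * dx - s * dy, cy + s * dx + c * dy) for dx, dy in half]


def origin_to_segment(p, q):
    """Distance from the origin to the segment p-q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    denom = dx * dx + dy * dy
    t = 0.0 if denom == 0.0 else max(0.0, min(1.0, -(p[0] * dx + p[1] * dy) / denom))
    return math.hypot(p[0] + t * dx, p[1] + t * dy)


def wrap(angle):
    """Wrap an angle to [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def polar_mbr(cx, cy, length, width, yaw):
    """Return (r_min, r_max, az_min, az_max) enclosing the box.

    The nearest point may lie on an edge rather than a corner, so r_min uses
    edge distances. Azimuths are measured relative to the box center so that
    boxes straddling the +/-pi cut are handled consistently.
    """
    corners = box_corners(cx, cy, length, width, yaw)
    r_max = max(math.hypot(x, y) for x, y in corners)
    r_min = min(origin_to_segment(corners[i], corners[(i + 1) % 4]) for i in range(4))
    center_az = math.atan2(cy, cx)
    offsets = [wrap(math.atan2(y, x) - center_az) for x, y in corners]
    return r_min, r_max, center_az + min(offsets), center_az + max(offsets)
```

The enclosing polar rectangle is looser than the Cartesian box, especially at close range, which hints at why additional constraints on the conversion are useful.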

Abstract:
Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios. Code is available at https://github.com/DengKaiCQ/VGGT-Long.
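
The chunking step can be pictured with a small helper that splits a long frame sequence into overlapping windows. This is only a sketch of the general idea: `make_chunks`, `chunk_size`, and `overlap` are hypothetical names, not identifiers from the VGGT-Long repository, and the alignment and loop closure stages are not shown.

```python
def make_chunks(num_frames, chunk_size, overlap):
    """Split frame indices [0, num_frames) into half-open (start, end) chunks.

    Consecutive chunks share `overlap` frames; the shared frames are what a
    later alignment step could use to bring chunks into a common frame.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if num_frames <= 0:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break
        start += step
    return chunks


print(make_chunks(10, 4, 1))  # [(0, 4), (3, 7), (6, 10)]
```

Peak memory is then bounded by `chunk_size` rather than by the sequence length, at the cost of processing the overlapping frames twice.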

Abstract:
Soft robots exhibit compliance but lack load support and pose retention, while rigid robots provide structural capacity but sacrifice adaptability. Existing variable-stiffness approaches operate at segment or patch scales, preventing precise spatial control over stiffness distribution and virtual joint placement. This paper presents the Variable Stiffness Lattice Skin (VSL-Skin), the first system enabling individually addressable voxel-level morphological control with millimeter-scale precision. The system achieves three unprecedented capabilities: nearly two orders of magnitude stiffness modulation across axial 15-1200 N/mm, shear 45-850 N/mm, bending 8×10^2-3×10^4 N/deg, and torsional modes with millimeter-scale spatial control; the first demonstrated 30% axial compression in phase-change systems while maintaining structural integrity; and autonomous component-level self-repair through thermal cycling that eliminates fatigue accumulation and enables programmable sacrificial joints for predictable failure management. Selective voxel activation creates six canonical virtual joint types with programmable compliance while preserving structural integrity in non-activated regions. The platform incorporates closed-form design models and finite element analysis for predictive synthesis of stiffness patterns and joint placement. Experimental validation demonstrates 30% axial contraction, thermal switching in 30-45 second cycles, and cut-to-fit integration that preserves addressability after trimming. The row-column architecture enables platform-agnostic deployment across diverse robotic systems without specialized infrastructure. This framework establishes morphological intelligence as an engineerable system property, fundamentally advancing autonomous reconfigurable robotics.

Abstract:
Advancing Multi-Agent Pathfinding (MAPF) and Multi-Robot Motion Planning (MRMP) requires platforms that enable transparent, reproducible comparisons across modeling choices. Existing tools either scale under simplifying assumptions (grids, homogeneous agents) or offer higher fidelity with less comparable instrumentation. We present GRACE, a unified 2D simulator+benchmark that instantiates the same task at multiple abstraction levels (grid, roadmap, continuous) via explicit, reproducible operators and a common evaluation protocol. Our empirical results on public maps and representative planners enable commensurate comparisons on a shared instance set. Furthermore, we quantify the expected representation-fidelity trade-offs (MRMP solves instances at higher fidelity but lower speed, while grid/roadmap planners scale farther). By consolidating representation, execution, and evaluation, GRACE thereby aims to make cross-representation studies more comparable and provides a means to advance multi-robot planning research and its translation to practice.

Abstract:
Twisted String Actuators (TSAs) are promising alternatives to conventional gear-based transmissions due to their high reduction ratios and compact form factors. However, practical limitations such as nonlinear hysteresis, limited stroke, and inherently unidirectional motion hinder their deployment in robotic systems. In this work, we propose a novel bidirectional TSA mechanism that addresses all three limitations simultaneously through an antagonistic configuration, asymmetric axis shift (AAS), and pre-tension tuning. This mechanism enables reliable bidirectional actuation by compensating for asymmetric contraction-extension behavior, suppresses hysteresis via adaptive tensioning, and extends the effective stroke. We implement the proposed design in a continuum finger module and derive a compact kinematic model for control. Extensive experiments validate the effectiveness of the approach, demonstrating the attenuation of the hysteresis, accurate bidirectional bending control across a wide range (±180°), and the feasibility of integration into multi-finger grippers for dexterous manipulation. The results suggest that the proposed actuator design serves as a practical and scalable solution for compact robotic systems requiring precise and reversible motion.

Abstract:
Aquaculture is a marine industry experiencing significant growth and an important seafood provider. Underwater vehicles such as remotely operated vehicles (ROVs) are commonly used for inspection and maintenance of the net pens where the fish are grown. These net pens are flexible structures whose position and shape change with ocean currents and waves. Any autonomous robotic operation in aquaculture is, therefore, challenging as the net pen position and shape cannot be predetermined and it is imperative that the robot does not collide with and damage the net. This article addresses this issue by proposing a novel method to estimate the full shape of aquaculture net pens in real time using an underwater vehicle equipped with a forward-looking Doppler velocity log (DVL). The method introduces a new concept for how sparse measurement data on the net pen can be fused with numerical models of the full net pen that contrasts with other models in the literature by requiring neither instrumentation on the net pen nor knowledge of ocean current conditions. The estimator output is then used in closed-loop vehicle control by planning and following paths relative to the estimated pen shape. The method is tested in simulations, which show a root-mean-square error (RMSE) of 0.5 m for the estimate of the entire net pen structure and centimeter-level estimation error of the distance between the vehicle and net, and in full-scale trials in an industrial fish farm, where an ROV autonomously navigated a

Abstract:
Self-esteem plays a crucial role in children's psychological well-being and social development. However, traditional interventions cannot provide consistent and engaging support. Recently, game-based learning has shown promise in fostering self-reliance and social confidence. Notably, socially supportive robots, offering consistent, adaptive, and peer-like reinforcement, have emerged as potential tools for enhancing children's self-esteem. Nevertheless, their effectiveness in improving self-esteem remains unexplored. In this study, we investigated the role of a socially supportive virtual robot in boosting children's self-esteem, social engagement, and motivation through game-based interactions. Specifically, we examined whether positive reinforcement from the robot influenced children's global and social self-esteem, the quality and quantity of their friendships, and sustained engagement with the game. Twenty-three children in India participated in a video game with and without the virtual robot across three 30-min sessions over a month. Results indicated that children who interacted with the virtual robot showed significant improvement in global self-esteem, enhanced quantity and quality of friendships, and sustained interest and enjoyment in the task. However, no considerable change was observed in social self-esteem between the experimental and control conditions. These findings provide valuable insights into the potential of robot-mediated interventions for boosting children's self-esteem and social engagement.

Abstract:
Forward-looking sonar (FLS) enables long-range underwater sensing. In FLS-based 3-D reconstruction, falsely inclined surfaces arise from elevation ambiguity caused by the finite vertical beamwidth. Existing approaches mitigate these errors using multi-pass strategies, but they require repeated observations, which are often impractical in real-world underwater operations. To address this, we propose a pattern-informed geometric refinement framework that leverages structural patterns from profiling sonar (PS) to resolve ambiguity in FLS-based reconstruction. Within this framework, geometric patterns within ambiguity-dominated intervals are analyzed to distinguish between physically valid surfaces and falsely inclined surfaces, and selective geometric refinement is applied accordingly. Experimental results demonstrate effective suppression of falsely inclined surfaces and improved reconstruction accuracy without trajectory modifications. This provides a practical solution for reliable 3-D mapping and perception in underwater robotic applications.

Abstract:
Robots operating alongside humans must recognize what they do not know before acting, diagnose problems from domain knowledge, and reason about action consequences. These capabilities are operational requirements, not optimization targets, and their absence produces silent and unrecoverable failures. We present a first-of-its-kind controlled comparison between OntoAgent, our content-centric cognitive architecture, and six LLMs spanning frontier and efficient tiers as drop-in replacements at the strategic layer of the same robotic system in HARMONIC. LLMs fail to verify their knowledge state before acting, even when given equivalent procedural knowledge. The deficit is architectural, not knowledge-based. Knowledge-grounded architectures must retain decision authority; LLMs contribute where their strengths apply.

Abstract:
This paper addresses the problem of autonomous UAV search missions, where a UAV must locate specific Entities of Interest (EOIs) within a time limit, based on brief descriptions in large, hazard-prone environments with keep-out zones. The UAV must perceive, reason, and make decisions with limited and uncertain information. We propose NEUSIS, a compositional neuro-symbolic system designed for effective UAV search and navigation in realistic scenarios. NEUSIS integrates neuro-symbolic visual perception, reasoning, and grounding (GRiD) to process raw sensory inputs, maintains a probabilistic world model for environment representation, and uses a hierarchical planning component (SNaC) for efficient path planning. Experimental results from simulated urban search missions using AirSim and Unreal Engine show that NEUSIS outperforms state-of-the-art baselines for both perception and planning. These results demonstrate the effectiveness of our compositional neuro-symbolic approach in handling complex scenarios, making it a promising solution for autonomous UAV systems in search missions.

Abstract:
This paper introduces an artificial intelligence-based landing zone detection module (LZDM) for vertical take-off and landing (VTOL) navigation. It employs a projection-based point cloud semantic segmentation (PCSS) convolutional neural network model combined with point cloud accumulation and a range image generation module. The proposed method addresses the limitations of existing projection-based PCSS methods, which often struggle with low-resolution and non-repetitive scan raw light detection and ranging (LiDAR) data commonly found in aerial datasets. The proposed LZDM was developed using three sets of aerial datasets collected from a DJI M600 hexacopter drone, a DJI M300 RTK quadrotor, and a Bell412 helicopter. The results were evaluated using both qualitative and quantitative metrics, demonstrating its robustness and effectiveness. In terms of quantitative results, the proposed method achieved mean intersection over union and accuracy values greater than 0.93 and 98 percent, respectively, across all three datasets, highlighting its accuracy in identifying safe landing zones (LZs). To assess the real-time feasibility of the proposed LZDM, it was deployed on a reconfigurable hardware-accelerated module. This setup achieved processing rates higher than 10 Hz for all three datasets and a throughput of over 5 million pts/s on the Jetson AGX Xavier dedicated hardware combined with the PyTorch TensorRT optimization module.

Abstract:
The ability to localise teams of robots is essential for applications ranging from robotic fleets in unstructured environments to cooperative control and navigation tasks. In such contexts, fixed infrastructure is often unavailable, deployments must be fast and flexible, and system requirements must be minimal. We present a decentralised cooperative localisation algorithm that addresses all these challenges at once. The method is anchor-less, fully decentralised, and, unlike most existing approaches, does not require controlling the robots' motion to ensure team observability. It relies only on local odometry, sparse inter-agent ranging measurements, and short-range communication, all of which are widely available in practice. The algorithm adopts a multi-hypothesis Bayesian framework that maintains the entire set of feasible solutions, ensuring robustness under transient unobservable conditions. Moreover, through information sharing, each agent benefits from the estimates of the entire group, even in partially connected conditions.

Abstract:
Object recognition and grasping position detection are critical tasks in robotic manipulation, particularly when operating in dynamic and unstructured environments. This paper presents the Channel Sharpening Attention-based Adaptive Inception Network (CSA-AInceptNet), a novel multi-task learning model designed for these tasks using event camera data. The proposed architecture integrates channel sharpening attention with adaptive inception networks to enhance feature extraction and improve robustness. The model's performance is evaluated on two state-of-the-art event camera datasets, E-Grasp and Neuro-Grasp. On the E-Grasp dataset, CSA-AIncepNet achieves a remarkable accuracy of 99.47% and a mean Intersection over Union (IoU) of 0.9370, significantly surpassing existing methods. On the Neuro-Grasp dataset, leveraging transfer learning, the model attains 98.58% accuracy and a mean IoU of 0.4897, demonstrating strong generalization capabilities across datasets. Comparative analyses and ablation studies further validate the effectiveness of the proposed architecture, highlighting its superiority over conventional models like ConvNeXt, DarkNet, DenseNet, and VGG16. The results establish CSA-AIncepNet as a robust solution for event-based object recognition and grasping detection, paving the way for advancements in human-robot collaboration and dynamic robotic manipulation.

Abstract:
This letter focuses on robotic harvesting of delicate crops such as table grapes, featuring selective harvesting based on individual product properties. The robot detects grape bunches and estimates their positions and quality attributes. However, sensor limitations and occlusions affect data completeness and accuracy, reducing the cost-effectiveness of automated harvesting systems. Determining in real-time the optimal harvesting order in the presence of uncertainty is therefore important for enhancing efficiency and grape quality for growers and consumers. This task is challenging not only due to data uncertainty, but also due to the need to consider factors such as obstructive low-quality bunches. Existing literature often resorts to sub-optimal approaches such as selecting the first available crop. In contrast, we propose (i) a mapping and tracking method based on multiple viewpoints to enhance bunch information quality and (ii) a decision-making algorithm in a decision-tree with a recursive structure based on a constructed reachability graph derived from the map to optimize harvested quality and execution time sequentially.

Abstract:
Knowledge of instrument contact forces can lead to safer medical interventions. We present a formulation of the frequently used Cosserat rod model that is linearized around the measured shape for efficient model-based contact force estimation in flexible instruments and robots. Validation on instrument deflection in an endovascular use case resulted in an average force estimation error of only 14%.

Abstract:
Conventional freehand ultrasound (US) imaging is highly dependent on the skill of the operator, leading to inconsistent results and increased physical burden on sonographers. Robotic Ultrasound Systems (RUSS) aim to address these limitations by providing standardized and automated imaging solutions, especially in environments with limited access to skilled operators. This paper presents the development of a RUSS system that employs a novel end-effector, A-SEE2.0, which uses dual RGB-D depth cameras to maintain the US probe normal to the skin surface, a default starting configuration for anatomical landmarks identification. Our RUSS integrates RGB-D camera data with robotic control algorithms to maintain orthogonal probe alignment on uneven surfaces without preoperative data. Validation tests using a phantom model show that the system achieves robust normal positioning accuracy. A-SEE2.0 demonstrates 2.47 ± 1.25 degrees normal positioning error on a flat surface and 12.19 ± 5.81 degrees error on a mannequin surface. This work highlights the clinical potential of A-SEE2.0 by demonstrating that, during in-vivo forearm ultrasound examinations, it achieves image quality comparable to manual scanning by a human sonographer.

Abstract:
Retrieving ground robots from dangerous environments after their operation is a challenging task that poses risks for the personnel. Researchers often employ drones for retrieval, which makes operations safer. However, this setup requires an accurate method that guarantees drone and ground robot alignment due to inaccuracies in standard GPS devices, drone drifts, and wind gusts. Hence, this research article introduces simultaneous object detection and tilt correction as part of visual servoing to achieve precise drone-rover alignment. Drone detection using YOLOv8 and a tilt correction algorithm was integrated for the proposed visual servo of the ground robot. The study collected 3024 images as a data set for drone detection. The experimental results show that the trained instance segmentation model detected and captured drone objects. The study conducted an initial test for visual servo control of the ground robot in various surface terrains, with the maximum alignment error occurring on rough surfaces. Furthermore, the study conducted a real drone-ground robot alignment test in an outdoor field setting. The alignment between the drone and the ground robot produced a maximum alignment error of 20.3 cm, below the threshold error. The open field experiments verified the effectiveness of the ground robot's visual servo control with an actual drone operation.

Abstract:
This letter describes the manufacturing and experimental characterization of novel stretchable strain sensors for continuum robots. The overarching goal of this research is to provide a new solution for the shape sensing of these devices. The sensors are fabricated via direct ink writing, an extrusion-based additive manufacturing technique. Electrically conductive material (i.e., the ink) is printed into traces whose electrical resistance varies in response to mechanical deformation. The principle of operation of stretchable strain sensors is analogous to that of conventional strain gauges, but with a significantly larger operational window thanks to their ability to withstand larger strain. Among the different conductive materials considered for this study, we opted to fabricate the sensors with a high-viscosity eutectic Gallium-Indium ink, which in initial testing exhibited high linearity (R² of approximately 0.99), a gauge factor of approximately 1, and negligible drift. Benefits of the proposed sensors include (i) ease of fabrication, as they can be conveniently printed in a matter of minutes; (ii) ease of installation, as they can simply be glued to the outside body of a robot; and (iii) ease of miniaturization, which enables integration into millimeter-sized continuum robots.
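
For reference, the gauge factor mentioned above is conventionally defined as the relative change in electrical resistance divided by the applied strain. The symbols below ($R_0$ for the unstrained resistance, $\Delta R$ for its change, $L_0$ and $\Delta L$ for the unstrained length and its change, $\varepsilon$ for strain) are notation introduced here for illustration and are not taken from the letter.

```latex
\[
\mathrm{GF} = \frac{\Delta R / R_0}{\varepsilon},
\qquad
\varepsilon = \frac{\Delta L}{L_0}
\]
```

Under this definition, a gauge factor of 1 would mean that the relative resistance change equals the strain itself.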

Abstract:
Locating and grasping of objects by robots is typically performed using visual sensors. Haptic feedback from contacts with the environment is only secondary if present at all. In this work, we explored an extreme case of searching for and grasping objects in complete absence of visual input, relying on haptic feedback only. The main novelty lies in the use of contacts over the complete surface of a robot manipulator covered with sensitive skin. The search is divided into two phases: (1) coarse workspace exploration with the complete robot surface, followed by (2) precise localization using the end-effector equipped with a force/torque sensor. We systematically evaluated this method in simulation and on the real robot, demonstrating that diverse objects can be located, grasped, and put in a basket. The overall success rate on the real robot for one object was 85.7% with failures mainly while grasping specific objects. The method using whole-body contacts is six times faster compared to a baseline that uses haptic feedback only on the end-effector. We also show locating and grasping multiple objects on the table. This method is not restricted to our specific setup and can be deployed on any platform with the ability of sensing contacts over the entire body surface. This work holds promise for diverse applications in areas with challenging visual perception (due to lighting, dust, smoke, occlusion) such as in agriculture when fruits or vegetables need to be located inside foliage and picked.

Abstract:
We propose a multi-task co-design approach to design a robot's actuation (motor sizes and gear ratios) based on trajectory optimisation. Leveraging an actuation model fit on data of series of components, we find the optimal set of design parameters for all joints over a set of representative tasks for the given robot. Critically, we close the loop towards component selection, given a finite set of available components. This enables more practical use of co-design tools. Our results show that the method is effective, and critically, show that it is possible to find a robot design that is capable of performing an entire set of tasks at an efficiency that is comparable to a robot co-designed for each specific task. Finally, we perform an extensive analysis of hyperparameter effects, and select discrete actuation components from catalogues and compare to co-design results.

Abstract:
The robotic fish has attracted widespread research interest over the past few decades, due to its outstanding agility and environmental friendliness. The ability to sense the underwater environment is crucial for the robotic fish to accomplish various underwater tasks. Inspired by the lateral line of real fish, many types of artificial lateral line (ALL) sensors have been proposed, including pressure-based sensors and deformation-based sensors. However, these types of ALL sensors mounted on robotic fish are currently susceptible to interference from the robotic fish's self-motions, such as yaw motion and pitch motion, as well as the unavoidable vortices around the robotic fish. To address the above issues, a deformation-based magnetic ALL sensor capable of flow velocity-decoupling sensing is proposed, which can be used to measure the swimming speed of the robotic fish while suppressing the aforementioned noise. In addition, an ALL array is designed and mounted on both sides of a robotic fish, enabling the measurement of its swimming speed under both rectilinear and turning motion, with a mean absolute error (MAE) of 0.0153 m/s and 0.0125 m/s, respectively. Based on this, the ALL array is applied for trajectory estimation of the robotic fish, and the MAE of trajectory estimation under rectilinear and turning motion is 0.0600 m and 0.0730 m, respectively.

Abstract:
High-order control barrier functions (HOCBFs) that can achieve strict safety guarantees are widely used in robot safety control. However, robot obstacle avoidance in narrow environments with curved surfaces, as represented by aircraft blade detection, is still a challenge. Considering the narrow space between adjacent blades, the traditional spherical barrier boundary is not suitable for flat, curved blades because it leaves insufficient operational space. Furthermore, the lengths of obstacle avoidance paths in different directions vary greatly under the overall distortion characteristics of the blades, and HOCBF lacks explicit direction guidance. To address these challenges, we first propose an accurate surface envelope method with short solution time through rotated and scaled super-ellipsoids to obtain a large operational space. Building upon this, we propose a novel force-guided high-order control barrier function (FG-HOCBF) method to guide the robot to closely adhere to the surface along the desired direction and complete detection of specific areas. The method consists of two components: surface normal approach judgment and guiding force generation in the desired direction. Finally, simulations and experiments validate the performance of the proposed method.

Abstract:
We consider the task of visually estimating the relative pose of a drone racing gate in front of a nano-quadrotor, using a convolutional neural network pre-trained on simulated data to regress the gate's pose. Due to the sim-to-real gap, the pre-trained model underperforms in the real world and must be adapted to the target domain. We propose an unsupervised domain adaptation (UDA) approach using only real image sequences collected by the drone flying an arbitrary trajectory in front of a gate; sequences are annotated in a self-supervised fashion with the drone's odometry as measured by its onboard sensors. On this dataset, a state consistency loss enforces that two images acquired at different times yield pose predictions that are consistent with the drone's odometry. Results indicate that our approach outperforms other SoA UDA approaches, has a low mean absolute error in position (x=26, y=28, z=10 cm) and orientation (psi=13 degrees), an improvement of 40% in position and 37% in orientation over a baseline. The approach's effectiveness is appreciable with as few as 10 minutes of real-world flight data and yields models with an inference time of 30.4ms (33 fps) when deployed aboard the Crazyflie 2.1 Brushless nano-drone.
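
The state consistency idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: poses are simplified to `(x, y, z, yaw)` tuples, the gate is assumed static, `odom` is assumed to be the drone's pose at the second time expressed in its frame at the first time, and the function names and the unweighted mean of position and angle errors are hypothetical choices made here.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def expected_gate_pose(gate_t1, odom):
    """Gate pose expected at time t2, given the prediction at t1 and odometry.

    gate_t1: (x, y, z, yaw) of the gate in the drone frame at t1.
    odom:    (x, y, z, yaw) of the drone at t2 in the drone frame at t1
             (assumed convention for this sketch).
    """
    gx, gy, gz, gyaw = gate_t1
    ox, oy, oz, oyaw = odom
    dx, dy = gx - ox, gy - oy
    c, s = math.cos(oyaw), math.sin(oyaw)
    # Rotate the offset by -oyaw to express it in the drone frame at t2.
    x2 = c * dx + s * dy
    y2 = -s * dx + c * dy
    return (x2, y2, gz - oz, wrap_angle(gyaw - oyaw))


def state_consistency_loss(pred_t1, pred_t2, odom):
    """Mean absolute disagreement between pred_t2 and the odometry-propagated pred_t1.

    Metres and radians are averaged without weighting, a simplification
    made only for this illustration.
    """
    expected = expected_gate_pose(pred_t1, odom)
    errors = [abs(p - e) for p, e in zip(pred_t2[:3], expected[:3])]
    errors.append(abs(wrap_angle(pred_t2[3] - expected[3])))
    return sum(errors) / len(errors)
```

In a training loop, `pred_t1` and `pred_t2` would come from the network applied to two frames of the same sequence, so the loss needs no manual labels, only odometry.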

Abstract:
Next Best View (NBV) selection is critical for achieving high-quality 3D reconstruction in unknown environments. This paper presents an active NBV selection approach tailored for Gaussian Splatting (GS), a widely adopted 3D reconstruction technique that has recently gained significant attention and been extended to Simultaneous Localization and Mapping (SLAM) systems. Existing state-of-the-art NBV methods for GS focus on minimizing uncertainties of GS parameters but often fail to prioritize views that improve geometric reconstruction quality. To address this limitation, we propose an active view selection method for GS-based reconstruction, with its core being ROI-GSurFisher. This method calculates Fisher Information on Gaussian surfels selected via a Region of Interest (ROI) mechanism. Both the use of surfels for computation and the ROI constraint enhance ROI-GSurFisher's ability to evaluate geometric information gain. We further introduce a close-front view scoring module that prioritizes viewpoints conducive to high-quality reconstruction. The final NBV is selected by maximizing the combined geometric information gain and close-front score. Experimental results on 3D reconstruction of various objects and scenes demonstrate consistent qualitative and quantitative improvements. Beyond standalone 3D reconstruction, the proposed NBV method can be integrated into SLAM systems to select fewer but more valuable keyframes. Code is available at https://github.com/WW11111/ROI-GSurFisher.

Abstract:
Offline reinforcement learning (Offline RL) provides a compelling solution for applying RL in high-risk or resource-constrained real-world domains such as healthcare, autonomous driving, and robotic manipulation. However, Offline RL faces critical challenges arising from limited data coverage and potential distributional mismatch between the pre-training dataset and real-world environment. In this paper, we propose to allow an agent to learn from a hybrid dataset of high-quality real-world data and high-diversity simulation data, and assume that the dynamics of the simulation and the real world do not match, but the state space is the same. To address the policy extrapolation error and potentially catastrophic failures caused by out-of-distribution actions and the sim-to-real gap, we use progressive neural networks (PNNs) to transfer the offline policy to the real world. Results in two robotic manipulation tasks with a six-degree-of-freedom Ned robotic arm show that the hybrid dataset facilitates faster offline learning and better adaptation to real-world tasks during online learning. In addition, further analysis shows that transferring the offline policy via PNN can not only effectively retain the policy learned from the hybrid dataset and bridge the gap between simulation and reality data, but also allow the agent to explore in a more diverse distribution of samples during online learning.

Abstract:
Accurate 3D pose estimation of drones is essential for security and surveillance systems. However, existing methods often rely on prior drone information such as physical sizes or 3D meshes. At the same time, current datasets are small-scale, limited to single models, and collected under constrained environments, which makes reliable validation of generalization difficult. We present DroneKey++, a prior-free framework that jointly performs keypoint detection, drone classification, and 3D pose estimation. The framework employs a keypoint encoder for simultaneous keypoint detection and classification, and a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. To address dataset limitations, we construct 6DroneSyn, a large-scale synthetic benchmark with over 50K images covering 7 drone models and 88 outdoor backgrounds, generated using 360-degree panoramic synthesis. Experiments show that DroneKey++ achieves MAE 17.34° and MedAE 17.1° for rotation, MAE 0.135 m and MedAE 0.242 m for translation, with inference speeds of 19.25 FPS (CPU) and 414.07 FPS (GPU), demonstrating both strong generalization across drone models and suitability for real-time applications. The dataset is available at [link].

Abstract:
Autonomous driving has advanced significantly with the integration of large Vision-Language Models (VLMs), which excel in understanding and analyzing driving data. However, existing VLMs face challenges, particularly in terms of latency, which is crucial for real-time driving tasks. While shrinking the model size can reduce latency, it also limits the model's ability to handle both regular and corner cases effectively. To address this challenge, we propose the Curriculum Learning-based Knowledge Distillation (CLKD) framework. CLKD enhances student model performance through three key innovations: (1) integration of a Mixture-of-Experts (MoE) architecture to preserve model expressiveness; (2) Hardness-explored at Two Granularities (H2G), which dynamically identifies easy and difficult samples at both instance and feature levels; and (3) a Progressive Release Distillation strategy that gradually reduces reliance on the teacher model, thereby fostering the student's autonomy and improving its generalization capability in complex driving scenarios. In real-world data experiments, CLKD has achieved a twofold increase in speed compared to existing approaches while maintaining comparable performance.
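
One way to picture a distillation schedule that gradually reduces reliance on the teacher is sketched below. The abstract does not specify the schedule, so the linear decay, the function names `teacher_weight` and `combined_loss`, and the way the two loss terms are mixed are all assumptions made here for illustration.

```python
def teacher_weight(step, total_steps, start=1.0, end=0.0):
    """Weight on the distillation term at a given training step.

    Decays linearly from `start` to `end` (an assumed schedule; the
    abstract only says reliance on the teacher is reduced gradually).
    """
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def combined_loss(task_loss, distill_loss, step, total_steps):
    """Blend the student's own task loss with the teacher-matching loss."""
    w = teacher_weight(step, total_steps)
    return (1.0 - w) * task_loss + w * distill_loss
```

Early in training the student mostly imitates the teacher; by the final step it is driven only by its own task loss.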

Abstract:
Comprehensive visual, geometric, and semantic understanding of a 3D scene is crucial for successful execution of robotic tasks, especially in unstructured and complex environments. Additionally, to make robust decisions, the robot must evaluate the reliability of perceived information. While recent advances in 3D neural feature fields have enabled robots to leverage features from pretrained foundation models for tasks such as language-guided manipulation and navigation, existing methods suffer from two critical limitations: (i) they are typically scene-specific, and (ii) they lack the ability to model uncertainty in their predictions. We present UniFField, a unified uncertainty-aware neural feature field that combines visual, semantic, and geometric features in a single generalizable representation while also predicting uncertainty in each modality. Our approach, which can be applied zero-shot to any new environment, incrementally integrates RGB-D images into our voxel-based feature representation as the robot explores the scene, simultaneously updating uncertainty estimation. We evaluate how accurately our uncertainty estimates describe the model prediction errors in scene reconstruction and semantic feature prediction. Furthermore, we successfully leverage spatial and semantic feature predictions and their respective uncertainty for an active object search task using a mobile manipulator robot, demonstrating the capability for robust decision-making.

Abstract:
We address multi-robot motion planning under Signal Temporal Logic (STL) specifications with kinodynamic constraints. Exact approaches face scalability bottlenecks and limited adaptability, while conventional sampling-based methods require excessive samples to construct optimal trajectories. We propose a two-stage framework integrating sampling-based online learning with formal STL reasoning. At the single-robot level, our constrained Bayesian Optimization-based Tree search (cBOT) planner uses Gaussian processes as surrogate models to learn local cost maps and feasibility constraints, generating shorter collision-free trajectories with fewer samples. At the multi-robot level, our STL-enhanced Kinodynamic Conflict-Based Search (STL-KCBS) algorithm incorporates STL monitoring into conflict detection and resolution, ensuring specification satisfaction while maintaining scalability and probabilistic completeness. Benchmarking demonstrates improved trajectory efficiency and safety over existing methods. Real-world experiments with autonomous surface vehicles validate robustness and practical applicability in uncertain environments. The STLcBOT Planner will be released as an open-source package, and videos of real-world and simulated experiments are available at https://stlbot.github.io/.

Abstract:
Force estimation is crucial for robotics, human-machine interaction, and industrial automation. However, traditional methods are often hindered by high cost, mechanical wear, and limited accuracy in dynamic scenarios. Vision-based tactile sensing provides a promising alternative, yet existing approaches commonly rely on static calibration and degrade under dynamic interactions such as slip. To overcome these limitations, we present a novel force prediction framework for TacTip sensors, termed the Frame-stack Force Prediction Method (FFPM). The framework integrates a Dynamic Tactile Flow Encoder to capture spatiotemporal features, enabling accurate modeling of dynamic force variations. An Exponentially Weighted Residual Correction strategy is further introduced to refine predictions by leveraging historical residuals, yielding smoother and more reliable force estimation. The predicted forces are incorporated into a force-tracking impedance control scheme, achieving precise tracking during slip interactions. Experiments on our constructed dataset demonstrate state-of-the-art performance, reducing MAPE to 12.54%, and further validate the effectiveness of the proposed framework in real-world dynamic force estimation and control.

Abstract:
Zero-shot object-goal navigation (ZSON) is a challenging problem in robotics that requires a comprehensive understanding of both language and visual observations. Contextual cues from rooms and objects are critical, but their relative importance depends on the target: some objects are strongly tied to specific room types, while others are better predicted by nearby co-located objects. Existing methods overlook this distinction, leading to inefficient and inaccurate exploration. We present CLUE, a novel navigation framework that adaptively balances the use of contextual rooms and objects by leveraging commonsense knowledge extracted from an offline large language model (LLM). By estimating a target's association with room types using an LLM, the agent prioritizes room cues for predictable objects and object cues for those with weak room associations. Our framework constructs a unified semantic value map that integrates both types of contextual information, adaptively weighted by the target's ambiguity to guide exploration. Combined with multi-viewpoint verification and an exploration strategy informed by contextual cues, CLUE achieves robust and efficient navigation. Extensive experiments in simulation and real-world deployments show that our method consistently outperforms state-of-the-art baselines in both success rate (SR) and success weighted by path length (SPL), demonstrating its effectiveness and practicality for real-world navigation tasks.
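
The adaptive weighting of room and object cues described above could look roughly like the sketch below. The abstract does not give the fusion rule, so the convex combination, the scalar `room_association` in [0, 1], and the function names are assumptions made here purely for illustration.

```python
def fuse_value_maps(room_values, object_values, room_association):
    """Blend two grid maps cell by cell (hypothetical fusion rule).

    room_association: assumed scalar in [0, 1]; values near 1 mean the
    target is strongly tied to a room type, so room cues dominate.
    """
    if not 0.0 <= room_association <= 1.0:
        raise ValueError("room_association must lie in [0, 1]")
    if len(room_values) != len(object_values):
        raise ValueError("maps must have the same number of rows")
    fused = []
    for room_row, object_row in zip(room_values, object_values):
        if len(room_row) != len(object_row):
            raise ValueError("maps must have the same number of columns")
        fused.append([
            room_association * r + (1.0 - room_association) * o
            for r, o in zip(room_row, object_row)
        ])
    return fused


def best_cell(value_map):
    """Return the (row, col) index of the highest-valued cell."""
    cells = [(v, (i, j)) for i, row in enumerate(value_map) for j, v in enumerate(row)]
    return max(cells, key=lambda cell: cell[0])[1]
```

With this rule, a target with a strong room association steers exploration toward the cell favoured by the room map, while an ambiguous target follows the object map instead.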

Abstract:
Combining data-driven models that adapt online and model predictive control (MPC) has enabled effective control of nonlinear systems. However, when deployed on unstable systems, online adaptation may not be fast enough to ensure reliable simultaneous learning and control. For example, a controller on a vehicle executing highly dynamic maneuvers, such as drifting to avoid an obstacle, may push the vehicle's tires to their friction limits, destabilizing the vehicle and allowing modeling errors to quickly compound and cause a loss of control. To address this challenge, we present an active information gathering framework for identifying vehicle dynamics as quickly as possible. We propose an expressive vehicle dynamics model that leverages Bayesian last-layer meta-learning to enable rapid online adaptation. The model's uncertainty estimates are used to guide informative data collection and quickly improve the model prior to deployment. Dynamic drifting experiments on a Toyota Supra show that (i) the framework enables reliable control of a vehicle at the edge of stability, (ii) online adaptation alone may not suffice for zero-shot control and can lead to undesirable transient errors or spin-outs, and (iii) active data collection helps achieve reliable performance.

Abstract:
Semantic segmentation and stereo matching, respectively analogous to the ventral and dorsal streams in the human brain, are two key components of autonomous driving perception systems. Addressing these two tasks with separate networks is no longer the mainstream direction in developing computer vision algorithms, particularly with the recent advances in large vision models and embodied artificial intelligence. The trend is shifting towards combining them within a joint learning framework, especially emphasizing feature sharing between the two tasks. The major contributions of this study lie in comprehensively tightening the coupling between semantic segmentation and stereo matching. Specifically, this study makes three key contributions: (1) a tightly coupled, gated feature fusion strategy, (2) a hierarchical deep supervision strategy, and (3) a coupling tightening loss function. The combined use of these technical contributions results in TiCoSS, a state-of-the-art joint learning framework that simultaneously tackles semantic segmentation and stereo matching. Through extensive experiments on the KITTI, vKITTI2, and Cityscapes datasets, along with both qualitative and quantitative analyses, we validate the effectiveness of our developed strategies and loss function. Our approach demonstrates superior performance compared to prior art, with a notable increase in mean intersection over union by over 9%.

Abstract:
In this article, we present a framework for deploying aerial multiagent systems in large-scale subterranean environments with minimal supporting infrastructure. The objective is to optimally and reactively execute routine inspection tasks, selected by a mine operator on-the-fly. The assignment of currently available tasks to the agents is accomplished through an auction-based system, where the agents bid for available tasks, which are used by a central auctioneer to optimally assign the tasks. A mobile Wi-Fi mesh supports interagent communication and bi-directional communication between agents and the task allocator, while the task execution is performed completely infrastructure-free. Given a task to be accomplished, reliable and modular agent behavior is synthesized by generating behavior trees from a pool of agent capabilities, using a back-chaining approach. The auction system is reactive and supports the addition of new tasks on-the-go, at any point through a user-friendly operator interface. The framework has been validated in a real underground mining environment using three aerial agents, with several inspection locations spread in an environment of almost 200 m as a proof-of-concept. The scalability, fault tolerance, and the influence of agent initializations on the multiagent architecture have been tested through complementary Gazebo simulations in a cave environment. The proposed framework can be utilized in a subterranean environment for missions involving rapid i

Abstract:
Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to Visual Language Navigation (VLN), where an agent follows natural language instructions using only ego perception and reasoning. However, existing zero-shot methods typically construct a naive observation graph and perform per-step VLM-LLM inference on it, resulting in high latency and computation costs that limit real-time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow-fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator for real-time object graph construction and subgoal execution; and 3) a lightweight asynchronous slow-fast bridge that aligns advanced structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first slow-fast collaboration zero-shot VLN system supporting asynchronous LLM triggering according to the internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo-Nav matches or exceeds prior state-of-the-art zero-shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5 times faster. Finally, we demonstrate SFCo-Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.

Abstract:
This study presents a lightweight, wearable fingertip haptic device that provides physics-based haptic feedback for dexterous manipulation in virtual environments without hindering real-world interactions. The device, designed with thin strings and actuators attached to the fingernails, ensures minimal weight (1.55 g per finger) and preserves finger flexibility. Integrating the software with a physics engine renders multiple types of haptic feedback (grip force, collision, and sliding vibration feedback). We evaluated the device's performance in pressure perception, slip feedback, typical dexterous manipulation tasks, and daily operations, and we gathered user experience through subjective assessments. Our results show that participants could perceive and respond to pressure and vibration feedback. Through dexterous manipulation experiments, we further demonstrated that these minimal haptic cues significantly improved virtual task efficiency, showcasing how lightweight haptic feedback can enhance manipulation performance without complex mechanisms. The device's ability to preserve tactile sensations and minimize hindrance to real-world operations is a key advantage over glove-type haptic devices. This research offers a potential solution for designing haptic interfaces that balance lightweight construction, haptic feedback for dexterous manipulation, and daily wearability.

Abstract:
This paper describes a method for steering flexible linear objects using two robot hands in environments populated by sparsely spaced obstacles. The approach involves manipulating an elastic inextensible rod by varying the gripping endpoint positions and tangents. Closed-form solutions that describe the flexible linear object shape in planar environments, Euler's elastica, are described. The paper uses the elastica solutions to formulate criteria for non-self-intersection, stability, and obstacle avoidance in an analytic closed-form manner. These criteria are formulated as constraints in the flexible object's six-dimensional configuration space, which represents the robot gripping endpoint positions and tangents. In particular, this paper introduces a novel criterion that ensures the flexible object's stability during steering. All safety criteria are integrated into a scheme for steering flexible linear objects in planar environments, which is lifted into a steering scheme in three-dimensional environments populated by sparsely spaced obstacles. Experiments with a dual-arm robot demonstrate the method.

Abstract:
The integration of large-scale foundation models in control loops has shown strong potential for executing complex tasks directly from natural language inputs. However, achieving stability and real-time performance remains a significant challenge, particularly for systems with hard-to-model dynamics. In this paper, we introduce Prompt-to-State Stability (PSS) and propose the Prompt-to-State Stable Vision-Language Model Predictive Control (PSS-VLMPC) framework, which couples a vision-language model (VLM) with a robust model predictive control (MPC) scheme. The VLM interprets user commands and visual feedback, converting them into control-relevant parameters for the MPC. System dynamics are fully learned by a neural network and then approximated to enable real-time MPC performance. Building on prediction error bounds, we provide a rigorous closed-loop stability guarantee and validate the effectiveness of PSS-VLMPC through both simulations and real-world experiments on a soft continuum robot, demonstrating its ability to robustly execute tasks from natural language instructions.

Abstract:
Recent years have seen a significant increase in demand for robotic solutions in unstructured natural environments, alongside growing interest in bridging 2D and 3D scene understanding. However, existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for addressing the challenges posed by complex, unstructured natural settings. To address this gap, we propose WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps. We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks. We provide access to the code repository and dataset at https://csiro-robotics.github.io/WildCross.

Abstract:
Accurate soil moisture data is crucial to precise irrigation, but manual deployment of existing sensors is labor-intensive and expensive, especially in cornfield environments. We present DRILL, an unmanned ground vehicle (UGV) for autonomously deploying and reading low-cost biodegradable soil moisture sensors. The platform consists of a mechanical drilling head (linear actuator, auger drill, 16-slot encoded sensor dispenser, and a chute for guiding) and a reading head with a vector network analyzer (VNA), combined with a vision-guided navigation system for logging soil moisture data without human intervention. The robot platform has been experimentally validated in real-world farm environments over an extensive period and achieved a success rate of 93.75% for the deployment cycle and 100% for the reading cycle, with a mean cycle time of under a minute per sensor. Out of 330 sensor readings with the VNA, 73.3% overall produced valid peaks in the 100-160 MHz range, indicating a valid soil moisture reading, and over 95.3% did so during the first half of the study, suggesting sensor aging. With a mean in-ground plane alignment error of 1.3 cm in X and 0.6 cm in Y, well within the 4 cm tolerance in each axis, DRILL demonstrates a scalable platform for autonomous soil monitoring and timely data collection in precision agriculture.

Abstract:
Trajectory planning plays a pivotal role in robotic motion planning, particularly in achieving time-optimal motion under complex dynamic constraints. Although the Time-Optimal Path Parameterization (TOPP) algorithm effectively addresses trajectory generation under joint torque constraints, classical methods often overlook third-order constraints. As a result, the generated trajectories, while torque-feasible, exhibit excessive jerk and poor dynamic stability, which limits their practical applicability. To overcome these limitations, this paper proposes a trajectory planning framework that simultaneously enforces torque and jerk constraints. Building upon torque-constrained TOPP, the method integrates a shooting-based strategy to identify switching points through bidirectional integration under jerk constraints and employs a Sigmoid-based fusion scheme to eliminate integration errors and ensure smooth transitions. The proposed approach is experimentally validated on a six-degree-of-freedom industrial robot. Comparative evaluations with the TOPP-RA algorithm demonstrate that the method significantly reduces both high-frequency vibrations during high-speed execution and residual oscillations after motion termination. Feedback from torque rate measurements, vibration sensors, and laser tracker data confirms faster settling and improved compliance, making the approach well-suited for complex industrial scenarios.

Abstract:
Wireless capsule endoscopy provides a minimally invasive method for examining the gastrointestinal (GI) tract; however, most existing systems are limited to passive operation and single functions, restricting control and functionality. This work presents the design, fabrication, and experimental evaluation of a multifunctional magnetically actuated capsule for drug delivery, sampling, and cargo transport. The capsule incorporates a novel spring-magnet mechanism that enables controlled, repeatable opening and closing under external magnetic fields using a single actuation input. In parallel, a large-workspace magnetic actuation platform is developed to support autonomous navigation and task execution. Iterative capsule designs improved fabrication and sealing performance, guided by analytical modeling. Experimental results demonstrate a substantial reduction in the required magnetic field for actuation (from 38.3 ± 7.7 mT to 12.7 ± 2.5 mT), alongside an approximately 4-fold reduction in leakage (6.19% vs. 23.59%). The actuation platform achieved accurate path tracking with a mean deviation of 2.63 mm across multiple trajectories and enabled navigation in a stomach phantom. These results demonstrate the feasibility of a multifunctional capsule platform with integrated actuation for minimally invasive GI interventions.

Abstract:
Sit-to-stand (STS) transfer is a fundamental but challenging movement that plays a vital role in older adults' daily activities. The decline in muscular strength and coordination ability can result in difficulties performing STS and, therefore, the need for mobility assistance by humans or assistive devices. Robotic rollators are being developed to provide active mobility assistance to older adults, including STS assistance. In this paper, we consider the robotic walker SkyWalker, which can provide active STS assistance by moving the handles upwards and forward to bring the user to a standing configuration. In this context, it is crucial to monitor whether the user is performing the STS and adapt the rollator's control accordingly. To achieve this, we utilized a standard vision-based method for estimating the human pose during the STS movement using Mediapipe pose tracking. Since estimating a user's state from extreme proximity to the camera is challenging, we compared the pose identification results from Mediapipe to ground truth data obtained from Vicon marker-based motion capture to assess the accuracy and reliability of the estimated STS motion. The fourteen kinematic features critical for accurate pose estimation were selected based on a literature review and the specific requirements of our robot's STS method. By employing these features, we have implemented a phase classification system that enables the SkyWalker to classify the user's STS phase in real time. The selected kinematics from the vision-based human state estimation method and the trained classifier can furthermore be generalized to other types of motion support, including adaptive STS path planning and emergency stops for safety assurance during STS.

Abstract:
In estimating odometry accurately, an inertial measurement unit (IMU) is widely used owing to its high-rate measurements, which can be utilized to obtain motion information through IMU propagation. In this paper, we address the limitations of existing IMU propagation methods in terms of motion prediction and motion compensation. In motion prediction, the existing methods typically represent a 6-DoF pose by separating rotation and translation and propagate them on their respective manifolds, such that the rotational variation is not effectively incorporated into translation propagation. During motion compensation, the relative transformation between predicted poses is used to compensate motion-induced distortion in other measurements, while inherent errors in the predicted poses introduce uncertainty in the relative transformation. To tackle these challenges, we represent and propagate the pose on the SE(3) manifold, where the propagated translation properly accounts for rotational variation. Furthermore, we precisely characterize the relative transformation uncertainty by considering the correlation between predicted poses, and incorporate this uncertainty into the measurement noise during motion compensation. To this end, we propose a LiDAR-inertial odometry (LIO) system, referred to as SE(3)-LIO, that integrates the proposed IMU propagation and uncertainty-aware motion compensation (UAMC). We validate the effectiveness of SE(3)-LIO on diverse datasets. Our source code and additional material are available at: https://se3-lio.github.io/.
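
To illustrate how a pose update on SE(3) couples translation to rotation, the standard SE(3) exponential map is sketched below. The notation (pose T_k, body-frame linear velocity v, angular velocity ω, step Δt) and the constant-twist assumption over one step are choices made for this illustration; the paper's full IMU propagation, which also involves acceleration, bias, and gravity terms, is not reproduced here.

```latex
\begin{aligned}
T_{k+1} &= T_k \,\mathrm{Exp}(\boldsymbol{\xi}\,\Delta t), \qquad
\boldsymbol{\xi} = \begin{bmatrix} \mathbf{v} \\ \boldsymbol{\omega} \end{bmatrix}, \qquad
\boldsymbol{\phi} = \boldsymbol{\omega}\,\Delta t, \qquad
\theta = \lVert \boldsymbol{\phi} \rVert, \\[4pt]
\mathrm{Exp}(\boldsymbol{\xi}\,\Delta t) &=
\begin{bmatrix}
\exp(\boldsymbol{\phi}^{\wedge}) & \mathbf{J}_l(\boldsymbol{\phi})\,\mathbf{v}\,\Delta t \\
\mathbf{0}^{\top} & 1
\end{bmatrix}, \\[4pt]
\mathbf{J}_l(\boldsymbol{\phi}) &= \mathbf{I}
+ \frac{1-\cos\theta}{\theta^{2}}\,\boldsymbol{\phi}^{\wedge}
+ \frac{\theta-\sin\theta}{\theta^{3}}\,\bigl(\boldsymbol{\phi}^{\wedge}\bigr)^{2}.
\end{aligned}
```

The translation increment is J_l(φ) v Δt rather than v Δt, so the rotation accumulated during the step bends the translated path. When ω = 0, J_l reduces to the identity and the update coincides with a decoupled one.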

Abstract:
In this letter, we investigate whether classical function allocation--the principle of assigning tasks to either a human or a machine--holds for physical Human-Robot Collaboration, which is important for providing insights for Industry 5.0 to guide how to best augment rather than replace workers. This study empirically tests the applicability of Fitts' List within physical Human-Robot Collaboration, by conducting a user study (N=26, within-subject design) to evaluate four distinct allocations of position/force control between human and robot in an abstract blending task. We hypothesize that the function in which humans control the position achieves better performance and receives higher user ratings. When allocating position control to the human and force control to the robot, compared to the opposite case, we observed a significant improvement in preventing overblending. This was also perceived better in terms of physical demand and overall system acceptance, while participants experienced greater autonomy, more engagement and less frustration. An interesting insight was that the supervisory role (when the robot controls both position and force) was rated second best in terms of subjective acceptance. Another surprising insight was that if position control was delegated to the robot, the participants perceived much lower autonomy than when the force control was delegated to the robot. These findings empirically support applying Fitts' principles to static function allocation for physical collaboration, while also revealing important nuanced user experience trade-offs, particularly regarding perceived autonomy when delegating position control.

Abstract:
Despite decades of research in magnetic resonance imaging (MRI)-compatible robotic technologies, the existing MRI-safe needle drivers rarely feature simultaneously high compactness, large insertion force, and motion versatility, all of which are critical to facilitate clinical translation in intraoperative MRI-guided percutaneous procedures. The paper presents an MR-safe needle driver that for the first time offers all these desired qualities. It measures only 2.2 × 5.3 × 3.8 cm (length × width × height), facilitating its adoption in in-bore skull-mounted or body-mounted MRI-guided procedures. It is driven by a single hydraulic bellows-based actuator, which provides good water sealing, smooth motion, and high expansion ratio, and a pre-clamped gripper design that offers large insertion force (>10 N). A compact passive rotation mechanism, together with a motion decoupling and switching mechanism, was introduced, allowing the needle to move with three motion types: independent translation, translation with passive rotation, and independent rotation. The passive rotation motion reduces needle deformation and tissue resistance during insertion, while the combination of independent tran

Abstract:
We present a deployable inflatable robotic torso with an origami-inspired spine, designed to combine the inherent compliance of soft robots with the controllability of skeletal structures. Unlike simple inflatable cylinders, which deform unpredictably through membrane buckling, our approach embeds a foldable spine that defines discrete bending axes and enables repeatable motion. Pneumatic inflation provides compact self-deployment, external stiffening, and a compliant outer shell that serves as a protective contact interface, while tendon actuation delivers precise, joint-level control. Experiments demonstrate that the torso replicates and in some cases exceeds human spinal range of motion, and that combined tendon-pneumatic actuation doubles lateral stiffness compared to pneumatics alone. We further characterize stiffness-motion trade-offs across pressures, showing tunable performance relevant to contact-rich operation. This integration of origami endoskeletons with inflatable bodies advances deployable humanoid-scale robots, addressing the gap between compliant contact behavior and controlled movement.

Abstract:
The absence of force feedback remains a major bottleneck in the development of robotic laparoendoscopic single-site (R-LESS) surgery, reducing the control precision of surgical instruments and increasing the risk of tissue damage. To address this challenge, we propose a miniature triaxial force sensor based on Fiber Bragg Grating (FBG), featuring high precision, nonlinear decoupling capability, and seamless integration with the tool tip of a continuum manipulator for single-port access surgery. The sensor features a monolithic elastic body with a dumbbell-shaped groove, where four FBGs are symmetrically arranged at 90° intervals around the circumference to form a redundant measurement unit, thereby enhancing sensing accuracy. A novel Whale Migration Algorithm Based Kernel Extreme Learning Machine (WMA-KELM) is introduced to address the nonlinear coupling influences arising from manipulator integration, demonstrating superior accuracy and robustness compared to conventional methods. Experimental results show that within the ranges of axial force [0 N, 5 N] and radial force [-2.5 N, 2.5 N], the maximum full-scale (FS) error is less than 1% in all dimensions, the maximum RMSE is 0.0308 N, and the maximum repeatability error is within ±0.24%. These results validate the force sensor integrated with the continuum manipulator and show that the proposed algorithm is effective and reliable.

Abstract:
We propose a fully distributed real-time model predictive control framework for transporting a single rigid object with multiple mobile manipulators. Each robot rapidly solves a local optimal control problem via Box-iLQR, while ADMM enforces consensus on the shared object state without centralized computation. The core idea is an object-centric planar orthographic projection that reduces the whole-body state and input dimensions, substantially lowering the computational load of linearization and the Riccati backward pass. Simulations demonstrate accurate trajectory tracking and consistent convergence. Specifically, the proposed dimension-reduced Box-iLQR solver operates at an average of 6.32 ms per iteration, approximately 4 times faster than a full 6-DoF model and cutting the computational cost of SQP-based methods nearly in half. Despite this significant reduction, our controller achieves comparable tracking accuracy, offering a practical alternative for real-time cooperative manipulation under limited compute and communication resources. The framework scales naturally with the number of robots and provides a concise and effective design for cooperative mobile manipulation grounded in real-time distributed optimization.
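
As an illustration of the consensus idea, the following minimal sketch runs scaled-form consensus ADMM on a toy problem. It is not the paper's controller: each robot's Box-iLQR solve is replaced by a hypothetical scalar quadratic cost, 0.5 * (x - a_i)^2, whose minimizer is available in closed form, and the names `consensus_admm`, `targets`, and `rho` are assumptions made for this example. The averaging step is written as a plain mean; in a distributed deployment it would have to be computed through inter-robot communication.

```python
def consensus_admm(targets, rho=1.0, iterations=100):
    """Toy scaled-form consensus ADMM.

    Agent i holds a private cost 0.5 * (x - targets[i])**2 (a stand-in for a
    local optimal control problem), and all agents must agree on one value z.
    """
    n = len(targets)
    x = [0.0] * n   # local copies of the shared variable
    u = [0.0] * n   # scaled dual variables, one per agent
    z = 0.0         # consensus value

    for _ in range(iterations):
        # Local step: closed-form minimizer of
        # 0.5 * (x - a_i)**2 + (rho / 2) * (x - z + u_i)**2.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Consensus step: average of the local estimates plus their duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z, x


if __name__ == "__main__":
    z, x = consensus_admm([1.0, 2.0, 6.0])
    print(z, x)
```

For this toy cost the agreed value converges to the mean of the private targets, which is the minimizer of the summed costs. With a real local solver, only the local step would change.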

Abstract:
This work covers the design of a sliding mode control to stabilize the attitude of a flapping-wing micro aerial vehicle. The approach employs an auxiliary observer loop to avoid system excitation from unmodeled actuator dynamics, a common issue in sliding mode control applications. A proportional-integral observer is constituted in the auxiliary loop to minimize interactions with the actuator dynamics and to handle parametric uncertainties in the low bandwidth. Then, the observer-based sliding mode control is designed to track the attitude command with the reconstructed state variables from the observer loop. Furthermore, a barrier function-based adaptive gain strategy is utilized to modulate the control input according to the system's current state, ensuring efficient use of control effort. Flight experiments were conducted with a freely movable dummy mass attached to the bottom of the vehicle, simulating external disturbances. The proposed sliding mode control outperforms PD, classical, and super-twisting sliding mode controls in tracking performance and control efficiency, while mitigating self-excitation due to discontinuous input.

Abstract:
Trajectory planning for autonomous driving in dynamic unstructured traffic remains a fundamental challenge. Existing methods are often reactive, i.e., they only respond to observed situations without explicitly anticipating future risks. Moreover, most reinforcement learning based approaches rely on manually crafted reward functions, which limits their adaptability and generalization across complex driving scenarios. In this paper, we propose a novel RL-based trajectory planning framework that integrates proactive obstacle avoidance and adaptive reward learning. Specifically, our planner predicts the future trajectories of surrounding traffic participants as well as potential ghost-probe risk zones, and proactively avoids these high-risk regions during planning. In addition, we introduce a large-model agent that dynamically adjusts the reward signals according to evolving traffic contexts, enabling more adaptive and robust policy learning compared with fixed reward designs. To evaluate our method, we build a high-fidelity simulation environment based on the Peking University campus, which provides realistic unstructured traffic scenarios. Extensive experiments demonstrate that our method significantly improves safety, efficiency, and generalization over state-of-the-art baselines, particularly in scenarios with occlusions and unpredictable behaviors. We may open-source our code and simulation environment for community benefit soon.

Abstract:
Searches are conducted to find missing persons and/or objects given uncertain information, imperfect observers and large search areas in Search and Rescue (SAR). In many scenarios, such as Maritime SAR, expected survival times are short and optimal search could increase the likelihood of success. This optimization problem is complex for nontrivial problems given its probabilistic nature. Stochastic optimization methods search large problems by nondeterministically sampling the space to reduce the effective size of the problem. This has been used in SAR planning to search otherwise intractably large problems but the stochastic nature provides no formal guarantees on the quality of solutions found in finite time. This paper instead presents ASAR, an ε-optimal search algorithm for SAR planning. It calculates a heuristic to bound the search space and uses graph-search methods to find solutions that are formally guaranteed to be within a user-specified factor, ε, of the optimal solution. It finds better solutions faster than existing optimization approaches in operational simulations. It is also demonstrated with a real-world field trial on Lake Ontario, Canada, where it was used to locate a drifting manikin in only 150s

Abstract:
Flexible manufacturing requires rapid deployment of solutions and minimal setup time to remain competitive. An essential attribute is the ability to control error levels, as failures can range from minor performance degradation to severe equipment damage. However, conventional deployment often involves extensive setup, data collection, model training or parameter tuning, and system testing, resulting in significant delays that hinder commercial feasibility. We propose a data engine that gathers data and improves its performance while executing the task. The data engine consists of two classifiers: a fast model prediction and an expensive verification. First, a model prediction is performed; then, based on the confidence level of the prediction, the expensive verification can be invoked. By adjusting the confidence level, users can control the level of tolerable error. Our method is implemented on a real-world robotic insertion task, which uses force data for the model prediction. The system applies UMAP dimensionality reduction and uses the Wilson score to compute the confidence bounds of the prediction. Results demonstrate the ability to learn and reduce the need for expensive verifications over time, while staying within the set error rate. The results highlight the potential of confidence bounds in self-improving models to enhance reliability in robotic classification tasks.
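
The confidence-gated verification can be sketched with the Wilson score interval as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the bookkeeping of `correct` and `total` counts, and the gating rule (trust the fast model only when the lower accuracy bound reaches 1 minus the tolerated error rate) are hypothetical, and the UMAP step is omitted.

```python
import math


def wilson_bounds(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials == 0:
        return 0.0, 1.0  # no evidence yet: the interval is uninformative
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def needs_verification(correct, total, max_error_rate):
    """Return True when the fast model cannot yet be trusted on its own.

    `correct` and `total` count how often past fast predictions (hypothetically
    grouped with the current one) agreed with the expensive verification.
    """
    lower, _ = wilson_bounds(correct, total)
    return lower < 1.0 - max_error_rate


if __name__ == "__main__":
    print(wilson_bounds(95, 100))
    print(needs_verification(95, 100, max_error_rate=0.05))
    print(needs_verification(995, 1000, max_error_rate=0.05))
```

With few samples the lower bound stays well below the observed accuracy, so verification is requested often. As verified samples accumulate, the interval tightens and verification is requested less often, which matches the reduction in expensive verifications described above.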

Abstract:
This paper presents a spiral-cavity wheel for lunar regolith excavation and a sensor-light evaluation stack that jointly estimates fill ratio (vision), sinkage (vision), and specific energy from actuator logs. In benchtop tests (four revolutions at 5, 10, and 15 RPM) against two literature baselines, the proposed wheel achieved higher excavated mass and fill ratio, delivering a 2.2 to 3.0 times higher excavation rate while reducing specific energy by 29% relative to a bucket-drum baseline. Normalized sinkage (mm/kg) was also lower, indicating stable traction without bogging. Effort-time traces show a steady torque envelope with repeatable cut-carry-dump cycles across speeds. We provide a retention index η that correlates with fill ratio and a DEM setup that reproduces experimental trends with low error. Results suggest spiral-cavity wheels can replace heavier multi-actuator diggers when mass, simplicity, and energy efficiency are mission drivers.

Abstract:
Unmanned aerial vehicles (UAVs) operating in confined, cluttered environments face significant performance degradation due to nonlinear, time-varying unmodeled dynamics, such as ground/ceiling effects and wake recirculation, that are unaccounted for in traditional controllers. While learning-based compensators (e.g., MLPs, TCNs, LSTMs) struggle with historical data dependency, vanishing gradients, and prohibitive computational costs, this work pioneers the integration of a deep photonic reservoir computer (PRC) with feedforward control to overcome these limitations. Harnessing semiconductor laser dynamics and optical feedback, our hardware-implemented deep PRC architecture achieves intrinsic temporal memory without explicit historical inputs, while reducing training time from hours to milliseconds and slashing inference latency to nanoseconds. Reliable high-performance CFD simulations capturing proximity-induced flows demonstrate that deep PRC delivers residual-force prediction accuracy comparable to or exceeding TCN/MLP baselines, while training only a linear readout layer via ridge regression. By injecting these predictions into a nonlinear feedback PID controller via a feedforward channel, the framework significantly enhances closed-loop tracking stability in confined spaces. Essentially, this work establishes the first deep PRC-based lightweight, ultra-fast solution for real-time UAV dynamic compensation, with promising extensibility to unseen scenarios with more complex fluid environments.
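
For reference, training only a linear readout by ridge regression has the standard closed form sketched below. The notation is assumed here rather than taken from the paper: X stacks N reservoir node responses over T samples, Y stacks the M target residual-force components over the same samples, and λ > 0 is a regularization weight.

```latex
\begin{aligned}
W_{\mathrm{out}} &= \arg\min_{W}\; \lVert W X - Y \rVert_F^{2} + \lambda \lVert W \rVert_F^{2}
 = Y X^{\top} \bigl( X X^{\top} + \lambda I_N \bigr)^{-1}, \\
\hat{\mathbf{y}}(t) &= W_{\mathrm{out}}\, \mathbf{x}(t),
\qquad X \in \mathbb{R}^{N \times T},\quad Y \in \mathbb{R}^{M \times T}.
\end{aligned}
```

Because the reservoir itself is left untrained, fitting reduces to one N × N linear solve, which is consistent with the short training times reported above.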

Abstract:
Robot failure is detrimental and disruptive, often requiring human intervention to recover. Our vision is 'fail-active' operation, allowing robots to safely complete their tasks even when damaged. Focusing on 'actuation failures', we introduce DEFT, a diffusion-based trajectory generator conditioned on the robot's current embodiment and task constraints. DEFT generalizes across failure types, supports constrained and unconstrained motions, and enables task completion under arbitrary failure. We evaluate DEFT in both simulation and real-world scenarios using a 7-DoF robotic arm. DEFT outperforms its baselines over thousands of failure conditions, achieving a 99.5% success rate for unconstrained motions versus RRT's 42.4%, and 46.4% for constrained motions versus differential IK's 30.9%. Furthermore, DEFT demonstrates robust zero-shot generalization by maintaining performance on failure conditions unseen during training. Finally, we perform real-world evaluations on two multi-step tasks, drawer manipulation and whiteboard erasing. These experiments demonstrate DEFT succeeding on tasks where classical methods fail. Our results show that DEFT achieves fail-active manipulation across arbitrary failure configurations and real-world deployments.

Abstract:
Previous work has shown that adding haptic feedback to the hands can improve awareness of tool-tissue interactions and enhance performance of teleoperated tasks in robot-assisted minimally invasive surgery. However, hand-based haptic feedback occludes direct interaction with the manipulanda of surgeon consoles. We propose relocating haptic feedback to the wrist using a wearable haptic device. It is unknown if such feedback will be effective, given that it is not co-located with the finger movements used for manipulation. To test if relocated haptic feedback improves force application during teleoperated tasks using the da Vinci Research Kit (dVRK) surgical robot, participants learned to palpate a phantom tissue to desired forces. A soft pneumatic wrist-worn haptic device with an anchoring system rendered tool-tissue interaction forces to the wrist of the user. Participants demonstrated statistically significantly lower force error and performed the palpation task with longer movement times when provided wrist-worn haptic feedback.

Abstract:
Current AAV autonomous flights exhibit efficient performance in both indoor and field environments. However, they often face significant challenges in large-scale and cluttered environments, where the vast amount of captured data can lead to computation and storage bottlenecks. Additionally, the existing gradient-based planning methods depend on appropriate resolutions to adapt to different scenarios. In this letter, we present FELP, a fast and effective autonomous flight system for large-scale and cluttered environments based on the unified linear parametric map. It can enhance the adaptability of planners to diverse environments. First, by the random mapping method (RMM), the original irregular points in low-dimensional space are mapped into the high-dimensional space, where the points are approximately linearly separable or distributed. Leveraging the features of this mapping space, we can quickly obtain the occupancy state and Euclidean distance (the distance to the nearest obstacle) rather than relying on a large number of queries and repeated iterations. Then we learn a unified linear parametric model about grid maps and ESDF maps. Based on the linear parametric model, path searching is quickly executed in the front-end. Unlike traditional methods that compute the ESDF through interpolation, the closed-form ESDF can be solved efficiently, enabling real-time online trajectory optimization in the back-end. Compared to EGO-Planner, FELP reduces the mapping time by 68% and the planning time by 29%. Simulation and real-world experiments are conducted to verify its comprehensive performance compared to typical methods and state-of-the-art methods.

Abstract:
We explore the use of mobile furniture swarms that are intended to assist users with limited mobility in their daily indoor activities. We focus on the multi-agent coordination problem for a mobile furniture swarm when a dense target pose configuration is required, such as in an apartment setting. In those cases, one agent's convergence to the target can be significantly affected by neighboring agents with specific shapes. In this letter, we propose a solution, named Velocity Potential Field Modulation (VPFM), to deal with the dense coordination problem of a polytopic swarm in a decentralized manner. We adapt our method to assistive applications, such as room reconfigurations and facilitating indoor movement of wheelchair users. We evaluate the performance of our method in simulations and on real-world mobile furniture hardware, demonstrating its effectiveness and real-time performance.

Abstract:
Robot Operating System 2 (ROS 2) has become the standard framework in modern robotics, but its steep learning curve and complex development environment present significant barriers to newcomers. This paper presents an open-source web lab through which learners can start practicing ROS 2 programming on real robots right away with no development environment setup. The web lab offers a realistic ROS 2 development experience by serving browser-based Linux desktop workstations linked to physical remote robots. We also show how the web lab system can be leveraged to create portable infrastructure for hosting ad-hoc on-site robotics workshops that require zero setup time from the participants.

Abstract:
We introduce GFreeDet2, which leverages Gaussian Splatting and foundation models to address RGB-based model-free 2D detection and 6D detection of unseen objects. GFreeDet2 reconstructs 3D Gaussian object models from multi-view RGB references, enabling efficient model-free detection without relying on CAD models. To accelerate reconstruction and consistently handle both pinhole and fisheye cameras, we propose projection-aware perspective cropping (PAPC) with visual hull initialization. PAPC further improves coarse 6D detection by accurately extracting pinhole crops from fisheye query images. The Gaussian objects enable rendering in place of CAD models within foundation model-driven pipelines, allowing existing state-of-the-art RGB-based methods for unseen 2D and 6D detection to be extended to the model-free setting with minimal modifications. Extensive experiments on all three BOP-H3 datasets demonstrate that GFreeDet2 achieves state-of-the-art performance and establishes a strong baseline for RGB-based, model-free 2D and 6D unseen object detection. The code is publicly available at https://github.com/wangg12/GFreeDet2.git.

Abstract:
Out-of-distribution (OOD) robustness in vision-based autonomous driving is often reduced to a single number, hiding what breaks a policy and by how much. We adopt a factorized view, decomposing environments along five axes: scene (rural/urban), season, weather, time (day/night), and agent mix; and measure performance under controlled k-factor perturbations (k in 0,1,2,3). Using closed-loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary in-distribution (ID) support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC, and adding FM features yields state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single-factor drops are rural -> urban and day -> night (~31% each); actor swaps ~10% and moderate rain ~7%; several season shifts are drastic, and combining a time flip with other changes further degrades performance. (4) FM-feature policies stay above 85% under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all no-FM models fall below 50% by three changes. (5) Interactions are non-additive: some pairings (e.g., urban-night) partially offset, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views of the same configuration improves robustness (about +11.8 points from 5 to 14 traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD 60.6% -> 70.1%) with a small ID drop; single-ID preserves peak performance but in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.
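The controlled k-factor perturbations described above can be illustrated with a minimal sketch that enumerates every configuration differing from a baseline in exactly k factors. The factor levels, the `FACTORS` table, the `BASE` configuration, and the function name are assumptions chosen for illustration; the abstract names the five axes but not the full set of levels used in the benchmark.

```python
from itertools import combinations, product

# Hypothetical levels for the five axes named in the abstract (two per axis here).
FACTORS = {
    "scene": ["rural", "urban"],
    "season": ["summer", "winter"],
    "weather": ["dry", "rain"],
    "time": ["day", "night"],
    "agents": ["cars", "mixed"],
}

# Hypothetical in-distribution baseline configuration.
BASE = {
    "scene": "rural",
    "season": "summer",
    "weather": "dry",
    "time": "day",
    "agents": "cars",
}


def k_factor_perturbations(base, k):
    """Return every configuration that differs from `base` in exactly k factors."""
    configs = []
    for axes in combinations(sorted(FACTORS), k):
        # For each chosen axis, collect the levels that differ from the baseline.
        alternatives = [[v for v in FACTORS[a] if v != base[a]] for a in axes]
        for values in product(*alternatives):
            cfg = dict(base)
            cfg.update(zip(axes, values))
            configs.append(cfg)
    return configs


if __name__ == "__main__":
    for k in range(4):
        print(k, len(k_factor_perturbations(BASE, k)))
```

With two levels per axis, the number of test configurations at each k is the binomial coefficient C(5, k); axes with more levels (for example several seasons) would multiply the count accordingly.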

Abstract:
Resolving disorientation of the surgeon caused by wrong recognition of the scope's position, which often increases procedural time and workload, remains a significant challenge in robotic-assisted retrograde intrarenal surgery (RIRS). This paper introduces a novel hybrid ureteroscope tracking algorithm that integrates low-latency lumen identification with robotic motion data to enhance intrarenal navigation. The system estimates the ureteroscope's position on the centerline of the kidney by recognizing its pathway. In validation tests using a 3D-printed phantom, the proposed method achieved an average localization success rate of 89.2% for major calyx entry and 84.1% for minor calyx entry, with an average computation time of 0.26 seconds, ensuring low-latency operation. Usability testing with ten novice participants demonstrated a 44.5% reduction in cognitive workload (NASA-TLX), improved task success rates, and reduced manipulation effort. These results indicate that the proposed tracking algorithm significantly enhances ureteroscope navigation, improving efficiency and reducing the surgeon's cognitive load in robotic-assisted RIRS.

Abstract:
Sim-to-real transfer remains a fundamental challenge in robot manipulation due to the entanglement of perception and control in end-to-end learning. We present a decoupled framework that learns each component where it is most reliable: control policies are trained in simulation with privileged state to master spatial layouts and manipulation dynamics, while perception is adapted only at deployment to bridge real observations to the frozen control policy. Our key insight is that control strategies and action patterns are universal across environments and can be learned in simulation through systematic randomization, while perception is inherently domain-specific and must be learned where visual observations are authentic. Unlike existing end-to-end approaches that require extensive real-world data, our method achieves strong performance with only 10-20 real demonstrations by reducing the complex sim-to-real problem to a structured perception alignment task. We validate our approach on tabletop manipulation tasks, demonstrating superior data efficiency and out-of-distribution generalization compared to end-to-end baselines. The learned policies successfully handle object positions and scales beyond the training distribution, confirming that decoupling perception from control fundamentally improves sim-to-real transfer.

Abstract:
Magnetic nanoparticle (MNP) guidance has attracted considerable attention for biomedical applications, such as targeted drug delivery and minimally invasive therapy. However, precise navigation in vivo remains challenging, particularly in high-flow vascular environments, where drag forces dominate particle dynamics and real-time feedback is impractical. Here, we present an evolutionary automatic guidance scheme for feedforward control of MNP chains in a vascular model. The proposed approach leverages chain alignment with the flow direction to achieve directional migration into specific branches, without relying on swarm cohesion or online feedback. To provide uniform actuation, a Halbach array is designed and optimized to generate a nearly uniform magnetic force field within the target workspace. A physics-based simulator incorporating magnetic, drag, and wall interaction forces is developed to model chain dynamics, and control sequences are optimized using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), ensuring robustness to variations in chain length and injection conditions. The method is experimentally validated using a four-channel vascular model, demonstrating that feedforward magnetic actuation can reliably guide nanoparticle chains under physiologically relevant high-flow conditions. This study establishes a practical and scalable strategy for nanoparticle navigation, providing a foundation for future biomedical applications in dynamic vascular environments.

Abstract:
The dynamic and heterogeneous nature of agricultural fields presents significant challenges for object detection and localization, particularly for autonomous mobile robots that are tasked with surveying previously unseen unstructured environments. Concurrently, there is a growing need for real-time detection systems that do not depend on large-scale manually labeled real-world datasets. In this work, we introduce a comprehensive annotation-to-detection framework designed to train a robust multi-modal detector using limited and partially labeled training data. The proposed methodology incorporates cross-modal annotation transfer and an early-stage sensor fusion pipeline, which, in conjunction with a multi-stage detection architecture, effectively trains and enhances the system's multi-modal detection capabilities. The effectiveness of the framework was demonstrated through vine trunk detection in novel vineyard settings that featured diverse lighting conditions and varying crop densities to validate performance. When integrated with a customized multi-modal LiDAR and Odometry Mapping (LOAM) algorithm and a tree association module, the system demonstrated high-performance trunk localization, successfully identifying over 70% of trees in a single traversal with a mean distance error of less than 0.37 m. The results reveal that by leveraging multi-modal, incremental-stage annotation and training, the proposed framework achieves robust detection performance regardless of limited starting annotations, showcasing its potential for real-world and near-ground agricultural applications.

Abstract:
While visuomotor policies have advanced in recent years, contact-rich tasks remain a challenge. Robotic manipulation tasks that require continuous contact demand explicit handling of compliance and force. However, most visuomotor policies ignore compliance, overlooking the importance of physical interaction with the real world, often leading to excessive contact forces or fragile behavior under uncertainty. Introducing force information into vision-based imitation learning could help improve awareness of contacts. However, current visuomotor policy approaches require a lot of data to perform well. One remedy for data scarcity is to generate data in simulation, yet computationally taxing processes are required to generate data good enough not to suffer from the Sim2Real gap. In this work, we introduce a framework for generating force-informed data in simulation, instantiated by a single human demonstration, and show how coupling with a compliant policy improves the performance of a visuomotor policy learned from synthetic data. We validate our approach on real-robot tasks, including non-prehensile block flipping and bi-manual object moving, where the learned policy exhibits reliable contact maintenance and adaptation to novel conditions. Project Website: https://flow-with-the-force-field.github.io/webpage/

Abstract:
Recent advances in 4D radar-inertial odometry have demonstrated promising potential for autonomous localization in adverse conditions. However, effective handling of sparse and noisy radar measurements remains a critical challenge. In this paper, we propose a radar-inertial odometry with a spatial weighting method that adapts to unevenly distributed points and a novel point-description histogram for challenging point registration. To make full use of the Doppler velocity from different spatial sections, we propose a weighting calculation model. To enhance the point cloud registration performance under challenging scenarios, we construct a novel point histogram descriptor that combines local geometric features and radar cross-section (RCS) features. We have also conducted extensive experiments on both public and self-constructed datasets. The results demonstrate the precision and robustness of the proposed VGC-RIO.

Abstract:
Drone docking stations promote efficient operations of drones, but they usually support only one vehicle, and are accessible primarily through vertical landing. These limitations hinder multi-drone operations and result in challenges for fast, precise docking, particularly under severe wind conditions. This study assesses the EAGLES port, which uses a horizontal landing approach to address these challenges, and makes a performance comparison between horizontal and vertical landing through analysis of wind tunnel data. Results show that horizontal landing decreases the average landing duration by 35.58%, and can achieve 59.67% faster docking compared to vertical landing in optimal conditions. The system also provides near-zero position error at docking, and supports multiple drones. These advantages stem from improved flight stability, quicker alignment with landing targets, and a 2.8 times higher average velocity compared to vertical landing. These results indicate that vertical landing is better suited for missions with wider landing zones and where delays in landing have mild consequences, whereas horizontal landing excels in scenarios where rapid accurate landings are critical.

Abstract:
The rise in prevalence of AI-enabled technologies (from voice assistants to social robots) has not yet been accompanied by an analogous mastery of computer-mediated humor. Although humans often use jokes to repair interactions and navigate uncomfortable scenarios, social robots in similar roles typically fall short at reading the room and adapting behavior according to sensed social contexts and reactions. We pursued two studies to gain clearer evidence about adaptive robot joking's influence (compared to hardcoded repartee or no robot banter). The first study (N = 48, between-subjects design) examined in-person one-on-one human-robot interactions across the three conditions. The results indicated that adaptive repartee by robots tended to increase perceived warmth, competence, comfort, social closeness feelings, and humorousness, and that human behavioral responses varied significantly between conditions, with any repartee leading to significant gains over no repartee. The second study used an online video-based survey with a within-subjects design (N = 99) to examine the same conditions. This follow-up effort showed significant gains in perceived competence and anthropomorphism for any type of repartee, although this banter also made the robot more discomforting. Our work can help practitioners who are interested in applying playful banter to enhance robot charm and success.

Abstract:
Thoracostomy involves draining fluid from the pleural cavity using chest tubes. This medical intervention is currently performed manually by inserting a hollow flexible tube, risking damage to vital organs, including the lungs, diaphragm, spleen, and mediastinum, due to the lack of control over the tube's path inside the patient's body. Inspired by snake-like structures, continuum robots are particularly well-suited to address the challenges encountered during thoracostomy. Taking advantage of their slender shape, they can nest inside the tubes and guide them from within without requiring further incision. However, available continuum robots are not suitable for this application due to geometrical and payload requirements. In this paper, a novel design is presented, leveraging a multi-backbone structure with asymmetrical rolling joints to enhance payload capacity and dexterity while maintaining the slender shape of the robot. A static modeling approach is proposed to estimate the configuration of the robot given the force applied to the robot, including the effects of friction and gravity often neglected for these robots. Two prototypes were 3D-printed, allowing for after-use disposal due to their cost-effectiveness, thereby preventing cross-contamination. Stiffness and position error were evaluated for the prototypes, demonstrating a modeling accuracy of 2.25%.

Abstract:
This paper proposes an algorithm capable of driving a system to follow a piecewise linear trajectory without prior knowledge of the system dynamics. Motivated by a critical failure scenario in which a system can experience an abrupt change in its dynamics, we demonstrate that it is possible to follow a set of waypoints comprised of states analytically proven to be reachable despite not knowing the system dynamics. The proposed algorithm first applies small perturbations to locally learn the system dynamics around the current state, then computes the set of states that are provably reachable using the locally learned dynamics and their corresponding maximum growth-rate bounds, and finally synthesizes a control action that navigates the system to a guaranteed reachable state.

Abstract:
This paper presents the design and characterization of a new series elastic actuator (SEA) for physical human-robot interaction (pHRI) featuring a compact spring mechanism. The spring mechanism consists of ten compression springs, fitted on output rotors and arranged in a curved formation. The compression springs are enclosed in spring chambers featured in input rotors. This design reduces frictional losses and enables all springs to bear load bidirectionally with minimal preload relative to conventional designs that rely on antagonistic spring arrangements, thereby enhancing deflection range and torque capacity. We introduce the SEA design and experimentally characterize the passive torque-deflection curve and the closed-loop torque tracking bandwidth. Bench testing demonstrates a torque capacity of 18 Nm and a maximum stiffness of 43.8 Nm/rad. As a representative application, the SEA is integrated into an ankle exoskeleton, with the spring mechanism co-located at the ankle joint. Treadmill walking tests with the exoskeleton indicate good torque tracking performance, with a root-mean-square error of 1.48 Nm when applying 12% assistance, corresponding to a peak torque of 17.6 Nm.

Abstract:
Bird's Eye View (BEV) map segmentation is one of the most important and challenging tasks in autonomous driving. Camera-only approaches have drawn attention as cost-effective alternatives to LiDAR, but they still fall behind LiDAR-Camera (LC) fusion-based methods. Knowledge Distillation (KD) has been explored to narrow this gap, but existing methods mainly enlarge the student model by mimicking the teacher's architecture, leading to higher inference cost. To address this issue, we introduce BridgeTA, a cost-effective distillation framework to bridge the representation gap between LC fusion and Camera-only models through a Teacher Assistant (TA) network while keeping the student's architecture and inference cost unchanged. A lightweight TA network combines the BEV representations of the teacher and student, creating a shared latent space that serves as an intermediate representation. To ground the framework theoretically, we derive a distillation loss using Young's inequality, which decomposes the direct teacher-student distillation path into teacher-TA and TA-student dual paths, stabilizing optimization and strengthening knowledge transfer. Extensive experiments on the challenging nuScenes dataset demonstrate the effectiveness of our method, achieving an improvement of 4.2% mIoU over the Camera-only baseline, up to 45% higher than the improvement of other state-of-the-art KD methods. The code will be released upon acceptance.
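To illustrate how Young's inequality can split a direct teacher-student path into two paths through an intermediate representation, the following is a generic sketch rather than the loss derived in the paper. The symbols F_T, F_A, F_S (teacher, TA, and student BEV features) and the free parameter epsilon > 0 are assumptions introduced here.

```latex
% Generic decomposition via the TA feature F_A (illustrative; not the paper's exact loss).
\begin{align}
\|F_T - F_S\|^2
  &= \|(F_T - F_A) + (F_A - F_S)\|^2 \\
  &= \|F_T - F_A\|^2 + 2\langle F_T - F_A,\, F_A - F_S\rangle + \|F_A - F_S\|^2 \\
  &\le \|F_T - F_A\|^2 + 2\,\|F_T - F_A\|\,\|F_A - F_S\| + \|F_A - F_S\|^2 \\
  &\le (1+\epsilon)\,\|F_T - F_A\|^2 + \left(1+\tfrac{1}{\epsilon}\right)\|F_A - F_S\|^2 ,
\end{align}
% The last step uses Young's inequality: 2ab \le \epsilon a^2 + b^2/\epsilon for a, b \ge 0.
```

Under this bound, minimizing the teacher-TA and TA-student terms together controls the direct teacher-student discrepancy, which is one way to read the dual-path decomposition the abstract describes.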

Abstract:
Recovering upper-limb motor functions impaired by trauma or neurological disease is a long and challenging process. To monitor a patient's progress through the various stages of rehabilitation and guide therapy, regular movement assessment is essential. However, such evaluations are rarely conducted in clinical practice due to time constraints and the need for cumbersome equipment. A key limitation is the access to reference motion data, typically derived from averaged movements of unimpaired individuals, which requires new data collection for each task and lacks personalization (e.g., accounting for individual morphology or motor abilities). We present a novel method to generate patient-specific reference motions directly from the patient's hand pose using a personalized model of the patient, the Virtual Humanoid Twin (VHT). By solving an ergonomic-based optimal control problem, our approach produces tailored reference motions without prior task-specific data. We validated this method on two motor tasks (reaching and pouring) using data from seven unimpaired participants, with and without an elbow orthosis restricting motion. Analysis of joint trajectories, range of motion, and normalized multi-dimensional Dynamic Time Warping confirmed that VHT-generated motions were more ergonomic than those with the orthosis and closely matched natural movements. The method's rapid generation time can also enable real-time reference motion estimation, parallel to the patient's movements. This innovation simplifies access to reference motions while providing personalization. It creates opportunities for automated motor assessment in neurorehabilitation, enhancing patient recovery tracking through regular evaluations.
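As a rough illustration of the comparison metric mentioned above, the sketch below computes a multi-dimensional Dynamic Time Warping distance between two joint trajectories. The function name, the Euclidean per-sample cost, and the normalization by the sum of the sequence lengths are assumptions; the abstract does not state which normalization the authors use.

```python
import math


def dtw_distance(a, b):
    """Normalized multi-dimensional DTW between two sequences of equal-dimension vectors.

    Each sequence is a list of tuples (for example joint angles at one time step).
    The accumulated cost is divided by len(a) + len(b), an assumed normalization.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean distance between samples
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


if __name__ == "__main__":
    reference = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    slower = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # same path, different timing
    print(dtw_distance(reference, slower))
```

Because DTW warps the time axis, two motions that follow the same joint path at different speeds score as identical, which is why it suits comparing a generated reference with a patient's slower movement.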

Abstract:
This paper discusses the development of a model for an underwater glider robot equipped with multiple buoyancy controllers aimed at environmental surveying of lakes. The primary goal of the model is to perform 2D simulation of actual gliding, which will lead to control system development. This proves to be a challenge, as gliding requires calculation of hydrodynamic forces in the medium, in this case water, which typically involves Computational Fluid Dynamics (CFD). Although CFD is a well-established technique, it is well known to be expensive in terms of both computational resources and time. Instead, an approach that combines CFD with Euler-Lagrange equations is proposed and undertaken. Discussion regarding the proposed underwater glider, derivation of the model, the architecture of the simulation, and preliminary simulation results referenced with actual gliding results are presented.

Abstract:
The recent shift from mass production to mass personalization leads to a production environment in which workpieces have a high degree of geometric variations. The robotic process automation in such high-mix low-volume environments poses significant challenges since predetermined robot programs are not viable anymore. In this letter, we consider the automation of surface processing for category-level objects with significant variations in geometry by operating on point clouds without relying on CAD models. To achieve this, we present a novel multi-feature implicit neural descriptor (MIND) representation which leverages dense correspondence to generalize across diverse objects, enabling a one-shot transfer of process trajectories and associated process knowledge. The quantitative and qualitative evaluation shows that MIND outperforms other state-of-the-art dense correspondence approaches. A real-world application case study of robotic surface processing on geometry-varying basin molds validates the efficacy of the proposed approach.

Abstract:
6-DoF posture estimation is a critical technique in robotics. However, a significant gap exists between the two primary approaches, camera-based methods and laser tracker systems, in terms of cost and performance. To bridge this gap, this work proposes a method to calculate the 6-DoF posture of a manipulator's end-effector using the 2D profile of a custom-designed metallic gauge. The core principle relies on a one-to-one correspondence between the measured profile and the manipulator's posture. A mathematical model was developed to derive a closed-form solution, which is further refined via maximum likelihood estimation to enhance robustness and accuracy. Simulation studies assessed the influence of geometric parameters on estimation accuracy and noise robustness. Real-world experiments demonstrated that the refined solution significantly outperforms the closed-form solution alone. Furthermore, a speed benchmark against a camera-based system highlighted the proposed method's advantage in operating frequency. Finally, the method was successfully integrated into a real-time position control task for a 6-axis industrial manipulator, verifying its practical applicability in real-time robotic control.

Abstract:
Elastic lightweight manipulators offer multiple benefits but suffer from increased structural flexibility, making them susceptible to vibrations and thus requiring dedicated control concepts for vibration suppression. Based on a lumped element model formulation, a method called elastic structure preserving (ESP) control is used for additional damping injection, while using standard PD motor position control. The control method is applied for the first time to a flexible link robot by combining it with a link-side IMU-based observer. It is demonstrated in an industrial context using a standard controller setup, enabling straightforward implementation on existing industrial robots. The novel ESP method is further compared to a flatness-based control approach and to standard PD motor control. Particular aspects of controller tuning are discussed. Both theoretical analysis and experimental evaluations are conducted to address trajectory tracking behavior, disturbance rejection, and robustness to model parameter uncertainties. Results based on end effector accelerations show that ESP achieves superior vibration damping, demonstrating its effectiveness for industrial lightweight robots.

Abstract:
The autonomous cooperation of micro-Unmanned Aerial Vehicle (UAV) swarms remains a key challenge. Existing swarm relative positioning and control methods demand high sensing, computing, and communication resources and rely on external equipment like GPS and ground stations. To address these issues, this paper proposes an all-onboard and external-aiding-free swarm relative measurement, positioning and control framework. The framework utilizes an onboard Vision-Optoelectronic-Ultra-Wideband (UWB) coupled measurement system to acquire inter-UAV relative distance and direction. Subsequently, the swarm's relative positions are solved via a distributed graph optimization (DGO) approach. Based on the solved relative positions, swarm cooperative control is implemented through a distributed Voronoi diagram approach. Experimental results demonstrate that the proposed method enables 150 g micro-UAVs to achieve nearly 100-meter autonomous outdoor formation flight and collaborative tracking of dynamic targets, with swarm relative localization accuracy reaching approximately 0.262 m. This work pioneers fully autonomous measurement and control for 100-gram scale UAV swarms without external infrastructure, significantly advancing autonomy and enabling swarm intelligence emergence.

Abstract:
Retinal Surgery Robotics is a rapidly emerging field that offers enhanced precision by overcoming human tremors. A key trend of these robotic designs is toward more compact and lightweight structures for improved positioning accuracy and precise force delivery. However, this compactness sacrifices the robot's working volume, making it difficult for ophthalmic surgeons to intuitively assess if retinal targets are accessible by the surgical tool tip. This paper proposes a methodology for visualizing the actual accessible area in the microscopic view to provide surgeons with an intuitive visual guide of the tool's reach, reducing uncertainty and streamlining extraocular robotic maneuvers. We validated this method on a commercial phantom with a surgical robot system, achieving <=1.0 deg error for 83.3% of tested points across four retinal subareas and demonstrating its clinical potential.

Abstract:
With the increasing integration of cyber-physical systems (CPS) into critical applications, ensuring their resilience against cyberattacks is paramount. A particularly concerning threat is the vulnerability of CPS to deceptive attacks that degrade system performance while remaining undetected. This paper investigates perfectly undetectable false data injection attacks (FDIAs) targeting the trajectory tracking control of a non-holonomic mobile robot. The proposed attack method utilizes affine transformations of intercepted signals, exploiting weaknesses inherent in the partially linear dynamic properties and symmetry of the nonlinear plant. The feasibility and potential impact of these attacks are validated through experiments using a Turtlebot 3 platform, highlighting the urgent need for sophisticated detection mechanisms and resilient control strategies to safeguard CPS against such threats. Furthermore, a novel approach for detection of these attacks called the state monitoring signature function (SMSF) is introduced. An example SMSF, a carefully designed function resilient to FDIA, is shown to be able to detect the presence of an FDIA through signatures based on system states.

Abstract:
Continuum robots employed in flexible gastrointestinal endoscopy require the capability of transitioning between the flexible and the rigid states. Phase-change-material-based variable stiffness (VS) methods exhibit a significant stiffness change ratio but are typically time-consuming. Besides, these materials are commonly fabricated into simplistic cylindrical or tubular structures and subsequently integrated with continuum joints, overlooking the impact of the intrinsic structural characteristics of the VS module on stiffness modulation and bending performance. To maintain the combination of motion flexibility and operation stability, this work presents a stiffness-tunable sheath inspired by a multi-layer wave spring structure, which is fabricated utilizing thermoplastic material. A water-based active heating/cooling method is employed, wherein the circulation of hot/cold water through silicone tubes helically wound around the exterior of the VS sheath enables rapid thermal regulation. Structural parameter selection of the VS sheath based on the orthogonal design method has been performed to enhance its stiffness in a rigid state and reduce the maximum stress during 90° flexion in a flexible state. Experimental results indicate that the proposed VS sheath can achieve a stiffness change ratio of up to 16.5 times within 30 s. After being integrated with a continuum joint, the sheath demonstrates an average positioning error of 1.48 mm within a ±90° bending range in a flexible state, without structural compromise or interference with the continuum joint's bending. In the rigid state, the proposed design can resist a 400 g external payload with a deflection of less than 6 mm. The efficacy of this design has been validated through ex-vivo experiments conducted on a porcine stom

Abstract:
Predicting human trajectories is crucial for social robot navigation in crowded environments. While most existing approaches treat humans as point masses, we present a study on multi-agent trajectory prediction that leverages different human skeletal features for improved forecast accuracy. In particular, we systematically evaluate the predictive utility of 2D and 3D skeletal keypoints and derived biomechanical cues as additional inputs. Through a comprehensive study on the JRDB dataset and another new dataset for social navigation with 360-degree panoramic videos, we find that focusing on lower-body 3D keypoints yields a 13% reduction in Average Displacement Error and augmenting 3D keypoint inputs with corresponding biomechanical cues provides a further 1-4% improvement. Notably, the performance gain persists when using 2D keypoint inputs extracted from equirectangular panoramic images, indicating that monocular surround vision can capture informative cues for motion forecasting. Our finding that robots can forecast human movement efficiently by watching their legs provides actionable insights for designing sensing capabilities for social robot navigation.
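For readers unfamiliar with the metric, Average Displacement Error is commonly defined as the mean Euclidean distance between predicted and ground-truth positions over the prediction horizon. The sketch below follows that standard definition; the function name and the sample trajectories are illustrative and do not come from the paper.

```python
import math


def average_displacement_error(pred, truth):
    """Mean Euclidean distance between predicted and true positions over all time steps."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


if __name__ == "__main__":
    truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # hypothetical ground-truth path (m)
    pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]   # hypothetical prediction (m)
    print(average_displacement_error(pred, truth))
```

A relative reduction such as the 13% reported above would then be computed as (ADE_baseline - ADE_new) / ADE_baseline on the same test trajectories.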

Abstract:
Multi-agent reinforcement learning (MARL) for cyber-physical vehicle systems usually requires a significantly long training time due to their inherent complexity. Furthermore, deploying the trained policies in the real world demands a feature-rich environment along with multiple physical embodied agents, which may not be feasible due to monetary, physical, energy, or safety constraints. This work seeks to address these pain points by presenting a mixed-reality (MR) digital twin (DT) framework capable of: (i) boosting training speeds by selectively scaling parallelized simulation workloads on-demand, and (ii) immersing the MARL policies across hybrid simulation-to-reality (sim2real) experiments. The viability and performance of the proposed framework are highlighted through two representative use cases, which cover cooperative as well as competitive classes of MARL problems. We study the effect of: (i) agent and environment parallelization on training time, and (ii) systematic domain randomization on zero-shot sim2real transfer, across both case studies. Results indicate up to 76.3% reduction in training time with the proposed parallelization scheme and sim2real gap as low as 2.9% using the proposed deployment method.

Abstract:
Since the late 19th century when the first walking toys were developed, it has been known that mechanical stoppers at the hip joint are crucial for generating stable passive dynamic walking. Recent research on passive-dynamic and limit-cycle walkers has also confirmed that mechanical stoppers at the hip and knee joints are effective for generating stable walking motion, but theoretical research on how this mechanical constraint enhances the overall gait stability remains insufficient. This paper introduces a planar X-shaped walker equipped with mechanical stoppers at the hip joint, and investigates the effect of the mechanical constraint on the stability of the wheel gait generated by constant torque drive on a downslope. By simply falling forward while using the stoppers to constrain itself to the target impact posture, the robot can generate a highly stable wheel gait. We divide the motion of one step into four phases, derive approximate analytical solutions for the state error transition function matrix in each phase using a linearized model, and analyze the increase or decrease in the state error norm using metrics such as its maximum singular value. Numerical simulations demonstrate that while two phases are unstable in terms of the increase in the state error norm, the remaining two phases are stable, resulting in overall asymptotic stability.

Abstract:
Recent research has increasingly focused on delivering drug-carrying magnetic particles to diseased areas using electromagnetic actuation (EMA) systems. Particularly, in these systems, creating a field-free point (FFP) and using it to steer magnetic particles in the desired direction has attracted significant attention. However, most previous studies use closed-type EMA systems, which, due to their structural characteristics, are difficult to integrate into actual surgical environments and to operate in conjunction with external imaging systems like X-ray. This study addresses these limitations by using an open-type EMA system, which is better suited for surgical integration. However, an open-type EMA system faces issues such as a significant decrease in magnetic force and an anisotropic magnetic field as the distance from the coils increases in the region of interest (ROI). To overcome these challenges, we optimized the open-type EMA system and proposed a suitable FFP generation method. Furthermore, we presented a targeting algorithm for steering a magnetic particle in blood vessels using anisotropic FFP. This proposed open-type EMA system and the control strategy using FFP were validated through multiphysics simulations and phantom experiments, proving the viability of magnetic particle targeting.

Abstract:
Current platforms for Transoral Robotic Surgery (TORS) are suboptimal for confined oropharyngeal workspaces, particularly in pediatric applications. To address this, we present the design and characterization of a novel cable-driven robotic instrument providing 6 degrees of freedom (DoF): shaft roll, elbow pitch/yaw, wrist pitch/yaw, and grip, for dexterous manipulation. The system integrates a miniaturized 4-mm proximal elbow with a previously developed 3-mm distal wrist. This architecture is enabled by a novel sandwich link architecture that facilitates high-density cable routing through the joints' center plane, providing a compact, rigid alternative to traditional pin-jointed designs. Experimental validation identified significant kinematic coupling between in-plane joint pairs. An empirical real-time compensation strategy reduced this coupling rate by 82.9% for pitch and 80.8% for yaw. Workspace analysis confirmed the proximal elbow enables high distal dexterity at regions critical for complex surgical tasks. Integration with a Franka Research 3 manipulator enabled fully coordinated macro-micro teleoperation, providing a pilot demonstration for TORS workflows. This represents the first demonstration of a 4-mm elbow and 3-mm wrist mechanism for TORS, providing the hardware foundation necessary for future evaluation of dexterity-intensive tasks, including suturing and dissection.

Abstract:
Degraded steering performance and increased energy consumption present significant barriers to deploying wheeled mobile robots (WMRs) in granular media such as sand and lunar regolith. This study presents and experimentally validates a systematic optimization framework based on a Dynamic Resistive Force Model (DRFM). By integrating the DRFM with a four-wheel vehicle dynamics model featuring front-wheel steering, this approach accurately captures wheel-terrain interactions in granular materials. Subject to a prescribed trajectory root-mean-square error constraint, the framework minimizes energy consumption per unit distance while determining optimal front-wheel steering angles and wheel-speed ratios. Experiments demonstrate that the active steering strategy reduces energy consumption per unit distance by 12.3% while maintaining trajectory root-mean-square error within 6.5%. The proposed method provides a generalizable design paradigm for motion-control optimization on granular terrain, establishing the foundation for long-duration, energy-efficient rover operations on such terrain.

Abstract:
Deformable objects (DOs) are prevalent in everyday environments and represent important targets for robotic manipulation. However, their high degrees of freedom and complex nonlinear deformations make them more challenging to model and control than rigid objects when relying on traditional analytical approaches. To address this, we propose a data-driven method to model the dynamics of deformable objects. Our method utilizes time-series data to predict future states without relying on complex dynamics. We employ model predictive control (MPC) for robot manipulation and improve its performance through online updates of the data-driven model. To handle cables with varying configurations, interpolation is applied to align model input structures. In this study, we focus on manipulating deformable linear objects (DLOs) with different mechanical properties and configurations using a dual-arm robotic system, both in simulation and in real-world environments.

Abstract:
Tactile Walking Surface Indicators (TWSIs) are safety-critical landmarks that blind and low-vision (BLV) pedestrians use to locate crossings and hazard zones. From our observation sessions with BLV guide dog handlers, trainers, and an O&M specialist, we confirmed the critical importance of reliable and accurate TWSI segmentation for navigation assistance of BLV individuals. Achieving such reliability requires large-scale annotated data. However, TWSIs are severely underrepresented in existing urban perception datasets, and even existing dedicated paving datasets are limited: they lack robot-relevant viewpoints (e.g., egocentric or top-down) and are geographically biased toward East Asian directional bars - raised parallel strips used for continuous guidance along sidewalks. This narrow focus overlooks truncated domes - rows of round bumps used primarily in North America and Europe as detectable warnings at curbs, crossings, and platform edges. As a result, models trained only on bar-centric data struggle to generalize to dome-based warnings, leading to missed detections and false stops in safety-critical environments. We introduce GuideTWSI, the largest and most diverse TWSI dataset, which combines a photorealistic synthetic dataset, carefully curated open-source tactile data, and quadruped real-world data collected and annotated by the authors. Notably, we developed an Unreal Engine-based synthetic data generation pipeline to obtain segmented, labeled data across diverse materials, lighting conditions, weather, and robot-relevant viewpoints. Extensive evaluations show that synthetic augmentation improves truncated dome segmentation across diverse state-of-the-art models, with gains of up to +29 mIoU points, and enhances cross-domain robustness. Moreover, real-robot experiments demonstrate accurate stops at truncated domes, with high repeatability and stop success rates (96.15%). The GuideTWSI dataset, model weights, and code will be publicly released.

Abstract:
Tendon-driven Continuum Robots (TDCRs) are widely used in confined operating systems due to their thin shape, flexibility, and compliance, making them easily deployable in narrow or contact-rich environments. However, real-time safe control near obstacles remains challenging. Computationally expensive dynamic models, such as the Cosserat rod model, are impractical for real-time control. Conventional model predictive control (MPC) methods require linearization of the dynamics, limiting their applicability to the complex nonlinear behavior of TDCRs, including hysteresis. In this paper, we adopt the Piecewise Constant Curvature (PCC) model, which assumes constant curvature for each link. While computationally cheap, this approximation contains modeling errors that, combined with mechanical friction, backlash, and misalignment at the rolling joints, result in unpredictable hysteresis. We also propose CVaR-MPPI (Conditional Value-at-Risk Model Predictive Path Integral), a controller that combines sampling-based planning with probabilistic safety under environmental uncertainty, improving both worst-case risk management and sampling efficiency. In simulation with 100 iterations, CVaR-MPPI improves the success rate from 80% to 85% and the mean safety clearance by 129% compared to standard MPPI while maintaining end-effector tracking error, as detailed in the simulation results. The controller runs at 50 Hz with 8192 samples, demonstrating real-time feasibility.
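
As background for the risk measure named in this abstract, Conditional Value-at-Risk at level alpha is commonly estimated from samples as the mean of the worst (1 - alpha) fraction of costs. The sketch below illustrates only that empirical estimate over sampled rollout costs; it is not the authors' controller, and the function name, the tail-size rounding, and the sample values are assumptions made for illustration.

```python
import math


def empirical_cvar(costs, alpha):
    """Mean of the worst (1 - alpha) fraction of sampled costs (higher cost = worse)."""
    if not costs or not 0.0 <= alpha < 1.0:
        raise ValueError("need at least one cost and 0 <= alpha < 1")
    ordered = sorted(costs, reverse=True)
    # Round before ceil so that floating-point noise does not inflate the tail size.
    tail = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:tail]) / tail


# Hypothetical costs of ten sampled rollouts.
rollout_costs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
tail_risk = empirical_cvar(rollout_costs, alpha=0.8)  # mean of the two worst costs
mean_cost = empirical_cvar(rollout_costs, alpha=0.0)  # reduces to the plain mean
print(tail_risk, mean_cost)
```

Penalizing the tail mean rather than the plain mean is one way a sampling-based planner can weight rare, high-cost outcomes such as near-collisions more heavily.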

Abstract:
Inspection of hydraulic structures is crucial for ensuring the reliability and safety of infrastructure. Although underwater manipulators are essential tools, existing systems often lack sufficient compliance and safe interaction capabilities. This study develops a novel underwater manipulator system with a robust admittance control framework designed specifically for safe contact inspection tasks. The manipulator integrates a 6-axis force/torque sensor for contact force measurement and an ultrasonic detector for structural inspection. An underwater force estimation algorithm is implemented to ensure accurate force measurement under varying flow conditions. The proposed robust admittance control strategy comprises an inner-loop position controller, enhanced by an unknown system dynamics estimator and super-twisting sliding mode control, to counteract hydrodynamic disturbances and improve trajectory tracking accuracy. An outer-loop variable admittance controller, incorporating a variable damping mechanism and adaptive feedback compensation, ensures compliant interactions and precise force control with minimal overshoot. Extensive experiments, including force measurement, motion and contact force control, and underwater thickness measurement, demonstrate the system's excellent performance, validating its effectiveness for hydraulic structure inspection tasks.
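
To make the outer-loop idea concrete, a textbook fixed-parameter admittance law maps a measured contact force f to a compliant position offset x through M x'' + D x' + K x = f. The sketch below integrates that law for one axis; it deliberately omits the variable damping, adaptive compensation, and inner-loop controller described in the abstract, and all parameter values and names are illustrative assumptions.

```python
def simulate_admittance(force, mass, damping, stiffness, dt, steps):
    """Integrate M*x'' + D*x' + K*x = f with semi-implicit Euler; return the offsets."""
    x, v = 0.0, 0.0
    history = []
    for _ in range(steps):
        a = (force - damping * v - stiffness * x) / mass
        v += a * dt  # update velocity first (semi-implicit Euler)
        x += v * dt  # then position, using the updated velocity
        history.append(x)
    return history


# Hypothetical, critically damped parameters: D = 2 * sqrt(M * K).
offsets = simulate_admittance(force=5.0, mass=1.0, damping=20.0,
                              stiffness=100.0, dt=0.001, steps=5000)
print(f"steady-state offset = {offsets[-1]:.4f} m")
```

Under a constant force the offset settles at f / K, which is the position reference an inner-loop position controller would then track.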

Abstract:
We present a Lie group implicit formulation for kinematically redundant parallel manipulators that yields left-trivialized extended Jacobians for the extended task variable x = (g, ρ) ∈ SE(3) × R. On top of this model we design a gradient-based redundancy flow on the redundancy manifold that empirically maintains a positive manipulability margin along prescribed SE(3) trajectories. The framework uses right-multiplicative state updates, remains compatible with automatic differentiation, and avoids mechanism-specific analytic Jacobians; it works with either direct inverse kinematics or a numeric solver. A specialization to SO(2)^3 provides computation-friendly first- and second-order steps. We validate the approach on two representative mechanisms: a (6+3)-degree-of-freedom (DoF) Stewart platform and a Spherical-Revolute platform. Across dense-coverage orientation trajectories and interactive gamepad commands, the extended Jacobian remained well conditioned while the redundancy planner ran at approximately 2 kHz in software-in-the-loop on a laptop-class CPU. The method integrates cleanly with existing kinematic stacks and is suitable for real-time deployment.

Abstract:
Vision-Language-Action (VLA) models have recently been proposed as a pathway toward generalist robotic policies capable of interpreting natural language and visual inputs to generate manipulation actions. However, their effectiveness and efficiency on structured, long-horizon manipulation tasks remain unclear. In this work, we present a head-to-head empirical comparison between a fine-tuned open-weight VLA model (π0) and a neuro-symbolic architecture that combines PDDL-based symbolic planning with learned low-level control. We evaluate both approaches on structured variants of the Towers of Hanoi manipulation task in simulation while measuring both task performance and energy consumption during training and execution. On the 3-block task, the neuro-symbolic model achieves 95% success compared to 34% for the best-performing VLA. The neuro-symbolic model also generalizes to an unseen 4-block variant (78% success), whereas both VLAs fail to complete the task. During training, VLA fine-tuning consumes nearly two orders of magnitude more energy than the neuro-symbolic approach. These results highlight important trade-offs between end-to-end foundation-model approaches and structured reasoning architectures for long-horizon robotic manipulation, emphasizing the role of explicit symbolic structure in improving reliability, data efficiency, and energy efficiency. Code and models are available at https://price-is-not-right.github.io

Abstract:
Motivated by clinical needs for precise navigation and safety, low-latency and high-precision localization has become a key enabler for capsule robots. A unified magnetic 5-DoF high-precision localization framework for capsule robots is presented. Building on layered multi-source magnetic field modeling, online external-field compensation, and global optimization-based inversion, the framework achieves real-time decoupling between control and localization fields, while providing a unified interface compatible with diverse hardware configurations and operation modes. On this basis, the PMMN-DBO algorithm is proposed, delivering high-accuracy and efficient localization in single- and multi-capsule scenarios and supporting synchronized control-localization. Experimentally, for single-capsule localization, mean errors are 0.59 mm/0.69° with a 20.2 ms computation time, surpassing conventional methods. In multi-capsule settings, localization errors remain low with stable convergence: mean errors are 1.28 mm/1.13° for two capsules and 2.56 mm/2.83° for three capsules. Under synchronized control-localization, trajectory-tracking errors reach 1.33 mm/1.85°. Overall, the proposed framework is unified, high-precision, efficient, and flexible, laying a general and reusable foundation for clinical-grade precise navigation and closed-loop magnetic control.

Abstract:
This paper presents a framework for aerial manipulation of an extensible cable that combines a high-fidelity model based on partial differential equations (PDEs) with a reduced-order representation suitable for real-time control. The PDEs are discretized using a finite-difference method, and proper orthogonal decomposition is employed to extract a reduced-order model (ROM) that retains the dominant deformation modes while significantly reducing computational complexity. Based on this ROM, a nonlinear model predictive control scheme is formulated, capable of stabilizing cable oscillations and handling hybrid transitions such as payload attachment and detachment. Simulation results confirm the stability, efficiency, and robustness of the ROM, as well as the effectiveness of the controller in regulating cable dynamics under a range of operating conditions. Additional simulations illustrate the application of the ROM for trajectory planning in constrained environments, demonstrating the versatility of the proposed approach. Overall, the framework enables real-time, dynamics-aware control of unmanned aerial vehicles (UAVs) carrying suspended flexible cables.

Abstract:
Young children with motor disabilities face barriers and delays to learning motor skills such as walking. Pediatric body-weight support harness systems (BWSHes) are a newer technology for helping young children to practice supported motor skills. Incorporating an assistive robot to mediate BWSH interventions can support further child motion and engagement, but almost no work to date has studied autonomous robot-mediated BWSH use. We conducted a six-month-long single-case study series with two participants to evaluate the effectiveness of an autonomous assistive robot in motivating the children to move and stay engaged while in the BWSH. We collected and analyzed objective movement data and self-reported parent survey data to determine how much the child moved and stayed engaged during sessions. Our results showed that both children displayed more movement while the assistive robot was active (relative to prior no-robot periods). Parents also rated their children as more engaged while the assistive robot was present. An autonomous assistive robot may provide motivation for a child to move and stay engaged while using a pediatric rehabilitation aid such as a BWSH. The products of this work can benefit roboticists who work with children with disabilities and researchers who use pediatric rehabilitation technologies.

Abstract:
This letter proposes a method to systematically design a time-independent controller for a desired orbit in phase space. A time-independent controller is essential in robots that physically interact with humans or the environment. An approach to designing such a controller is based on the virtual dynamics of the desired orbit (VDDO), in which the desired orbit is assumed as a constraint. However, depending on the desired orbit, zero-division happens, and then the computation of control input breaks down. To address this issue, a zero-division-avoidable smoother, which functions as a low-pass filter and maintains computability even when the computation includes zero-division, is applied to compute the controller input based on the VDDO. This application establishes a systematic design of a VDDO-based controller that avoids zero-division. We investigated the performance of the proposed controller via experiments and simulations for three given orbits: a unit circle, super-ellipse, and spiric section. Results showed that the proposed time-independent controller can avoid zero-division while approaching the desired orbits. Furthermore, an experiment in which a human forces a robot to stop showed that the robot could restart from an unfavorable state and approach the desired orbits once more.

Abstract:
Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority events comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: extreme class imbalance where minority events are overwhelmed by true triggers, and asymmetric label noise where mislabeled majority samples suppress minority class learning. To overcome these challenges, we propose two key innovations: (1) Specific data augmentation that synthesizes realistic samples by manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents; (2) Probe-guided noise suppression using stable hardness estimation to clean mislabeled true trigger samples. We deploy our model as a practical annotation system with full-stack architecture, efficiently identifying critical AEB events from thousands of daily samples. Production results demonstrate an 80% improvement in delayed/false trigger recall and a 50% reduction in manual workload. Beyond immediate gains, the system enables continuous self-improvement through accumulated high-quality annotations, establishing a necessary data foundation for on-vehicle AEB system optimization.

Abstract:
Embodied Referring Expression Grounding (REVERIE) is a Vision-and-Language Navigation (VLN) task that better reflects real-world human instructions. Unlike conventional VLN, REVERIE is more challenging as agents must navigate in unseen environments and ground remote objects described by short, high-level commands. This requires agents not only to plan a route without detailed step-by-step guidance but also to accurately localize the target object at the destination. Existing VLN agents mainly emphasize navigation performance while overlooking object grounding success, leading to a significant performance gap. We introduce a model-agnostic interaction framework with two auxiliary agents, Where-I-Am (WIA) and Where-to-Go (W2G). Specifically, WIA predicts the current room type from environmental observations, while W2G infers the target room type from high-level instructions. Our framework is plug-and-play and can be integrated with various VLN models. On the REVERIE benchmark, it improves navigation success rate (SR) by 7.78% and remote grounding success (RGS) by 5.48% over the baselines, demonstrating the effectiveness and generality of our design. Furthermore, in challenging unseen test environments, our framework achieves competitive results on the REVERIE dataset, outperforming the previous state-of-the-art VLN agent (without additional training data) with a 2.27% gain in RGS.

Abstract:
Human-robot co-carrying tasks show potential in both industrial and everyday applications by leveraging the strengths of both parties. However, such collaborative tasks pose numerous challenges due to varied human intentions under time-varying workspaces, leading to human-robot conflicts. In this paper, we develop a cooperation control framework for human-robot co-carrying tasks, built from a reference generator and a low-level controller, that aims to achieve safe interaction and synchronized human-robot movement. First, the human motion predictions are corrected in the event of prediction errors based on the conflicts measured by the interaction forces through admittance control, thereby mitigating conflict levels. The low-level controller, using an energy-compensation passive velocity field control approach, encodes the corrected motion to produce control torques for the robot. In this manner, the closed-loop robotic system is passive when the energy level exceeds the predetermined threshold, and otherwise. Furthermore, the passivity, stability, energy-compensation rate, and power flow regulation are analyzed from theoretical viewpoints. Human-in-the-loop experiments involving 18 participants have demonstrated that the proposed method significantly enhances task performance and reduces human workload, as evidenced by both objective metrics and subjective evaluations.

Abstract:
In this paper, we developed an autonomous decentralized control method that incorporates phase-difference adjustment based on a sigmoid function, enabling the design of both increases and decreases in discrepancy. The method was applied to a peristaltic mixing pump capable of mixing and transporting solid-liquid multiphase fluids. This study aims to realize a soft robotics system that autonomously switches motion modes according to changes in the physical properties of the transported material, thereby integratively mimicking both the motility and motion-switching functions of the intestine. Conventional autonomous decentralized control methods have been applied to the locomotion of amoeba-type and snake-type robots. However, when such control laws are applied to pumps, it is difficult to achieve appropriate motion switching in environments where the contents harden due to mixing. In this paper, we employed a sigmoid function that allows bidirectional control of discrepancy and constructed a new control law based on target phase-difference adjustment without feedback. The control law was implemented in a four-unit pump, and we confirmed that the desired motion patterns could be reproduced according to the preset target phase differences. As a result, the phase differences between all units converged to the target values within approximately 10-30 s after actuation began, producing the intended motion patterns. Furthermore, polyvinyl alcohol solution and borax water were used as contents whose fluidity decreases during mixing. We verified that autonomous motion switching occurred as the discrepancy increased. The results showed that, in units containing hardened material, a conveying motion with a phase difference of π/3 was generated, whereas in units with residual

Abstract:
The control of pneumatic soft robots is challenging due to nonlinearities arising from many factors, including pneumatic system components and material properties of the soft actuator. Manual methods for PID controller tuning are inadequate for the nonlinear and time-variant dynamics present in soft robotics. Affordable pneumatic components such as on/off valves cause discontinuities in flow rate, introducing nonlinearities and oscillatory fluctuations into the system. This study proposes a dual-loop control system: one for PID and Fractional-Order PID (FOPID) control of a solenoid valve that feeds air into the actuator, and another for PID control of the pump upstream of the valve. The PID and FOPID parameters are optimized using evolutionary algorithms: Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Simulated Annealing (SA). Simulations and real-world experiments are conducted to validate the optimized parameters. Our results demonstrate that the dual-loop hardware configuration reduces fluctuations from the valves compared with a single-loop control scheme. The experimental statistical analysis confirms that FOPID achieves the most significant improvements in rise time (PSO) and peak time (GA, PSO), while PID performs better for overshoot (GA, PSO). These findings highlight the importance of selecting an appropriate optimization algorithm based on the specific control objective, as FOPID does not outperform PID in every metric across all methods.

Abstract:
We develop an approach to detect objects in forward-looking sonar (FLS) images using corresponding optical images and without the need for expert manual labeling of sonar images. Sonar sensing is more robust to adverse underwater environmental conditions than optical sensing, but the scarcity of labeled sonar data leads to decreased performance of methods which rely on an abundance of training data. We aim to transfer insights from data-rich applications such as object detection in optical imaging to the data-scarce area of object detection in sonar images. Our approach involves recording contemporaneous images from commercially available sensors viable for use aboard unmanned underwater vehicles. We collect new optical and sonar data in a shallow, clear-water environment and employ existing object detection techniques for optical images. We leverage the commonality of the sensors' fields of view and our algorithmic processing of the sonar image to transfer knowledge of object bounding boxes to sonar images to create a dataset. Through this transfer, we enable training of a model that detects objects in unseen sonar images and does not require optical images as input at test time.

Abstract:
High-precision wafer metrology poses significant cost and throughput challenges in modern semiconductor manufacturing, where frequent process changes and recipe variations demand highly adaptive and scalable solutions. In this paper, we present a Generative-FewShot-Active Virtual Metrology (GFA-VM) framework that unifies large-scale generative modeling, few-shot fine-tuning, and uncertainty-driven active sampling into a single, data-centric system. A foundational generative model, built on a hybrid architecture of Transformer networks and Variational Autoencoders (VAEs), learns diverse sensor characteristics in an offline stage without relying on extensive labeled data. During online inference, the model produces both wafer quality predictions and predictive uncertainties; samples exceeding a dynamic uncertainty threshold are selected for physical measurement and few-shot model recalibration. This selective sampling both reduces measurement costs and adapts rapidly to new process conditions (e.g., novel recipes or equipment upgrades), requiring only a handful of freshly labeled wafers. The paper further addresses the long-term stability of the system through a self-updating mechanism that adjusts the uncertainty threshold when distributional shifts occur. Empirical evaluations confirm that our GFA-VM approach achieves state-of-the-art accuracy while significantly reducing metrology overhead compared to conventional virtual metrology methods.
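
The selective-sampling step described above can be illustrated with a small sketch: wafers whose predictive uncertainty exceeds a threshold are routed to physical measurement. The abstract does not specify how the dynamic threshold is computed, so the mean-plus-k-standard-deviations rule below is purely an assumption, as are all names and numbers.

```python
import statistics


def dynamic_threshold(recent_sigmas, k=2.0):
    """Hypothetical rule: mean + k * population std of recent predictive uncertainties."""
    return statistics.fmean(recent_sigmas) + k * statistics.pstdev(recent_sigmas)


def select_for_measurement(predictions, threshold):
    """Return ids of wafers whose predictive uncertainty exceeds the threshold."""
    return [wafer_id for wafer_id, (_, sigma) in predictions.items()
            if sigma > threshold]


# Hypothetical history of uncertainties under stable process conditions.
history = [0.10, 0.12, 0.11, 0.13, 0.14]
threshold = dynamic_threshold(history)

# Hypothetical (predicted quality, predictive uncertainty) pairs per wafer.
batch = {"w1": (50.2, 0.11), "w2": (49.7, 0.30), "w3": (50.9, 0.15)}
to_measure = select_for_measurement(batch, threshold)
print(f"threshold = {threshold:.4f}, measure: {to_measure}")
```

The measured wafers would then supply the freshly labeled samples for few-shot recalibration, while the remaining wafers rely on the virtual prediction alone.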

Abstract:
Modern robotics applications require an inverse kinematics (IK) solver that is fast, robust and consistent, and that provides all possible solutions. Currently, the Franka robot arm is the most widely used manipulator in robotics research. With 7 DOFs, the IK of this robot is not only complex due to its redundancy, but also due to the link offsets at the wrist and elbow. Due to this complexity, none of the Franka IK solvers available in the literature provide satisfactory results when used in real-world applications. Therefore, in this paper we introduce GeoFIK (Geometric Franka IK), an analytical IK solver that allows the use of different joint variables to resolve the redundancy. The approach uses screw theory to describe the entire geometry of the robot, computing the Jacobian matrix prior to the joint angles. All singularities are handled. As an example of how the geometric elements obtained by the IK can be exploited, a solver with the swivel angle as the free variable is provided. Several experiments are carried out to validate the speed, robustness, and reliability of GeoFIK against three state-of-the-art solvers.