ICRA2025

Abstract:
Novel-view synthesis approaches play a critical role in vast scene reconstruction. However, these methods rely heavily on dense image inputs and prolonged training times, making them unsuitable where computational resources are limited. Additionally, few-shot methods often struggle with poor reconstruction quality in vast environments. This paper presents DGTR, a novel distributed framework for efficient Gaussian reconstruction for sparse-view vast scenes. Our approach divides the scene into regions, processed independently by drones with sparse image inputs. Using a feed-forward Gaussian model, we predict high-quality Gaussian primitives, followed by a global alignment algorithm to ensure geometric consistency. Depth priors is incorporated to further enhance training, while a distillation-based model aggregation mechanism enables efficient reconstruction. Our method achieves high-quality large-scale scene reconstruction and novel-view synthesis in significantly reduced training times, outperforming existing approaches in both speed and scalability. We demonstrate the effectiveness of our framework on vast aerial scenes, achieving high-quality results within minutes. Code will released on our project page https://3d-aigc.github.io/DGTR.

Abstract:
Collaborative perception in multi-agent system enhances overall perceptual capabilities by facilitating the exchange of complementary information among agents. Current mainstream collaborative perception methods rely on discretized feature maps to conduct fusion, which however, lacks flexibility in extracting and transmitting the informative features and can hardly focus on the informative features during fusion. To address these problems, this paper proposes a novel Anchor-Centric paradigm for Collaborative Object detection (ACCO). It avoids grid precision issues and allows more flexible and efficient anchor-centric communication and fusion. ACCO is composed by three main components: (1) Anchor featuring block (AFB) that targets to generate anchor proposals and projects prepared anchor queries to image features. (2) Anchor confidence generator (ACG) is designed to minimize communication by selecting only the features in the confident anchors to transmit. (3) A local-global fusion module, in which local fusion is anchor alignment-based fusion (LAAF) and global fusion is conducted by spatial-aware cross-attention (SACA). LAAF and SACA run in multilayers, so agents conduct anchor-centric fusion iteratively to adjust the anchor proposals. Comprehensive experiments are conducted to evaluate ACCO on OPV2V and Dair-V2x datasets, which demonstrate ACCO's superiority in reducing the communication volume, and in improving the perception range and detection performances. Code can be found at: https://github.com/sidiangongyuan/ACCO.

Abstract:
Event cameras have the potential to capture continuous motion information over time and space, making them well-suited for optical flow estimation. However, most existing learning-based methods for event-based optical flow adopt frame-based techniques, ignoring the spatio-temporal characteristics of events. Additionally, these methods assume linear motion between consecutive events within the loss time window, which increases optical flow errors in long-time sequences. In this work, we observe that rich spatio-temporal information and accurate nonlinear motion between events are crucial for event-based optical flow estimation. Therefore, we propose E-NMSTFlow, a novel unsupervised event-based optical flow network focusing on long-time sequences. We propose a Spatio-Temporal Motion Feature Aware (STMFA) module and an Adaptive Motion Feature Enhancement (AMFE) module, both of which utilize rich spatio-temporal information to learn spatio-temporal data associations. Meanwhile, we propose a nonlinear motion compensation loss that utilizes the accurate nonlinear motion between events to improve the unsupervised learning of our network. Extensive experiments demonstrate the effectiveness and superiority of our method. Remarkably, our method ranks first among unsupervised learning methods on the MVSEC and DSEC-Flow datasets.

Abstract:
Long-horizon contact-rich manipulation has long been a challenging problem, as it requires reasoning over both discrete contact modes and continuous object motion. We introduce Implicit Contact Diffuser (ICD), a diffusion-based model that generates a sequence of neural descriptors that specify a series of contact relationships between the object and the environment. This sequence is then used as guidance for an MPC method to accomplish a given task. The key advantage of this approach is that the latent descriptors provide more taskrelevant guidance to MPC, helping to avoid local minima for contact-rich manipulation tasks. Our experiments demonstrate that ICD outperforms baselines on complex, long-horizon, contact-rich manipulation tasks, such as cable routing and notebook folding. Additionally, our experiments also indicate that ICD can generalize a target contact relationship to a different environment. More visualizations can be found on our website https://implicit-contact-diffuser.github.io.

Abstract:
Learning-based methods, such as imitation learning (IL) and reinforcement learning (RL), can produce excel control policies over challenging agile robot tasks, such as sports robot. However, no existing work has harmonized learning-based policy with model-based methods to reduce training complexity and ensure the safety and stability for agile badminton robot control. In this paper, we introduce Hamlet, a novel hybrid control system for agile badminton robots. Specifically, we propose a model-based strategy for chassis locomotion which provides a base for arm policy. We introduce a physics-informed “IL+RL” training framework for learning-based arm policy. In this train framework, a modelbased strategy with privileged information is used to guide arm policy training during both IL and RL phases. In addition, we train the critic model during IL phase to alleviate the performance drop issue when transitioning from IL to RL. We present results on our self-engineered badminton robot, achieving 94.5% success rate against the serving machine and \mathbf9 0. 7 % success rate against human players. Our system can be easily generalized to other agile mobile manipulation tasks e.g., agile catching, table tennis. A video demonstrating our system can be viewed at https://youtu.be/8-ixKAD18Mk.

Abstract:
Recent progress in reinforcement learning (RL) and tactile sensing has significantly advanced dexterous manipulation. However, these methods often utilize simplified tactile signals due to the gap between tactile simulation and the real world. We introduce a sensor model for tactile skin that enables zero-shot sim-to-real transfer of ternary shear and binary normal forces. Using this model, we develop an RL policy that leverages sliding contact for dexterous inhand translation. We conduct extensive real-world experiments to assess how tactile sensing facilitates policy adaptation to various unseen object properties and robot hand orientations. We demonstrate that our 3-axis tactile policies consistently outperform baselines that use only shear forces, only normal forces, or only proprioception. Videos and details available on the project website.

Abstract:
One of the most important research challenges in upper-limb prosthetics is enhancing the user-prosthesis communication to closely resemble the experience of a natural limb. As prosthetic devices become more complex, users often struggle to control the additional degrees of freedom. In this context, leveraging shared-autonomy principles can significantly improve the usability of these systems. In this paper, we present a novel eye-in-hand prosthetic grasping system that follows these principles. Our system initiates the approach-to-grasp action based on user's command and automatically configures the DoFs of a prosthetic hand. First, it reconstructs the 3D geometry of the target object without the need of a depth camera. Then, it tracks the hand motion during the approach-to-grasp action and finally selects a candidate grasp configuration according to user's intentions. We deploy our system on the Hannes prosthetic hand and test it on able-bodied subjects and amputees to validate its effectiveness. We compare it with a multi-DoF prosthetic control baseline and find that our method enables faster grasps, while simplifying the user experience. Code and demo videos are available online at this https URL.

Abstract:
Most control techniques for prosthetic grasping focus on dexterous fingers control, but overlook the wrist motion. This forces the user to perform compensatory movements with the elbow, shoulder and hip to adapt the wrist for grasping. We propose a computer vision-based system that leverages the collaboration between the user and an automatic system in a shared autonomy framework, to perform continuous control of the wrist degrees of freedom in a prosthetic arm, promoting a more natural approach-to-grasp motion. Our pipeline allows to seamlessly control the prosthetic wrist to follow the target object and finally orient it for grasping according to the user intent. We assess the effectiveness of each system component through quantitative analysis and finally deploy our method on the Hannes prosthetic arm. Code and videos: https: //hsp-iit.github.io/hannes-wrist-control.

Abstract:
Unknown dynamic load carrying is one important practical application for quadruped robots. Such a problem is non-trivial, posing three major challenges in quadruped locomotion control. First, how to model or represent the dynamics of the load in a generic manner. Second, how to make the robot capture the dynamics without any external sensing. Third, how to enable the robot to interact with load handling the mutual effect and stabilizing the load. In this work, we propose a general load modeling approach called load characteristics modeling to capture the dynamics of the load. We integrate this proposed modeling technique and leverage recent advances in Reinforcement Learning (RL) based locomotion control to enable the robot to infer the dynamics of load movement and interact with the load indirectly to stabilize it and realize the sim-to-real deployment to verify its effectiveness in real scenarios. We conduct extensive comparative simulation experiments to validate the effectiveness and superiority of our proposed method. Results show that our method outperforms other methods in sudden load resistance, load stabilizing and locomotion with heavy load on rough terrain. Project Page.

Abstract:
This paper presents a novel learning-based approach to dynamic robot-to-human handover, addressing the challenges of delivering objects to a moving receiver. We hypothesize that dynamic handover, where the robot adjusts to the receiver's movements, results in more efficient and comfortable interaction compared to static handover, where the receiver is assumed to be stationary. To validate this, we developed a nonparametric method for generating continuous handover motion, conditioned on the receiver's movements, and trained the model using a dataset of 1,000 human-to-human handover demonstrations. We integrated preference learning for improved handover effectiveness and applied impedance control to ensure user safety and adaptiveness. The approach was evaluated in both simulation and real-world settings, with user studies demonstrating that dynamic handover significantly reduces handover time and improves user comfort compared to static methods. Videos and demonstrations of our approach are available at https://zerotohero7886.github.io/dyn-r2h-handover/.

Abstract:
We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object in a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: https://hgaurav2k.github.io/hop/.

Abstract:
The Multi-Agent Path Finding (MAPF) problem aims to determine the shortest and collision-free paths for multiple agents in a known, potentially obstacle-ridden environment. It is the core challenge for robotic deployments in large-scale logistics and transportation. Decentralized learningbased approaches have shown great potential for addressing the MAPF problems, offering more reactive and scalable solutions. However, existing learning-based MAPF methods usually rely on agents making decisions based on a limited field of view (FOV), resulting in short-sighted policies and inefficient cooperation in complex scenarios. There, a critical challenge is to achieve consensus on potential movements between agents based on limited observations and communications. To tackle this challenge, we introduce a new framework that applies sheaf theory to decentralized deep reinforcement learning, enabling agents to learn geometric cross-dependencies between each other through local consensus and utilize them for tightly cooperative decision-making. In particular, sheaf theory provides a mathematical proof of conditions for achieving global consensus through local observation. Inspired by this, we incorporate a neural network to approximately model the consensus in latent space based on sheaf theory and train it through self-supervised learning. During the task, in addition to normal features for MAPF as in previous works, each agent distributedly reasons about a learned consensus feature, leading to efficient cooperation on pathfinding and collision avoidance. As a result, our proposed method demonstrates significant improvements over state-of-the-art learning-based MAPF planners, especially in relatively large and complex scenarios, demonstrating its superiority over baselines in various simulations and real-world robot experiments.

Abstract:
Human demonstrations as prompts are a power-ful way to program robots to do long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. Existing methods for human-robot translation either depend on paired data, which is infeasible to scale, or rely heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically pairs human and robot trajectories using sequence-level optimal transport cost functions. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human videos by retrieving and composing short-horizon human clips. This approach facilitates effective policy training without the need for paired data. RHyME successfully imitates a range of cross-embodiment demonstrators, both in simulation and with a real human hand, achieving over 50% increase in task success compared to previous methods. We release our code and datasets at this website.

Abstract:
We focus on agile, continuous, and terrain-adaptive jumping of quadrupedal robots in discontinuous terrains such as stairs and stepping stones. Unlike single-step jumping, continuous jumping requires accurately executing highly dynamic motions over long horizons, which is challenging for existing approaches. To accomplish this task, we design a hierarchical learning and control framework, which consists of a learned heightmap predictor for robust terrain perception, a reinforcement-learning-based centroidal-level motion policy for versatile and terrain-adaptive planning, and a low-level model-based leg controller for accurate motion tracking. In addition, we minimize the sim-to-real gap by accurately modeling the hardware characteristics. Our framework enables a Unitree Go1 robot to perform agile and continuous jumps on human-sized stairs and sparse stepping stones, for the first time to the best of our knowledge. In particular, the robot can cross two stair steps in each jump and completes a 3.5m long, 2.8m high, 14-step staircase in 4.5 seconds. Moreover, the same policy outperforms baselines in various other parkour tasks, such as jumping over single horizontal or vertical discontinuities. Experiment videos can be found at https://yxyang.github.io/jumping_cod/.

Abstract:
Sim2real for robotic manipulation is difficult due to the challenges of simulating complex contacts and generating realistic task distributions. To tackle the latter problem, we introduce ManipGen, which leverages a new class of policies for sim2real transfer: local policies. Locality enables a variety of appealing properties including invariances to absolute robot and object pose, skill ordering, and global scene configuration. We combine these policies with foundation models for vision, language and motion planning and demonstrate SOTA zero-shot performance of our method to Robosuite benchmark tasks in simulation (97 %). We transfer our local policies from simulation to reality and observe they can solve unseen long-horizon manipulation tasks with up to 8 stages with significant pose, object and scene configuration variation. ManipGen outperforms SOTA approaches such as SayCan, Open VLA, LLMTrajGen and VoxPoser across 50 real-world manipulation tasks by 36%, 76%, 62% and 60% respectively. Video results at mihdalal.github.io/manipgen

Abstract:
Creating robots that can assist in farms and gardens can help reduce the mental and physical workload experienced by farm workers. We tackle the problem of object search in a farm environment, providing a method that allows a robot to semantically reason about the location of an unseen target object among a set of previously seen objects in the environment using a Large Language Model (LLM). We leverage object-to-object semantic relationships to plan a path through the environment that will allow us to accurately and efficiently locate our target object while also reducing the overall distance traveled, without needing high-level room or area-level semantic relationships. During our evaluations, we found that our method outperformed a current state-of-the-art baseline and our ablations. Our offline testing yielded an average path efficiency of 84 %, reflecting how closely the predicted path aligns with the ideal path. Upon deploying our system on the Boston Dynamics Spot robot in a real-world farm environment, we found that our system had a success rate of 80 %, with a success weighted by path length of 0.67, which demonstrates a reasonable trade-off between task success and path efficiency under real-world conditions. The project website can be viewed at: adi-balaji.github.io/losae

Abstract:
Current robotic systems can understand the categories and poses of objects well. But understanding physical properties like mass, friction, and hardness, in the wild, remains challenging. We propose a new method that reconstructs 3D objects using the Gaussian splatting representation and predicts various physical properties in a zero-shot manner. We propose two techniques during the reconstruction phase: a geometryaware regularization loss function to improve the shape quality and a region-aware feature contrastive loss function to promote region affinity. Two other new techniques are designed during inference: a feature-based property propagation module and a volume integration module tailored for the Gaussian representation. Our framework is named as zero-shot physical understanding with Gaussian splatting, or PUGS. PUGS achieves new state-of-the-art results on the standard benchmark of ABO-500 mass prediction. We provide extensive quantitative ablations and qualitative visualization to demonstrate the mechanism of our designs. We show the proposed methodology can help address challenging real-world grasping tasks. Our codes, data, and models are available at https://github.com/EverNorif/PUGS

Abstract:
Aiming to replicate human-like dexterity, perceptual experiences, and motion patterns, we explore learning from human demonstrations using a bimanual system with multifingered hands and visuotactile data. Two significant challenges exist: the lack of an affordable and accessible teleoperation system suitable for a dual-arm setup with multifingered hands, and the scarcity of multifingered hand hardware equipped with touch sensing. To tackle the first challenge, we develop HATO, a low-cost hands-arms teleoperation system that leverages off-the-shelf electronics, complemented with a software suite that enables efficient data collection; the comprehensive software suite also supports multimodal data processing, scalable policy learning, and smooth policy deployment. To tackle the latter challenge, we introduce a novel hardware adaptation by repurposing two prosthetic hands equipped with touch sensors for research. Using visuotactile data collected from our system, we learn skills to complete long-horizon, high-precision tasks which are difficult to achieve without multifingered dexterity and touch feedback. Furthermore, we empirically investigate the effects of dataset size, sensing modality, and visual input preprocessing on policy learning. Our results mark a promising step forward in bimanual multifingered manipulation from visuotactile data. Videos, code, and datasets can be found here.

Abstract:
Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. To address the lack of annotated bimanual folding data, we introduce a novel dataset with automatically parsed actions and language-aligned instructions, enabling better learning of text-conditioned manipulation. BiFold attains the best performance on our dataset and demonstrates strong generalization to new instructions, garments, and environments.

Abstract:
All-weather image restoration (AWIR) is crucial for reliable autonomous navigation under adverse weather conditions. AWIR models are trained to address a specific set of weather conditions such as fog, rain, and snow. But this causes them to often struggle with out-of-distribution (OoD) samples or unseen degradations which limits their effectiveness for realworld autonomous navigation. To overcome this issue, existing models must either be retrained or fine-tuned, both of which are inefficient and impractical, with retraining needing access to large datasets, and fine-tuning involving many parameters. In this paper, we propose using Low-Rank Adaptation (LoRA) to efficiently adapt a pre-trained all-weather model to novel weather restoration tasks. Furthermore, we observe that LoRA lowers the performance of the adapted model on the pre-trained restoration tasks. To address this issue, we introduce a LoRAbased fine-tuning method called LoRA-Align (LoRA-A) which seeks to align the singular vectors of the fine-tuned and pretrained weight matrices using Singular Value Decomposition (SVD). This alignment helps preserve the model's knowledge of its original tasks while adapting it to unseen tasks. We show that images restored with LoRA and LoRA-A can be effectively used for computer vision tasks in autonomous navigation, such as semantic segmentation and depth estimation. Project page: https://sudraj2002.github.io/loraapage/.

Abstract:
A dense SLAM system is essential for mobile robots, as it provides localization and allows navigation, path planning, obstacle avoidance, and decision making in unstructured environments. Due to increasing computational demands, the use of GPUs in dense SLAM is expanding. In this work, we present coVoxSLAM, a novel GPU-accelerated volumetric SLAM system that takes full advantage of the parallel processing power of the GPU to build globally consistent maps even in large-scale environments. It was deployed on different platforms (discrete and embedded GPUs) and compared with the state-of-the-art. The results obtained using public datasets show that coVoxSLAM delivers a significant performance improvement considering execution times while maintaining accurate localization. The presented system is available as an open-source system on GitHub.11https://github.com/lrse-uba/coVoxSLAM

Abstract:
Referring Image Segmentation (RIS), aims to segment the object referred by a given sentence in an image by understanding both visual and linguistic information. However, existing RIS methods tend to explore top-performance models, disregarding considerations for practical applications on resources-limited edge devices. This oversight poses a significant challenge for on-device RIS inference. To this end, we propose an effective and efficient post-training quantization framework termed PTQ4RIS. Specifically, we first conduct an in-depth analysis of the root causes of performance degradation in RIS model quantization and propose dual-region quantization (DRQ) and reorder-based outlier-retained quantization (RORQ) to address the quantization difficulties in visual and text encoders. Extensive experiments on three benchmarks with different bits settings (from 8 to 4 bits) demonstrates its superior performance. Importantly, we are the first PTQ method specifically designed for the RIS task, highlighting the feasibility of PTQ in RIS applications. The code is available at https://github.com/gugu511yy/PTQ4RIS.

Abstract:
Diffusion models have revolutionized imitation learning, allowing robots to replicate complex behaviours. However, diffusion often relies on cameras and other exteroceptive sensors to observe the environment and lacks long-term memory. In space, military, and underwater applications, robots must be highly robust to failures in exteroceptive sensors, operating using only proprioceptive information. In this paper, we propose ProDapt, a method of incorporating long-term memory of previous contacts between the robot and the environment in the diffusion process, allowing it to complete tasks using only proprioceptive data. This is achieved by identifying “keypoints”, essential past observations maintained as inputs to the policy. We test our approach using a UR10e robotic arm in both simulation and real experiments and demonstrate the necessity of this long-term memory for task completion.

Abstract:
We introduce a new method that extracts knowledge from a large language model (LLM) to produce object-level plans, which describe high-level changes to object state, and uses them to bootstrap task and motion planning (TAMP). Existing work uses LLMs to directly output task plans or generate goals in representations like PDDL. However, these methods fall short because they rely on the LLM to do the actual planning or output a hard-to-satisfy goal. Our approach instead extracts knowledge from an LLM in the form of plan schemas as an object-level representation called functional object-oriented networks (FOON), from which we automatically generate PDDL subgoals. Our method markedly outperforms alternative planning strategies in completing several pick-and-place tasks in simulation. ††Project Website: https://davidpaulius.github.io/olpllm/

Abstract:
This paper presents a unified surface reconstruction and rendering framework for LiDAR-visual systems, integrating Neural Radiance Fields (NeRF) and Neural Distance Fields (NDF) to recover both appearance and structural information from posed images and point clouds. We address the structural visible gap between NeRF and NDF by utilizing a visible-aware occupancy map to classify space into the free, occupied, visible unknown, and background regions. This classification facilitates the recovery of a complete appearance and structure of the scene. We unify the training of the NDF and NeRF using a spatial-varying scale SDF-to-density transformation for levels of detail for both structure and appearance. The proposed method leverages the learned NDF for structure-aware NeRF training by an adaptive sphere tracing sampling strategy for accurate structure rendering. In return, NeRF further refines structural in recovering missing or fuzzy structures in the NDF. Extensive experiments demonstrate the superior quality and versatility of the proposed method across various scenarios. To benefit the community, the codes will be released at https://github.com/hku-mars/M2Mapping.

Abstract:
Real-life robot navigation involves more than just reaching a destination; it requires optimizing movements while addressing scenario-specific goals. An intuitive way for humans to express these goals is through abstract cues like verbal commands or rough sketches. Such human guidance may lack details or be noisy. Nonetheless, we expect robots to navigate as intended. For robots to interpret and execute these abstract instructions in line with human expectations, they must share a common understanding of basic navigation concepts with humans. To this end, we introduce CANVAS, a novel framework that combines visual and linguistic instructions for commonsense-aware navigation. Its success is driven by imitation learning, enabling the robot to learn from human navigation behavior. We present COMMAND, a comprehensive dataset with human-annotated navigation results, spanning over 48 hours and 219 km, designed to train commonsense-aware navigation systems in simulated environments. Our experiments show that CANVAS outperforms the strong rule-based system ROS NavStack across all environments, demonstrating superior performance with noisy instructions. Notably, in the orchard environment, where ROS NavStack records a 0% total success rate, CANVAS achieves a total success rate of 67%. CANVAS also closely aligns with human demonstrations and commonsense constraints, even in unseen environments. Furthermore, real-world deployment of CANVAS showcases impressive Sim2Real transfer with a total success rate of 69%, highlighting the potential of learning from human demonstrations in simulated environments for real-world applications.

Abstract:
Recent advancements in learned 3D representations have enabled significant progress in solving complex robotic manipulation tasks, particularly for rigid-body objects. However, manipulating granular materials such as beans, nuts, and rice remains challenging due to the intricate physics of particle interactions, high-dimensional and partially observable state, inability to visually track individual particles in a pile, and the computational demands of accurate dynamics prediction. Current deep latent dynamics models often struggle to generalize in granular material manipulation due to a lack of inductive biases. In this work, we propose a novel approach that learns a visual dynamics model over Gaussian splatting repre-sentations of scenes and leverages this model for manipulating granular media via Model-Predictive Control. Our method enables efficient optimization for complex manipulation tasks on piles of granular media. We evaluate our approach in both simulated and real-world settings, demonstrating its ability to solve unseen planning tasks and generalize to new environments in a zero-shot transfer. We also show significant prediction and manipulation performance improvements compared to existing granular media manipulation methods.

Abstract:
In this work, we aim to learn a unified vision-based policy for multi-fingered robot hands to manipulate a variety of objects in diverse poses. Though prior work has shown benefits of using human videos for policy learning, performance gains have been limited by the noise in estimated trajectories. Moreover, reliance on privileged object information such as ground-truth object states further limits the applicability in realistic scenarios. To address these limitations, we propose a new framework ViViDex to improve vision-based policy learning from human videos. It first uses reinforcement learning with trajectory guided rewards to train state-based policies for each video, obtaining both visually natural and physically plausible trajectories from the video. We then rollout successful episodes from state-based policies and train a unified visual policy without using any privileged information. We propose coordinate transformation to further enhance the visual point cloud representation, and compare behavior cloning and diffusion policy for the visual policy training. Experiments both in simulation and on the real robot demonstrate that ViViDex outperforms state-of-theart approaches on three dexterous manipulation tasks. Project website: zerchen.github.io/projects/vividex.html.

Abstract:
Behavior cloning (BC) is a widely used approach in imitation learning where a robot learns a control policy by observing an expert supervisor. However the learned policy can make errors and might lead to safety violations which limits their utility in safety-critical robotics applications. While prior works have tried improving a BC policy via additional real or synthetic action labels adversarial training or runtime filtering none of them explicitly focus on reducing the BC policy's safety violations during training time. We propose SAFE-GIL a design-time method to learn safety-aware behavior cloning policies. SAFE-GIL deliberately injects adversarial disturbance in the system during data collection to guide the expert towards safety-critical states. This disturbance injection simulates potential policy errors that the system might encounter during the test time. By ensuring that training more closely replicates expert behavior in safety-critical states our approach results in safer policies despite policy errors during the test time. We further develop a reachability-based method to compute this adversarial disturbance. We compare SAFE-GIL with various behavior cloning techniques and online safety-filtering methods in three domains autonomous ground navigation aircraft taxiing and aerial navigation on a quadrotor testbed. Our method demonstrates a significant reduction in safety failures particularly in low data regimes where the likelihood of learning errors and therefore safety violations is higher. See our website here: https://y-u-c.github.io/safegil/.

Abstract:
Equipping humanoid robots with the capability to understand emotional states of human interactants and express emotions appropriately according to situations is essential for affective human-robot interaction. However, enabling current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real-world raises embodiment challenges: addressing the environmental noise issue and meeting real-time requirements. First, in multi-party conversation scenarios, the noises inherited in the visual observation of the robot, which may come from either 1) distracting objects in the scene or 2) inactive speakers appearing in the field of view of the robot, hinder the models from extracting emotional cues from vision inputs. Secondly, real-time response, a desired feature for an interactive system, is also challenging to achieve. To tackle both challenges, we introduce an affective human-robot interaction system called UGotMe designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to solve the first issue. Specifically, to filter out distracting objects in the scene, we propose extracting face images of the speakers from the raw images and introduce a customized active face extraction strategy to rule out inactive speakers. As for the second issue, we employ efficient data transmission from the robot to the local server to improve real-time response capability. We deploy UGotMe on a human robot named Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating real-world deployment are available at https://lipzh5.github.io/HumanoidVLE/

Affiliations: Department of Computer Science and Informatics, Cardiff University; College of Libera Arts, University of Minnesota - Twin Cities; Independent researcher, USA; Agricultural & Biological Engineering, Mississippi State University; Information and Software Engineering, University of Electronic Science and Technology of China; Transportation Research Institute, University of Toronto; Link Lab Civil and Environmental Engineering, University of Virginia; Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou)

Abstract:
In this paper, we introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models (a.k.a., diffusion models) and transformers. Our proposed framework, termed the “World-centric Diffusion Transformer” (WcDT), optimizes the entire trajectory generation process, from feature extraction to model inference. To enhance the scene diversity and stochasticity, the historical trajectory data is first preprocessed into “Agent Move Statement” and encoded into latent space using Denoising Diffusion Probabilistic Models (DDPM) enhanced with Diffusion with Transformer (DiT) blocks. Then, the latent features, historical trajectories, HD map features, and historical traffic signal information are fused with various transformer-based encoders that is used to enhance the interaction of agents with other elements in the traffic scene. The encoded traffic scenes are then decoded by a trajectory decoder to generate multimodal future trajectories. Comprehensive experimental results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories, showing its potential for integration into automatic driving simulation systems. Our code is available at https://github.com/yangchen1997/WcDT.

Abstract:
Evaluation methods for autonomous driving are crucial for algorithm optimization. However, due to the complexity of driving intelligence, there is currently no comprehensive evaluation method for the level of autonomous driving intelligence. In this paper, we propose an evaluation framework for driving behavior intelligence in complex traffic environments, aiming to fill this gap. We constructed a natural language evaluation dataset of human professional drivers and passengers through naturalistic driving experiments and post-driving behavior evaluation interviews. Based on this dataset, we developed an LLM-powered driving evaluation framework. The effectiveness of this framework was validated through simulated experiments in the CARLA urban traffic simulator and further corroborated by human assessment. Our research provides valuable insights for evaluating and designing more intelligent, human-like autonomous driving agents. The implementation details of the framework11https://github.com/AIR-DISCOVER/Driving-Intellenge-Evaluation-Framework and detailed information about the dataset22https://github.com/AIR-DISCOVER/Driving-Evaluation-Datasetcan be found at the provided links.

Abstract:
Mobile robots are essential in applications such as autonomous delivery and hospitality services. Applying learning-based methods to address mobile robot tasks has gained popularity due to its robustness and generalizability. Traditional methods such as Imitation Learning (IL) and Reinforcement Learning (RL) offer adaptability but require large datasets, carefully crafted reward functions, and face sim-to-real gaps, making them challenging for efficient and safe real-world deployment. We propose an online human-in-the-loop learning method PVP4Real that combines IL and RL to address these issues. PVP4Real enables efficient real-time policy learning from online human intervention and demonstration, without reward or any pretraining, significantly improving data efficiency and training safety. We validate our method by training two different robots—a legged quadruped, and a wheeled delivery robot—in two mobile robot tasks, one of which even uses raw RGBD image as observation. The training finishes within 15 minutes. Our experiments show the promising future of human-in-the-loop learning in addressing the data efficiency issue in real-world robotic tasks. More information is available at https://metadriverse.github.io/pvp4real/.

Abstract:
The well-established modular autonomous driving system is decoupled into different standalone tasks, e.g. perception, prediction and planning, suffering from information loss and error accumulation across modules. In contrast, end-to-end paradigms unify multi-tasks into a fully differentiable framework, allowing for optimization in a planning-oriented spirit. Despite the great potential of end-to-end paradigms, both the performance and efficiency of existing methods are not satisfactory, particularly in terms of planning safety. We attribute this to the computationally expensive BEV (bird's eye view) features and the straightforward design for prediction and planning. To this end, we explore the sparse representation and review the task design for end-to-end autonomous driving, proposing a new paradigm named SparseDrive. Concretely, SparseDrive consists of a symmetric sparse perception module and a parallel motion planner. The sparse perception module unifies detection, tracking and online mapping with a symmetric model architecture, learning a fully sparse representation of the driving scene. For motion prediction and planning, we review the great similarity between these two tasks, leading to a parallel design for motion planner. Based on this parallel design, which models planning as a multi-modal problem, we propose a hierarchical planning selection strategy, which incorporates a collision-aware rescore module, to select a rational and safe trajectory as the final planning output. With such effective designs, SparseDrive surpasses previous state-of-the-arts by a large margin in performance of all tasks, while achieving much higher training and inference efficiency.

Abstract:
3D Gaussian Splatting (3DGS) has become a popular solution in SLAM, as it can produce high-fidelity novel views. However, previous GS-based methods primarily target indoor scenes and rely on RGB-D sensors or pretrained depth estimation models, hence underperforming in outdoor scenarios. To address this issue, we propose a RGB-only gaussian splatting SLAM method for unbounded out-door scenes—OpenGS-SLAM. Technically, we first employ a pointmap regression network to generate consistent pointmaps between frames for pose estimation. Compared to commonly used depth maps, pointmaps include spatial relationships and scene geometry across multiple views, enabling robust camera pose estimation. Then, we propose integrating the estimated camera poses with 3DGS rendering as an end-to-end differentiable pipeline. Our method achieves simultaneous optimization of camera poses and 3DGS scene parameters, significantly enhancing system tracking accuracy. Specifically, we also design an adaptive scale mapper for the pointmap regression network, which provides more accurate pointmap mapping to the 3DGS map representation. Our experiments on the Waymo dataset demonstrate that OpenGS-SLAM reduces tracking error to 9.8% of previous 3DGS methods, and achieves state-of-the-art results in novel view synthesis. Project page: https://3dagentworld.github.io/pengs-slam/.

Abstract:
Safe motion planning algorithms are necessary for deploying autonomous robots in unstructured environments to prevent harm to humans and avoid damage to nearby objects. Generating these motion plans in real-time is also important to ensure that the robot can adapt to sudden changes in its environment. Many trajectory optimization methods introduce heuristics that balance safety and real-time performance, potentially increasing the risk of the robot colliding with its environment. This paper addresses this challenge by proposing Conformalized Reachable Sets for Obstacle Avoidance With Spheres (CROWS). CROWS is a novel real-time, receding-horizon trajectory planner that generates probablistically-safe motion plans. Offline, CROWS learns a novel neural network-based representation of a sphere-based reachable set that overapproximates the swept volume of the robot's motion. CROWS then uses conformal prediction to compute a confidence bound that provides a probabilistic safety guarantee on the learned reachable set. At runtime, CROWS performs trajectory optimization to select a trajectory that is probabilstically-guaranteed to be collision-free. We demonstrate that CROWS outperforms a variety of state-of-the-art methods in solving challenging motion planning tasks in cluttered environments while remaining collision-free. Code and video demonstrations can be found at https://roahmlab.github.io/crows/.

Abstract:
Effectively performing object rearrangement is an essential skill for mobile manipulators, e.g., setting up a dinner table. A key challenge in such problems is deciding an appropriate ordering to effectively untangle object-object dependencies while considering the necessary motions for realizing manipulation tasks (e.g., pick and place). Computing time-optimal multi-object rearrangement solutions for mobile manipulators remains a largely untapped research direction. In this work, we propose ORLA, which leverages delayed/lazy evaluation in searching for a high-quality object pick-n-place sequence that considers both end-effector and mobile robot base travel. ORLA readily handles multi-layered rearrangement tasks powered by learning-based stability predictions. Employing an optimal solver for finding temporary locations for displacing objects, ORLA can achieve global optimality. Through extensive simulation and ablation study, we confirm the effectiveness of ORLA delivering quality solutions for challenging rearrangement instances. Supplementary materials are available at: gaokai15.github.io/ORLA-Star/

Abstract:
Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks, typically in response to a natural language query about a specific physical environment. This usually involves changing the belief about the scene or physically interacting and changing the scene (e.g. sort the objects from lightest to heaviest). In order to facilitate the development of such systems we introduce a new modular Closed Loop Interactive Embodied Reasoning (CLIER) approach that takes into account the measurements of non-visual object properties, changes in the scene caused by external disturbances as well as uncertain outcomes of robotic actions. CLIER performs multi-modal reasoning and action planning and generates a sequence of primitive actions that can be executed by a robot manipulator. Our method operates in a closed loop, responding to changes in the environment. Our approach is developed with the use of MuBle simulation environment and tested in \mathbf1 0 interactive benchmark scenarios. We extensively evaluate our reasoning approach in simulation and in real-world manipulation tasks with a success rate above \mathbf7 6 % and 64%, respectively.

Abstract:
Motion planning is crucial for safe navigation in complex urban environments. Historically, motion planners (MPs) have been evaluated with procedurally-generated simulators like CARLA. However, such synthetic benchmarks do not capture real-world multi-agent interactions. nuPlan, a recently released MP benchmark, addresses this limitation by augmenting real-world driving logs with closed-loop simulation logic, effectively turning the fixed dataset into a reactive simulator. We analyze the characteristics of nuPlan's recorded logs and find that each city has its own unique driving behaviors, suggesting that robust planners must adapt to different environments. We learn to model such unique behaviors with BehaviorNet, a graph convolutional neural network (GCNN) that predicts reactive agent behaviors using features derived from recently-observed agent histories; intuitively, some aggressive agents may tailgate lead vehicles, while others may not. To model such phenomena, BehaviorNet predicts the parameters of an agent's motion controller rather than directly predicting its spacetime trajectory (as most forecasters do). Finally, we present AdaptiveDriver, a model-predictive control (MPC) based planner that unrolls different world models conditioned on Behavior-Net's predictions. Our extensive experiments demonstrate that AdaptiveDriver achieves state-of-the-art results on the nuPlan closed-loop planning benchmark, improving over prior work by 2% on Test-14 Hard R-CLS, and generalizes even when evaluated on never-before-seen cities. project page

Abstract:
The complexity of the real world demands robotic systems that can intelligently adapt to unseen situations. We present STEER, a robot learning framework that bridges highlevel, commonsense reasoning with precise, flexible low-level control. Our approach translates complex situational awareness into actionable low-level behavior through training languagegrounded policies with dense annotation. By structuring policy training around fundamental, modular manipulation skills expressed in natural language, STEER exposes an expressive interface for humans or Vision-Language Models (VLMs) to intelligently orchestrate the robot's behavior by reasoning about the task and context. Our experiments demonstrate the skills learned via STEER can be combined to synthesize novel behaviors to adapt to new situations or perform completely new tasks without additional data collection or training. Project website: https://lauramsmith.github.io/steer

Abstract:
Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and in-the-wild driving data demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.

Abstract:
Navigation Among Movable Obstacles (NAMO) poses a challenge for traditional path-planning methods when obstacles block the path, requiring push actions to reach the goal. We propose a framework that enables movability-aware planning to overcome this challenge without relying on explicit obstacle placement. Our framework integrates a global Semantic Visibility Graph and a local Model Predictive Path Integral (SVG-MPPI) approach to efficiently sample rollouts, taking into account the continuous range of obstacle movability. A physics engine is adopted to simulate the interaction result of the rollouts with the environment, and generate trajectories that minimize contact force. In qualitative and quantitative experiments, SVG-MPPI outperforms the existing paradigm that uses only binary movability for planning, achieving higher success rates with reduced cumulative contact forces. Our code is available at: https://github.com/tud-amrISVG-MPPI

Abstract:
Deep visual odometry, despite extensive research, still faces limitations in accuracy and generalizability that prevent its broader application. To address these challenges, we propose an Oriented FAST and Rotated BRIEF (ORB)-guided visual odometry with selective online adaptation named ORB-SfMLearner. We present a novel use of ORB features for learning-based ego-motion estimation, leading to more robust and accurate results. We also introduce the cross-attention mechanism to enhance the explainability of PoseNet and have revealed that driving direction of the vehicle can be explained through the attention weights. To improve generalizability, our selective online adaptation allows the network to rapidly and selectively adjust to the optimal parameters across different domains. Experimental results on KITTI and vKITTI datasets show that our method outperforms previous state-of-the-art deep visual odometry methods in terms of ego-motion accuracy and generalizability.

Abstract:
Topology reasoning is crucial for autonomous driving as it enables comprehensive understanding of connec-tivity and relationships between lanes and traffic elements. While recent approaches have shown success in perceiving driving topology using vehicle-mounted sensors, their scalability is hindered by the reliance on training data captured by consistent sensor configurations. We identify that the key factor in scalable lane perception and topology reasoning is the elimination of this sensor-dependent feature. To address this, we propose SMART, a scalable solution that leverages easily available standard-definition (SD) and satellite maps to learn a map prior model, supervised by large-scale geo-referenced high-definition (HD) maps independent of sensor settings. Attributed to scaled training, SMART alone achieves superior offline lane topology understanding using only SD and satellite inputs. Extensive experiments further demonstrate that SMART can be seamlessly integrated into any online topology reasoning methods, yielding significant improvements of up to 28% on the OpenLane-V2 benchmark. Project page: https://jay-ye.github.io/smart.

Abstract:
Autonomous driving has the potential to significantly enhance productivity and provide numerous societal benefits. Ensuring robustness in these safety-critical systems is essential, particularly when vehicles must navigate adverse weather conditions and sensor corruptions that may not have been encountered during training. Current methods often overlook uncertainties arising from adversarial conditions or distributional shifts, limiting their real-world applicability. We propose an efficient adaptation of an uncertainty estimation technique for 3D occupancy prediction. Our method dynamically calibrates model confidence using epistemic uncertainty estimates. Our evaluation under various camera corruption scenarios, such as fog or missing cameras, demonstrates that our approach effectively quantifies epistemic uncertainty by assigning higher uncertainty values to unseen data. We introduce region-specific corruptions to simulate defects affecting only a single camera and validate our findings through both scene-level and region-level assessments. Our results show superior performance in Out-of-Distribution (OoD) detection and confidence calibration compared to common baselines such as Deep Ensembles and MC-Dropout. Our approach consistently demonstrates reliable uncertainty measures, indicating its potential for enhancing the robustness of autonomous driving systems in real-world scenarios. Code and dataset are available at https://github.com/ika-rwth-aachen/OCCUQ.

Abstract:
In recent years, robust matching methods using deep learning-based approaches have been actively studied and improved in computer vision tasks. However, there remains a persistent demand for both robust and fast matching techniques. To address this, we propose a novel Mamba-based local feature matching approach, called MambaGlue, where Mamba is an emerging state-of-the-art architecture rapidly gaining recognition for its superior speed in both training and inference, and promising performance compared with Transformer architectures. In particular, we propose two modules: a) MambaAttention mixer to simultaneously and selectively understand the local and global context through the Mamba-based self-attention structure and b) deep confidence score regressor, which is a multi-layer perceptron (MLP)-based architecture that evaluates a score indicating how confidently matching predictions correspond to the ground-truth correspondences. Consequently, our MambaGlue achieves a balance between robustness and efficiency in real-world applications. As verified on various public datasets, we demonstrate that our MambaGlue yields a substantial performance improvement over baseline approaches while maintaining fast inference speed. Our code will be available on https://github.com/url-kaist/MambaGlue.

Abstract:
Current methods based on Neural Radiance Fields fail in the low data limit, particularly when training on incomplete scene data. Prior works augment training data only in next-best-view applications, which lead to hallucinations and model collapse with sparse data. In contrast, we propose adding a set of views during training by rejection sampling from a posterior uncertainty distribution, generated by combining a volumetric uncertainty estimator with spatial coverage. We validate our results on partially observed scenes; on average, our method performs 39.9% better with 87.5% less variability across established scene reconstruction benchmarks, as compared to state of the art baselines. We further demonstrate that augmenting the training set by sampling from any distribution leads to better, more consistent scene reconstruction in sparse environments. This work is foundational for robotic tasks where augmenting a dataset with informative data is critical in resource-constrained, a priori unknown environments. Videos and source code are available at https://murpheylab.github.iollow-data-nerfl

Abstract:
We present Verticoder, a self-supervised representation learning approach for robot mobility on vertically challenging terrain. Using the same pre-training process, Ver-ticodercan handle four different downstream tasks, in-cluding forward kinodynamics learning, inverse kinodynamics learning, behavior cloning, and patch reconstruction with a single representation. Verticoder uses a TransformerEn-coder to learn the local context of its surroundings by random masking and next patch reconstruction. We show that Verti-coderachieves better performance across all four different tasks compared to specialized End - to- End models with 77 % fewer parameters. We also show Verticoder's comparable performance against state-of-the-art kinodynamic modeling and planning approaches in real-world robot deployment. These results underscore the efficacy of Verticoder in mitigating overfitting and fostering more robust generalization across diverse environmental contexts and downstream vehicle kin-odynamic tasks11https://github.com/mhnazeri/VertiCoder.

Abstract:
Realistic scene reconstruction and view synthesis are essential for advancing autonomous driving systems by simulating safety-critical scenarios. 3D Gaussian Splatting (3DGS) excels in real-time rendering and static scene reconstructions but struggles with modeling driving scenarios due to complex backgrounds, dynamic objects, and sparse camera views. We propose AutoSplat, a framework employing Gaussian splatting to realistically reconstruct autonomous driving scenes. By imposing geometric constraints on Gaussians representing the road and sky regions, our method enables multi-view consistent simulation of challenging scenarios, including lane changes. Leveraging 3D templates, we introduce a reflected Gaussian consistency constraint to supervise both the visible and unseen side of foreground objects. Moreover, to model the dynamic appearance of foreground objects, we estimate temporally-dependent residual spherical harmonics for each foreground Gaussian. Extensive experiments on Pandaset [1] and KITTI [2] demonstrate that AutoSplat outperforms state-of-the-art methods in scene reconstruction and novel view synthesis across diverse driving scenarios. Our project page can be found here: https://autosplat.github.io/

Abstract:
Camera-based occupancy prediction is a main-stream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Almost existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, with good yet limited performance. Few studies explore from the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released project page.

Abstract:
Existing indoor SLAM datasets primarily focus on robot sensing, often lacking building architectures. To address this gap, we design and construct the first dataset to couple the SLAM and BIM, named SLABIM. This dataset provides BIM and SLAM -oriented sensor data, both modeling a university building at HKUST. The as-designed BIM is decomposed and converted for ease of use. We employ a multi-sensor suite for multi-session data collection and mapping to obtain the as-built model. All the related data are timestamped and organized, enabling users to deploy and test effectively. Furthermore, we deploy advanced methods and report the experimental results on three tasks: registration, localization and semantic mapping, demonstrating the effectiveness and practicality of SLAB 1M. We make our dataset open-source at https://github.com/HKUST-Aerial-Robotics/SLABIM.

Abstract:
Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model's sensitivity using automated image editing tools. Our approach is compatible with any off the shelf VLA without model fine-tuning or access to the model's weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds, which otherwise degrade task success rates by up to 60%. Website with additional information, videos, and code: https://aasherh.github.io/byovla/.

Abstract:
Preference feedback collected by human or VLM annotators is often noisy, presenting a significant challenge for preference-based reinforcement learning that relies on accurate preference labels. To address this challenge, we propose TREND, a novel framework that integrates few-shot expert demonstrations with a tri-teaching strategy for effective noise mitigation. Our method trains three reward models simultaneously, where each model views its small-loss preference pairs as useful knowledge and teaches such useful pairs to its peer network for updating the parameters. Remarkably, our approach requires as few as one to three expert demonstrations to achieve high performance. We evaluate TREND on various robotic manipulation tasks, achieving up to 90% success rates even with noise levels as high as 40%, highlighting its effective robustness in handling noisy preference feedback.

Abstract:
With the rising focus on quadrupeds, a generalized policy capable of handling different robot models and sensor inputs becomes highly beneficial. Although several methods have been proposed to address different morphologies, it remains a challenge for learning-based policies to manage various combinations of proprioceptive information. This paper presents Masked Sensory-Temporal Attention (MSTA), a novel transformer-based mechanism with masking for quadruped locomotion. It employs direct sensor-level attention to enhance the sensory-temporal understanding and handle different combinations of sensor data, serving as a foundation for incorporating unseen information. MSTA can effectively understand its states even with a large portion of missing information, and is flexible enough to be deployed on physical systems despite the long input sequence.

Abstract:
Preference-Based reinforcement learning (PBRL) learns directly from the preferences of human teachers regarding agent behaviors without needing meticulously designed reward functions. However, existing PBRL methods often learn primarily from explicit preferences, neglecting the possibility that teachers may choose equal preferences. This neglect may hinder the understanding of the agent regarding the task perspective of the teacher, leading to the loss of important information. To address this issue, we introduce the Equal Preference Learning Task, which optimizes the neural network by promoting similar reward predictions when the behaviors of two agents are labeled as equal preferences. Building on this task, we propose a novel PBRL method, Multi-Type Preference Learning (MTPL), which allows simultaneous learning from equal preferences while leveraging existing methods for learning from explicit preferences. To validate our approach, we design experiments applying MTPL to four existing state-of-the-art baselines across ten locomotion and robotic manipulation tasks in the DeepMind Control Suite. The experimental results indicate that simultaneous learning from both equal and explicit preferences enables the PBRL method to more comprehensively understand the feedback from teachers, thereby enhancing feedback efficiency. Project page: https://github.com/FeiCuiLengMMbb/paper_MTPL

Abstract:
Understanding the traversability of terrain is essential for autonomous robot navigation, particularly in unstructured environments such as natural landscapes. Although traditional methods, such as occupancy mapping, provide a basic framework, they often fail to account for the complex mobility capabilities of some platforms such as legged robots. In this work, we propose a method for estimating terrain traversability by learning from demonstrations of human walking. Our approach leverages dense, pixel-wise feature embeddings generated using the DINOv2 vision Transformer model, which are processed through an encoder-decoder MLP architecture to analyze terrain segments. The averaged feature vectors, extracted from the masked regions of interest, are used to train the model in a reconstruction-based framework. By minimizing reconstruction loss, the network distinguishes between familiar terrain with a low reconstruction error and unfamiliar or hazardous terrain with a higher reconstruction error. This approach facilitates the detection of anomalies, allowing a legged robot to navigate more effectively through challenging terrain. We run real-world experiments on the ANYmal legged robot both indoor and outdoor to prove our proposed method. The code is open-source, while video demonstrations can be found on our website: https://rpl-cs-ucl.github.io/STEPP/

Affiliations: Department of Surgery, Shanghai Key Laboratory of Gastric Neoplasms, Shanghai Institute of Digestive Surgery, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Department of Urology, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China; School of Intelligent Science and Technology, Nanjing University, China; School of Computer Science, Peking University, Beijing, China

Abstract:
Robot-assisted surgery has profoundly influenced current forms of minimally invasive surgery. However, in transurethral urological surgical robots, they need to work in a liquid environment. This causes vaporization of the liquid when shearing and heating is performed, resulting in bubble atomization that affects the visual perception of the robot. This can lead to the need for uninterrupted pauses in the surgical procedure, which makes the surgery take longer. To address the atomization characteristics of liquids under urological surgical robotic vision, we propose an unsupervised zero-shot dehaze method (RSF-Dehaze). Specifically, the proposed Region Similarity Filling Module (RSFM) of RSF-Dehaze significantly improves the recovery of blurred region tissues. In addition, we organize and propose a dehaze dataset for robotic vision in urological surgery (USRobot-Dehaze dataset). In particular, this dataset contains the three most common urological surgical robot operation scenarios. To the best of our knowledge, we are the first to organize and propose a publicly available dehaze dataset for urological surgical robot vision. The proposed RSF-Dehaze proves the effectiveness of our method in three urological surgical robot operation scenarios with extensive comparative experiments with 20 most classical and advanced dehazing and image recovery algorithms. The proposed source code and dataset are available at https://github.com/wurenkai/RSF-Dehaze.

Abstract:
To use assistive robots in everyday life, a remote control system with common devices, such as 2D devices, is helpful to control the robots anytime and anywhere as intended. Hand-drawn sketches are one of the intuitive ways to control robots with 2D devices. However, since similar sketches have different intentions from scene to scene, existing work requires additional modalities to set the sketches' semantics. This requires complex operations for users and leads to decreasing usability. In this paper, we propose Sketch-MoMa, a teleoperation system using user-given hand-drawn sketches as instructions to control a robot. We use Vision-Language Models (VLMs) to understand the user-given sketches superimposed on an observation image and infer drawn shapes and low-level tasks of the robot. We utilize sketches and the generated shapes for recognition and motion planning of the generated low-level tasks for precise and intuitive operations. We validate our approach using state-of-the-art VLMs with 7 tasks and 5 sketch shapes. We also demonstrate that our approach effectively specifies more detailed intentions, such as how to grasp and how much to rotate. Moreover, we show the competitive usability of our approach compared with the existing 2D interface through a user experiment with 14 participants. Our videos and results are available at this link.

Abstract:
Robust autonomous navigation in environments with limited visibility remains a critical challenge in robotics. We present a novel approach that leverages Non-Line-of-Sight (NLOS) sensing using single-photon LiDAR to improve visibility and enhance autonomous navigation. Our method enables mobile robots to “see around corners” by utilizing multi-bounce light information, effectively expanding their perceptual range without additional infrastructure. We propose a three-module pipeline: (1) Sensing, which captures multi-bounce histograms using SPAD-based LiDAR; (2) Perception, which estimates occupancy maps of hidden regions from these histograms using a convolutional neural network; and (3) Control, which allows a robot to follow safe paths based on the estimated occupancy. We evaluate our approach through simulations and real-world experiments on a mobile robot navigating an L-shaped corridor with hidden obstacles. Our work represents the first experimental demonstration of NLOS imaging for autonomous navigation, paving the way for safer and more efficient robotic systems operating in complex environments. We also contribute a novel dynamics-integrated transient rendering framework for simulating NLOS scenarios, facilitating future research in this domain.

Abstract:
Deploying depth estimation networks in the real world requires high-level robustness against various adverse conditions to ensure safe and reliable autonomy. For this purpose, many autonomous vehicles employ multi-modal sensor systems, including an RGB camera, NIR camera, thermal camera, LiDAR, or Radar. They mainly adopt two strategies to use multiple sensors: modality-wise and multi-modal fused inference. The former method is flexible but memory-inefficient, unreliable, and vulnerable. Multi-modal fusion can provide high-level reliability, yet it needs a specialized architecture. In this paper, we propose an effective solution, named align-and-fuse strategy, for the depth estimation from multi-spectral images. In the align stage, we align embedding spaces between multiple spectrum bands to learn shareable representation across multi-spectral images by minimizing contrastive loss of global and spatially aligned local features with geometry cue. After that, in the fuse stage, we train an attachable feature fusion module that can selectively aggregate the multi-spectral features for reliable and robust prediction results. Based on the proposed method, a single-depth network can achieve both spectral-invariant and multi-spectral fused depth estimation while preserving reliability, memory efficiency, and flexibility.

Abstract:
Classical manipulator motion planners work across different robot embodiments [1]. However they plan on a pre-specified static environment representation, and are not scalable to unseen dynamic environments. Neural Motion Planners (NMPs) [2] are an appealing alternative to conventional planners as they incorporate different environmental constraints to learn motion policies directly from raw sensor observations. Contemporary state-of-the-art NMPs can successfully plan across different environments [3]. However none of the existing NMPs generalize across robot embodiments. In this paper we propose Cross-Embodiment Motion Policy (XMoP), a neural policy for learning to plan over a distribution of manipulators. XMoP implicitly learns to satisfy kinematic constraints for a distribution of robots and zero-shot transfers the planning behavior to unseen robotic manipulators within this distribution. We achieve this generalization by formulating a whole-body control policy that is trained on planning demonstrations from over three million procedurally sampled robotic manipulators in different simulated environments. Despite being completely trained on synthetic embodiments and environments, our policy exhibits strong sim-to-real generalization across manipulators with different kinematic variations and degrees of freedom with a single set of frozen policy parameters. We evaluate XMoP on 7 commercial manipulators and show successful cross-embodiment motion planning, achieving an average 70 % success rate on baseline benchmarks. Furthermore, we demonstrate sim-to-real deployment on two unseen manipulators solving novel planning problems across three real-world domains even with dynamic obstacles.

Abstract:
NeRF-based SLAM has recently achieved promising results in tracking and reconstruction. However, existing methods face challenges in providing sufficient scene representation, capturing structural information, and maintaining global consistency in scenes emerging significant movement or being forgotten. To this end, we present HS-SLAM to tackle these problems. To enhance scene representation capacity, we propose a hybrid encoding network that combines the complementary strengths of hash-grid, tri-planes, and one-blob, improving the completeness and smoothness of reconstruction. Additionally, we introduce structural supervision by sampling patches of non-local pixels rather than individual rays to better capture the scene structure. To ensure global consistency, we implement an active global bundle adjustment (BA) to eliminate camera drifts and mitigate accumulative errors. Experimental results demonstrate that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics.

Abstract:
In this paper, we present a real-time photo-realistic SLAM method based on marrying Gaussian Splatting with LiDAR-Inertial-Camera SLAM. Most existing radiance-field-based SLAM systems mainly focus on bounded indoor environments, equipped with RGB-D or RGB sensors. However, they are prone to decline when expanding to unbounded scenes or encountering adverse conditions, such as violent motions and changing illumination. In contrast, oriented to general scenarios, our approach additionally tightly fuses LiDAR, IMU, and camera for robust pose estimation and photo-realistic online mapping. To compensate for regions unobserved by the LiDAR, we propose to integrate both the triangulated visual points from images and LiDAR points for initializing 3D Gaussians. In addition, the modeling of the sky and varying camera exposure have been realized for high-quality rendering. Notably, we implement our system purely with C++ and CUDA, and meticulously design a series of strategies to accelerate the online optimization of the Gaussian-based scene representation. Extensive experiments demonstrate that our method outperforms its counterparts while maintaining real-time capability. Impressively, regarding photo-realistic mapping, our method with our estimated poses even surpasses all the compared approaches that utilize privileged ground-truth poses for mapping. Our code will be released on project page https://xingxingzuo.github.io/gaussian_lic.

Abstract:
To navigate safely and efficiently in crowded spaces, robots should not only perceive the current state of the environment but also anticipate future human movements. In this paper, we propose a reinforcement learning architecture, namely Falcon, to tackle socially-aware navigation by explicitly predicting human trajectories and penalizing actions that block future human paths. To facilitate realistic evaluation, we introduce a novel SocialNav benchmark containing two new datasets, Social-HM3D and Social-MP3D. This benchmark offers large-scale photo-realistic indoor scenes populated with a reasonable amount of human agents based on scene area size, incorporating natural human movements and trajectory patterns. We conduct a detailed experimental analysis with the state-of-the-art learning-based method and two classic rulebased path-planning algorithms on the new benchmark. The results demonstrate the importance of future prediction and our method achieves the best task success rate of 55% while maintaining about 90% personal space compliance. We will release our code and datasets.

Abstract:
Safe and successful deployment of robots requires not only the ability to generate complex plans but also the capacity to frequently replan and correct execution errors. This paper addresses the challenge of long-horizon trajectory planning under temporally extended objectives in a receding horizon manner. To this end, we propose Doppler, a data-driven hierarchical framework that generates and updates plans based on instruction specified by linear temporal logic (LTL). Our method decomposes temporal tasks into chain of options with hierarchical reinforcement learning from offline non-expert datasets. It leverages diffusion models to generate options with low-level actions. We devise a determinantal-guided posterior sampling technique during batch generation, which improves the speed and diversity of diffusion generated options, leading to more efficient querying. Experiments on robot navigation and manipulation tasks demonstrate that Doppler can generate sequences of trajectories that progressively satisfy the specified formulae for obstacle avoidance and sequential visitation.

Abstract:
Grasp learning in noisy environments, such as occlusions, sensor noise, and out-of-distribution (OOD) objects, poses significant challenges. Recent learning-based approaches focus primarily on capturing aleatoric uncertainty from inherent data noise. The epistemic uncertainty, which represents the OOD recognition, is often addressed by ensembles with multiple forward paths, limiting real-time application. In this paper, we propose an uncertainty-aware approach for 6-DoF grasp detection using evidential learning to comprehensively capture both uncertainties in real-world robotic grasping. As a key contribution, we introduce vMF-Contact, a novel architecture for learning hierarchical contact grasp representations with probabilistic modeling of directional uncertainty as von Mises-Fisher (vMF) distribution. To achieve this, we analyze the theoretical formulation of the second-order objective on the posterior parametrization, providing formal guarantees for the model's ability to quantify uncertainty and improve grasp prediction performance. Moreover, we enhance feature expressiveness by applying partial point reconstructions as an auxiliary task, improving the comprehension of uncertainty quantification as well as the generalization to unseen objects. In the real-world experiments, our method demonstrates a significant improvement by 39% in the overall clearance rate compared to the baselines. The code is available under: https://github.com/YitianShi/vMF-Contact/tree/main

Abstract:
In order to navigate safely and reliably in off-road and unstructured environments, robots must detect anomalies that are out-of-distribution (OOD) with respect to the training data. We present an analysis-by-synthesis approach for pixel-wise anomaly detection without making any assumptions about the nature of OOD data. Given an input image, we use a generative diffusion model to synthesize an edited image that removes anomalies while keeping the remaining image unchanged. Then, we formulate anomaly detection as analyzing which image segments were modified by the diffusion model. We propose a novel inference approach for guided diffusion by analyzing the ideal guidance gradient and deriving a principled approximation that bootstraps the diffusion model to predict guidance gradients. Our editing technique is purely test-time that can be integrated into existing workflows without the need for retraining or fine-tuning. Finally, we use a combination of vision-language foundation models to compare pixels in a learned feature space and detect semantically meaningful edits, enabling accurate anomaly detection for off-road navigation.

Abstract:
Locating objects described in natural language presents a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries, but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs 3D scene graph representation with metric and semantic spatial edges and utilizes a large language model as a human-toagent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe them as graph nodes. On the Replica and ScanNet datasets, we have demonstrated that BBQ takes a leading place in openvocabulary 3D semantic segmentation compared to other zeroshot methods. Also, we show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On challenging Sr3D+, Nr3D and ScanRefer benchmarks, our deductive approach demonstrates a significant improvement, enabling objects grounding by complex queries compared to other state-of-the-art methods. The combination of our design choices and software implementation has resulted in significant data processing speed in experiments on the robot on-board computer. This promising performance enables the application of our approach in intelligent robotics projects. We made the code publicly available at linukc.github.io/BeyondBareQueries.

Abstract:
Effectively manipulating articulated objects in household scenarios is a crucial step toward achieving general embodied artificial intelligence. Mainstream research in 3D vision has primarily focused on manipulation through depth perception and pose detection. However, in real-world environments, these methods often face challenges due to imperfect depth perception, such as with transparent lids and reflective handles. Moreover, they generally lack the diversity in partbased interactions required for flexible and adaptable manipulation. To address these challenges, we introduced a largescale part-centric dataset for articulated object manipulation that features both photo-realistic material randomizations and detailed annotations of part-oriented, scene-level actionable interaction poses. We evaluated the effectiveness of our dataset by integrating it with several state-of-the-art methods for depth estimation and interaction pose prediction. Additionally, we proposed a novel modular framework that delivers superior and robust performance for generalizable articulated object manipulation. Our extensive experiments demonstrate that our dataset significantly improves the performance of depth perception and actionable interaction pose prediction in both simulation and real-world scenarios. More information and demos can be found at: https://pku-epic.github.io/GAPartManip/.

Abstract:
3D semantic occupancy prediction is a crucial task in visual perception, as it requires the simultaneous comprehension of both scene geometry and semantics. It plays a crucial role in understanding 3D scenes and has great potential for various applications, such as robotic vision perception and autonomous driving. Many existing works utilize planar-based representations such as Bird's Eye View (BEV) and Tri-Perspective View (TPV). These representations aim to simplify the complexity of 3D scenes while preserving essential object information, thereby facilitating efficient scene representation. However, in dense indoor environments with prevalent occlusions, directly applying these planar-based methods often leads to difficulties in capturing global semantic occupancy, ultimately degrading model performance. In this paper, we present a new vertical slice representation that divides the scene along the vertical axis and projects spatial point features onto the nearest pair of parallel planes. To utilize these slice features, we propose SliceOcc, an RGB camera-based model specifically tailored for indoor 3D semantic occupancy prediction. SliceOcc utilizes pairs of slice queries and cross-attention mechanisms to extract planar features from input images. These local planar features are then fused to form a global scene representation, which is employed for indoor occupancy prediction. Experimental results on the EmbodiedScan dataset demonstrate that SliceOcc achieves a mIoU of 15.45 % across 81 indoor categories, setting a new state-of-the-art performance among RGB camera-based models for indoor 3D semantic occupancy prediction.

Abstract:
In this paper, we address the critical bottleneck in robotics caused by the scarcity of diverse 3D data by presenting a novel two-stage approach for generating high-quality 3D models from a single image. This method is motivated by the need to efficiently expand 3D asset creation, particularly for robotics datasets, where the variety of object types is currently limited compared to general image datasets. Unlike previous methods that primarily rely on general diffusion priors, which often struggle to align with the reference image, our approach leverages subject-specific prior knowledge. By incorporating subject-specific priors in both geometry and texture, we ensure precise alignment between the generated 3D content and the reference object. Specifically, we introduce a shading modeaware prior into the NeRF optimization process, enhancing the geometry and refining texture in the coarse outputs to achieve superior quality. Extensive experiments demonstrate that our method significantly outperforms prior approaches.

Abstract:
Navigating and understanding complex environments over extended periods of time is a significant challenge for robots. People interacting with the robot may want to ask questions like where something happened, when it occurred, or how long ago it took place, which would require the robot to reason over a long history of their deployment. To address this problem, we introduce a Retrieval-augmented Memory for Embodied Robots, or ReMEmbR, a system designed for long-horizon video question answering for robot navigation. To evaluate ReMEmbR, we introduce the NaVQA dataset where we annotate spatial, temporal, and descriptive questions to long-horizon robot navigation videos. ReMEmbR employs a structured approach involving a memory building and a querying phase, leveraging temporal information, spatial information, and images to efficiently handle continuously growing robot histories. Our experiments demonstrate that ReMEmbR outperforms LLM and VLM baselines, allowing ReMEmbR to achieve effective long-horizon reasoning with low latency. Additionally, we deploy ReMEmbR on a robot and show that our approach can handle diverse queries. The dataset, code, videos, and other material can be found at the following link: https://nvidia-ai-iot.github.io/remembr.

Abstract:
Accurate depth estimation is crucial for 3D scene comprehension in robotics and autonomous vehicles. Fisheye cameras, known for their wide field of view, have inherent geometric benefits. However, their use in depth estimation is restricted by a scarcity of ground truth data and image distortions. We present FisheyeDepth, a self-supervised depth estimation model tailored for fisheye cameras. We incorporate a fisheye camera model into the projection and reprojection stages during training to handle image distortions, thereby improving depth estimation accuracy and training stability. Furthermore, we incorporate real-scale pose information into the geometric projection between consecutive frames, replacing the poses estimated by the conventional pose network. Essentially, this method offers the necessary physical depth for robotic tasks, and also streamlines the training and inference procedures. Additionally, we devise a multi-channel output strategy to improve robustness by adaptively fusing features at various scales, which reduces the noise from real pose data. We demonstrate the superior performance and robustness of our model in fisheye image depth estimation through evaluations on public datasets and real-world scenarios. The project website is available at: https://github.com/guoyangzhaolFisheyeDepth.

Abstract:
Traversability estimation is the foundation of path planning for a general navigation system. However, complex and dynamic environments pose challenges for the latest methods using self-supervised learning (SSL) technique. Firstly, existing SSL-based methods generate sparse annotations lacking detailed boundary information. Secondly, their strategies focus on hard samples for rapid adaptation, leading to forgetting and biased predictions. In this work, we propose IMOST, a continual traversability learning framework composed of two key modules: incremental dynamic memory (IDM) and self-supervised annotation (SSA). By mimicking human memory mechanisms, IDM allocates novel data samples to new clusters according to information expansion criterion. It also updates clusters based on diversity rule, ensuring a representative characterization of new scene. This mechanism enhances scene-aware knowledge diversity while maintaining a compact memory capacity. The SSA module, integrating FastSAM, utilizes point prompts to generate complete annotations in real time which reduces training complexity. Furthermore, IMOST has been successfully deployed on the quadruped robot, with performance evaluated during the online learning process. Experimental results on both public and self-collected datasets demonstrate that our IMOST outperforms current state-of-the-art method, maintains robust recognition capabilities and adaptability across various scenarios. The code is available at https://github.com/SJTUMKH/OCLTrav.

Abstract:
We propose Hier-SLAM, a semantic 3D Gaussian Splatting SLAM method featuring a novel hierarchical categorical representation, which enables accurate global 3D semantic mapping, scaling-up capability, and explicit semantic label prediction in the 3D world. The parameter usage in semantic SLAM systems increases significantly with the growing complexity of the environment, making it particularly challenging and costly for scene understanding. To address this problem, we introduce a novel hierarchical representation that encodes semantic information in a compact form into 3D Gaussian Splatting, leveraging the capabilities of large language models (LLMs). We further introduce a novel semantic loss designed to optimize hierarchical semantic information through both inter-level and cross-level optimization. Furthermore, we enhance the whole SLAM system, resulting in improved tracking and mapping performance. Our Hier-SLAM outperforms existing dense SLAM methods in both mapping and tracking accuracy, while achieving a 2x operation speed-up. Additionally, it achieves on-par semantic rendering performance compared to existing methods while significantly reducing storage and training time requirements. Rendering FPS impressively reaches 2,000 with semantic information and 3,000 without it. Most notably, it showcases the capability of handling the complex real-world scene with more than 500 semantic classes, highlighting its valuable scaling-up capability. The open-source code is available at https://github.com/LeeBY68/Hier-SLAM.

Abstract:
In autonomous driving, 3D object detection is essential for accurately identifying and tracking objects. Despite the continuous development of various technologies for this task, a significant drawback is observed in most of them—they experience substantial performance degradation when detecting objects in unseen domains. In this paper, we propose a method to improve the generalization ability for 3D object detection on a single domain. We primarily focus on generalizing from a single source domain to target domains with distinct sensor configurations and scene distributions. To learn sparsity-invariant features from a single source domain, we selectively subsample the source data to a specific beam, using confidence scores determined by the current detector to identify the density that holds utmost importance for the detector. Subsequently, we employ the teacher-student framework to align the Bird's Eye View (BEV) features for different point clouds densities. We also utilize feature content alignment (FCA) and graph-based embedding relationship alignment (GERA) to instruct the detector to be domain-agnostic. Extensive experiments demonstrate that our method exhibits superior generalization capabilities compared to other baselines. Furthermore, our approach even outperforms certain domain adaptation methods that can access to the target domain data. The code is available at https://github.com/Tiffamy/3DOD-LSF.

Abstract:
Recently, neural radiance fields (NeRF) have gained significant attention in the field of visual localization. However, existing NeRF-based approaches either lack geometric constraints or require extensive storage for feature matching, limiting their practical applications. To address these challenges, we propose an efficient and novel visual localization approach based on the neural implicit map with complementary features. Specifically, to enforce geometric constraints and reduce storage requirements, we implicitly learn a 3D keypoint descriptor field, avoiding the need to explicitly store point-wise features. To further address the semantic ambiguity of descriptors, we introduce additional semantic contextual feature fields, which enhance the quality and reliability of 2D-3D correspondences. Besides, we propose descriptor similarity distribution alignment to minimize the domain gap between 2D and 3D feature spaces during matching. Finally, we construct the matching graph using both complementary descriptors and contextual features to establish accurate 2D3D correspondences for 6-DoF pose estimation. Compared with the recent NeRF-based approaches, our method achieves a 3 × faster training speed and a 45 × reduction in model storage. Extensive experiments on two widely used datasets demonstrate that our approach outperforms or is highly competitive with other state-of-the-art NeRF-based visual localization methods. Project page: https://zju3dv.github.io/neuraloc

Abstract:
Surgical phase recognition is critical for assisting surgeons in understanding surgical videos. Existing studies focused more on online surgical phase recognition, by leveraging preceding frames to predict the current frame. Despite great progress, they formulated the task as a series of frame-wise classification, which resulted in a lack of global context of the entire procedure and incoherent predictions. Moreover, besides online analysis, accurate offline surgical phase recognition is also in significant clinical need for retrospective analysis, and existing online algorithms do not fully analyze the entire video, thereby limiting accuracy in offline analysis. To over-come these challenges and enhance both online and offline inference capabilities, we propose a universal Surgical Phase LocalizAtion Network, named SurgPLAN++, with the principle of temporal detection. To ensure a global understanding of the surgical procedure, we devise a phase localization strategy for SurgPLAN ++ to predict phase segments across the entire video through phase proposals. For online analysis, to generate high-quality phase proposals, SurgPLAN++ incorporates a data augmentation strategy to extend the streaming video into a pseudo-complete video through mirroring, center-duplication, and down-sampling. For offline analysis, SurgPLAN++ capi-talizes on its global phase prediction framework to continu-ously refine preceding predictions during each online inference step, thereby significantly improving the accuracy of phase recognition. We perform extensive experiments to validate the effectiveness, and our SurgPLAN++ achieves remarkable performance in both online and offline modes, which outper-forms state-of-the-art methods. The source code is available at https://github.com/franciszchenlSurgPLAN-Plus.

Abstract:
Multi-agent coordination is crucial for reliable multi-robot navigation in shared spaces such as automated warehouses. In regions of dense robot traffic, local coordination methods may fail to find a deadlock-free solution. In these scenarios, it is appropriate to let a central unit generate a global schedule that decides the passing order of robots. However, the runtime of such centralized coordination methods increases significantly with the problem scale. In this paper, we propose to leverage Graph Neural Network Variational Autoencoders (GNN-VAE) to solve the multi-agent coordination problem at scale faster than through centralized optimization. We formulate the coordination problem as a graph problem and collect ground truth data using a Mixed-Integer Linear Program (MILP) solver. During training, our learning framework encodes good quality solutions of the graph problem into a latent space. At inference time, solution samples are decoded from the sampled latent variables, and the lowest-cost sample is selected for coordination. By construction, our GNN-VAE framework returns solutions that always respect the constraints of the considered coordination problem. Numerical results show that our approach trained on small-scale problems can achieve high-quality solutions even for large-scale problems with 250 robots, being much faster than other baselines.

Abstract:
Enhancing a low-light noisy RAW image into a well-exposed and clean sRGB image is a significant challenge for modern digital cameras. Prior approaches have difficulties in recovering fine-grained details and true colors of the scene under extremely low-light environments due to near-to-zero SNR. Meanwhile, diffusion models have shown significant progress towards general domain image generation. In this paper, we propose to leverage the pre-trained latent diffusion model to perform the neural ISP for enhancing extremely low-light images. Specifically, to tailor the pre-trained latent diffusion model to operate on the RAW domain, we train a set of lightweight taming modules to inject the RAW information into the diffusion denoising process via modulating the intermediate features of UNet. We further observe different roles of UNet denoising and decoder reconstruction in the latent diffusion model, which inspires us to decompose the lowlight image enhancement task into latent-space low-frequency content generation and decoding-phase high-frequency detail maintenance. Through extensive experiments on representative datasets, we demonstrate our simple design not only achieves state-of-the-art performance in quantitative evaluations but also shows significant superiority in visual comparisons over strong baselines, which highlight the effectiveness of powerful generative priors for neural ISP under extremely low-light environments.

Abstract:
Catching objects in flight (i.e., thrown objects) is a common daily skill for humans, yet it presents a significant challenge for robots. This task requires a robot with agile and accurate motion, a large spatial workspace, and the ability to interact with diverse objects. In this paper, we build a mobile manipulator composed of a mobile base, a 6-DoF arm, and a 12-DoF dexterous hand to tackle such a challenging task. We propose a two-stage reinforcement learning framework to efficiently train a whole-body-control catching policy for this high-DoF system in simulation. The objects' throwing configurations, shapes, and sizes are randomized during training to enhance policy adaptivity to various trajectories and object characteristics in flight. The results show that our trained policy catches diverse objects with randomly thrown trajectories, at a high success rate of about 80 % in simulation, outperforming the baselines significantly. The policy trained in simulation can be deployed seamlessly to the real world with only onboard sensing and computation, which achieves catching sandbags in various shapes, randomly thrown by humans. Our video and code are available at https://mobile-dex-catch.github.io.

Abstract:
We would like to estimate the pose and full shape of an object from a single observation, without assuming known 3D model or category. In this work, we propose OmniShape, the first method of its kind to enable probabilistic pose and shape estimation. OmniShape is based on the key insight that shape completion can be decoupled into two multi-modal distributions: one capturing how measurements project into a normalized object reference frame defined by the dataset and the other modelling a prior over object geometries represented as triplanar neural fields. By training separate conditional diffusion models for these two distributions, we enable sampling multiple hypotheses from the joint pose and shape distri-bution. OmniShape demonstrates compelling performance on challenging real world datasets. Project website: https://tri-ml.glthub.io/omnishape.

Abstract:
Numerous real-world control problems involve dynamics and objectives affected by unobservable hidden parameters, ranging from autonomous driving to robotic manipulation, which cause performance degradation during sim-to-real transfer. To represent these kinds of domains, we adopt hiddenparameter Markov decision processes (HIP-MDPs), which model sequential decision problems where hidden variables parameterize transition and reward functions. Existing approaches, such as domain randomization, domain adaptation, and meta-learning, simply treat the effect of hidden parameters as additional variance and often struggle to effectively handle HIP-MDP problems, especially when the rewards are parameterized by hidden variables. We introduce PrivilegedDreamer, a model-based reinforcement learning framework that extends the existing model-based approach by incorporating an explicit parameter estimation module. PrivilegedDreamer features its novel dual recurrent architecture that explicitly estimates hidden parameters from limited historical data and enables us to condition the model, actor, and critic networks on these estimated parameters. Our empirical analysis on five diverse HIP-MDP tasks demonstrates that PrivilegedDreamer outperforms state-of-the-art model-based, model-free, and domain adaptation learning algorithms. Additionally, we conduct ablation studies to justify the inclusion of each component in the proposed architecture.

Abstract:
In this paper, we introduce a novel approach to implicitly encode precise robot morphology using forward kinematics based on a configuration space signed distance function. Our proposed Robot Neural Distance Function (RNDF) optimizes the balance between computational efficiency and accuracy for signed distance queries conditioned on the robot's configuration for each link. Compared to the baseline method, the proposed approach achieves an 81.1% reduction in distance error while utilizing only 47.6% of model parameters. Its parallelizable and differentiable nature provides direct access to joint-space derivatives, enabling a seamless connection between robot planning in Cartesian task space and configuration space. These features make RNDF an ideal surrogate model for general robot optimization and learning in 3D spatial planning tasks. Specifically, we apply RNDF to robotic arm-hand modeling and demonstrate its potential as a core platform for wholearm, collision-free grasp planning in cluttered environments. The code and model are available at https://github.com/roboticmanipulation/RNDF.

Abstract:
When performing tasks like laundry, humans naturally coordinate both hands to manipulate objects and anticipate how their actions will change the state of the clothes. However, achieving such coordination in robotics remains challenging due to the need to model object movement, predict future states, and generate precise bimanual actions. In this work, we address these challenges by infusing the predictive nature of human manipulation strategies into robot imitation learning. Specifically, we disentangle task-related state transitions from agent-specific inverse dynamics modeling to enable effective bimanual coordination. Using a demonstration dataset, we train a diffusion model to predict future states given historical observations, envisioning how the scene evolves. Then, we use an inverse dynamics model to compute robot actions that achieve the predicted states. Our key insight is that modeling object movement can help learning policies for bimanual coordination manipulation tasks. Evaluating our framework across diverse simulation and real-world manipulation setups, including multimodal goal configurations, bimanual manipulation, deformable objects, and multi-object setups, we find that it consistently outperforms state-of-the-art state-to-action mapping policies. Our method demonstrates a remarkable capacity to navigate multimodal goal configurations and action distributions, maintain stability across different control modes, and synthesize a broader range of behaviors than those present in the demonstration dataset.

Abstract:
Collaborative perception (CP) leverages visual data from connected and autonomous vehicles (CAV) to expand an ego vehicle's field of view (FoV). Despite recent progress, current CP methods do expand the ego vehicle's 360-degree perceptual range almost equally, but faces two key challenges. Firstly, in areas with uneven traffic distribution, focusing on directions with little traffic offers limited benefits. Secondly, under limited communication budgets, allocating excessive bandwidth to less critical directions lowers the perception accuracy in more vital areas. To address these issues, we propose Directed-CP, a proactive and direction-aware CP system aiming at improving CP in specific directions. Our key idea is to enable an ego vehicle to proactively signal its interested directions and readjust its attention to enhance local directional CP performance. To achieve this, we first propose an RSU-aided direction masking mechanism that assists an ego vehicle in identifying vital directions. Additionally, we design a direction-aware selective attention module to wisely aggregate pertinent features based on ego vehicle's directional priorities, communication budget, and the positional data of CAVs. Moreover, we introduce a direction-weighted detection loss (DWLoss) to capture the divergence between directional CP outcomes and the ground truth, facilitating effective model training. Extensive experiments on the V2X-Sim 2.0 dataset demonstrate that our approach achieves 19.8% higher local perception accuracy in interested directions and 2.5% higher overall perception accuracy than the state-of-the-art methods in collaborative 3D object detection tasks.

Abstract:
Recent progress in imitation learning from human demonstrations has shown promising results in teaching robots manipulation skills. To further scale up training datasets, recent works start to use portable data collection devices without the need for physical robot hardware. However, due to the absence of on-robot feedback during data collection, the data quality depends heavily on user expertise, and many devices are limited to specific robot embodiments. We propose ARCap, a portable data collection system that provides visual feedback through augmented reality (AR) and haptic warnings to guide users in collecting high-quality demonstrations. Through extensive user studies, we show that ARCap enables novice users to collect robot-executable data that matches robot kinematics and avoids collisions with the scenes. With data collected from ARCap, robots can perform challenging tasks, such as manipulation in cluttered environments and long-horizon cross-embodiment manipulation. ARCap is fully open-source and easy to calibrate; all components are built from off-the-shelf products. More details and results can be found on our website: stanford-tml.github.io/ARCap

Abstract:
We present LightStereo, a cutting-edge stereomatching network crafted to accelerate the matching process. Departing from conventional methodologies that rely on aggregating computationally intensive 4D costs, LightStereo adopts the 3D cost volume as a lightweight alternative. While similar approaches have been explored previously, our breakthrough lies in enhancing performance through a dedicated focus on the channel dimension of the 3D cost volume, where the distribution of matching costs is encapsulated. Our exhaustive exploration has yielded plenty of strategies to amplify the capacity of the pivotal dimension, ensuring both precision and efficiency. We compare the proposed LightStereo with existing state-of-the-art methods across various benchmarks, which demonstrate its superior performance in speed, accuracy, and resource utilization. LightStereo achieves a competitive EPE metric in the SceneFlow datasets while demanding a minimum of only 22 GFLOPs and 17 ms of runtime, and ranks 1st on KITTI 2015 among real-time models. Our comprehensive analysis reveals the effect of 2 D cost aggregation for stereo matching, paving the way for realworld applications of efficient stereo systems. Code is available at https://github.com/XiandaGuo/OpenStereo.

Abstract:
Teaching robots to autonomously complete everyday tasks remains a challenge. Imitation Learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for both human hands and robot end-effectors. We instantiate an IL policy called Motion Track Policy (MT- \pi ) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT- \pi completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering 6DoF trajectories via multi-view synthesis. MT- \pi achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to scenarios seen only in human videos. Code and videos are available on our website (https://portal-cornell.github.io/motion_track_policy/).

Abstract:
Recent works in the robot learning community have successfully introduced generalist models capable of controlling various robot embodiments across a wide range of tasks, such as navigation and locomotion. However, achieving agile control, which pushes the limits of robotic performance, still relies on specialist models that require extensive parameter tuning. To leverage generalist-model adaptability and flexibility while achieving specialist-level agility, we propose AnyCar, a transformer-based generalist dynamics model designed for agile control of various wheeled robots. To collect training data, we unify multiple simulators and leverage different physics backends to simulate vehicles with diverse sizes, scales, and physical properties across various terrains. With robust training and real-world fine-tuning, our model enables precise adaptation to different vehicles, even in the wild and under large state estimation errors. In real-world experiments, AnyCar shows both few-shot and zero-shot generalization across a wide range of vehicles and environments, where our model, combined with a sampling-based MPC, outperforms specialist models by up to 54%. These results represent a key step toward building a foundation model for agile wheeled robot control. AnyCar is fully open-source to support further research.

Abstract:
Autonomous vehicles and robots often struggle with reliable visual perception at night due to the low illumination and motion blur caused by the long exposure time of RGB cameras. Existing methods address this challenge by sequentially connecting the off-the-shelf pretrained lowlight enhancement and deblurring models. Unfortunately, these methods often lead to noticeable artifacts (e.g., color distortions) in the over-exposed regions or make it hardly possible to learn the motion cues of the dark regions. In this paper, we interestingly find vision-language models, e.g., Contrastive LanguageImage Pretraining (CLIP), can comprehensively perceive diverse degradation levels at night. In light of this, we propose a novel transformer-based joint learning framework, named DAP-LED, which can jointly achieve low-light enhancement and deblurring, benefiting downstream tasks, such as depth estimation, segmentation, and detection in the dark. The key insight is to leverage CLIP to adaptively learn the degradation levels from images at night. This subtly enables learning rich semantic information and visual representation for optimization of the joint tasks. To achieve this, we first introduce a CLIPguided cross-fusion module to obtain multi-scale patch-wise degradation heatmaps from the image embeddings. Then, the heatmaps are fused via the designed CLIP-enhanced transformer blocks to retain useful degradation information for effective model optimization. Experimental results show that, compared to existing methods, our DAP-LED achieves state-of-the-art performance in the dark. Meanwhile, the enhanced results are demonstrated to be effective for three downstream tasks. For demo and more results, please check the project page: https://vlislab22.github.io/dap-led/.

Abstract:
Many manipulation tasks require the robot to rearrange objects relative to one another. Such tasks can be described as a sequence of relative poses between parts of a set of rigid bodies. In this work, we propose Match Policy, a simple but novel pipeline for solving high-precision pick and place tasks. Instead of predicting actions directly, our method registers the pick and place targets to the stored demonstrations. This transfers action inference into a point cloud registration task and enables us to realize nontrivial manipulation policies without any training. Match Policy is designed to solve high-precision tasks with a key-frame setting. By leveraging the geometric interaction and the symmetries of the task, it achieves extremely high sample efficiency and generalizability to unseen configurations. We demonstrate its state-of-the-art performance across various tasks on RLbench benchmark compared with several strong baselines and test it on a real robot with six tasks. Videos and code are available on https://haojhuang.github.io/match_page/.

Abstract:
Long-term monitoring and exploration of extreme environments, such as underwater storage facilities, is costly, labor-intensive, and hazardous. Automating this process with low-cost, collaborative robots can greatly improve efficiency. These robots capture images from different positions, which must be processed simultaneously to create a spatio-temporal model of the facility. In this paper, we propose a novel approach that integrates data simulation, a multi-modal deep learning network for coordinate prediction, and image reassembly to address the challenges posed by environmental disturbances causing drift and rotation in the robots' positions and orientations. Our approach enhances the precision of alignment in noisy environments by integrating visual information from snapshots, global positional context from masks, and noisy coordinates. We validate our method through extensive experiments using synthetic data that simulate real-world robotic operations in underwater settings. The results demonstrate very high coordinate prediction accuracy and plausible image assembly, indicating the real-world applicability of our approach. The assembled images provide clear and coherent views of the underwater environment for effective monitoring and inspection, showcasing the potential for broader use in extreme settings, further contributing to improved safety, efficiency, and cost reduction in hazardous field monitoring.

Abstract:
Trajectory prediction models in autonomous driving are vulnerable to perturbations from non-causal agents whose actions should not affect the ego-agent's behavior. Such perturbations can lead to incorrect predictions of other agents' trajectories, potentially compromising the safety and efficiency of the ego-vehicle's decision-making process. Motivated by this challenge, we propose Causal tRajecTory predICtion (CRiTIC), a novel model that utilizes a causal discovery network to identify inter-agent causal relations over a window of past time steps. To incorporate discovered causal relationships, we propose a novel Causal Attention Gating mechanism to selectively filter information in the proposed Transformer- based architecture. We conduct extensive experiments on two autonomous driving benchmark datasets to evaluate the robustness of our model against non-causal perturbations and its generalization capacity. Our results indicate that the robustness of predictions can be improved by up to 54% without a significant detriment to prediction accuracy. Lastly, we demonstrate the superior domain generalizability of the proposed model, which achieves up to 29% improvement in cross-domain performance. These results underscore the potential of our model to enhance both robustness and generalization capacity for trajectory prediction in diverse autonomous driving domairis.4

Abstract:
We introduce ThermoStereoRT, a real-time thermal stereo matching method designed for all-weather conditions that recovers disparity from two rectified thermal stereo images, envisioning applications such as night-time drone surveillance or under-bed cleaning robots. Leveraging a lightweight yet powerful backbone, ThermoStereoRT constructs a 3D cost volume from thermal images and employs multi-scale attention mechanisms to produce an initial disparity map. To refine this map, we design a novel channel and spatial attention module. Addressing the challenge of sparse ground truth data in thermal imagery, we utilize knowledge distillation to boost performance without increasing computational demands. Comprehensive evaluations on multiple datasets demonstrate that ThermoStereoRT delivers both real-time capacity and robust accuracy, making it a promising solution for real-world deployment in various challenging environments. Our code will be released on https://github.com/SJTU-ViSYS-team/ThermoStereoRT.

Abstract:
Behavior cloning facilitates the learning of dexterous manipulation skills, yet the complexity of surgical environments, the difficulty and expense of obtaining patient data, and robot calibration errors present unique challenges for surgical robot learning. We provide an enhanced surgical digital twin with photorealistic human anatomical organs, integrated into a comprehensive simulator designed to generate high-quality synthetic data to solve fundamental tasks in surgical autonomy. We present SuFIA-BC: visual Behavior Cloning policies for Surgical First Interactive Autonomy Assistants. We investigate visual observation spaces including multi-view cameras and 3D visual representations extracted from a single endoscopic camera view. Through systematic evaluation, we find that the diverse set of photorealistic surgical tasks introduced in this work enables a comprehensive evaluation of prospective behavior cloning models for the unique challenges posed by surgical environments. We observe that current state-of-the-art behavior cloning techniques struggle to solve the contact-rich and complex tasks evaluated in this work, regardless of their underlying perception or control architectures. These findings highlight the importance of customizing perception pipelines and control architectures, as well as curating larger-scale synthetic datasets that meet the specific demands of surgical tasks. Project website: orbit-surgical.github.io/sufia-bc/

Abstract:
Although research has produced promising results demonstrating the utility of active inference (AIF) in Markov decision processes (MDPs), there is relatively less work that builds AIF models in the context of environments and problems that take the form of partially observable Markov decision processes (POMDPs). In POMDP scenarios, the agent must infer the unobserved environmental state from raw sensory observations, e.g., pixels in an image. Additionally, less work exists in examining the most difficult form of POMDP-centered control: continuous action space POMDPs under sparse reward signals. In this work, we address issues facing the AIF modeling paradigm by introducing novel prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous action, goal-based robotic control POMDP environments. Empirically, we show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate.

Abstract:
Imitation learning has demonstrated significant potential in performing high-precision manipulation tasks using visual feedback. However, it is common practice in imitation learning for cameras to be fixed in place, resulting in issues like occlusion and limited field of view. Furthermore, cameras are often placed in broad, general locations, without an effective viewpoint specific to the robot's task. In this work, we investigate the utility of active vision (AV) for imitation learning and manipulation, in which, in addition to the manipulation policy, the robot learns an AV policy from human demonstrations to dynamically change the robot's camera viewpoint to obtain better information about its environment and the given task. We introduce AV-ALOHA, a new bimanual teleoperation robot system with AV, an extension of the ALOHA 2 robot system, incorporating an additional 7-DoF robot arm that only carries a stereo camera and is solely tasked with finding the best viewpoint. This camera streams stereo video to an operator wearing a virtual reality (VR) headset, allowing the operator to control the camera pose using head and body movements. The system provides an immersive teleoperation experience, with bimanual first-person control, enabling the operator to dynamically explore and search the scene and simultaneously interact with the environment. We conduct imitation learning experiments of our system both in real-world and in simulation, across a variety of tasks that emphasize viewpoint planning. Our results demonstrate the effectiveness of human-guided AV for imitation learning, showing significant improvements over fixed cameras in tasks with limited visibility. Project website: https://soltanilara.github.io/av-alohal

Abstract:
The objective of this study is to enable fast and safe manipulation tasks in home environments. Specifically, we aim to develop a system that can recognize its surroundings and identify target objects while in motion, enabling it to plan and execute actions accordingly. We propose a periodic sampling-based whole-body trajectory planning method, called the “Robot Local Planner (RLP).” This method leverages unique features of home environments to enhance computational efficiency, motion optimality, and robustness against recognition and control errors, all while ensuring safety. The RLP minimizes computation time by planning with minimal waypoints and generating safe trajectories. Furthermore, overall motion optimality is improved by periodically executing trajectory planning to select more optimal motions. This approach incorporates inverse kinematics that are robust to base position errors, further enhancing robustness. Evaluation experiments demonstrated that the RLP outperformed existing methods in terms of motion planning time, motion duration, and robustness, confirming its effectiveness in home environments. Moreover, application experiments using a tidy-up task achieved high success rates and short operation times, thereby underscoring its practical feasibility.

Abstract:
Transparent object perception is indispensable for numerous robotic tasks. However, accurately segmenting and estimating the depth of transparent objects remain challenging due to complex optical properties. Existing methods primarily delve into only one task using extra inputs or specialized sensors, neglecting the valuable interactions among tasks and the subsequent refinement process, leading to suboptimal and blurry predictions. To address these issues, we propose a monocular framework, which is the first to excel in both segmentation and depth estimation of transparent objects, with only a single-image input. Specifically, we devise a novel semantic and geometric fusion module, effectively integrating the multi-scale information between tasks. In addition, drawing inspiration from human perception of objects, we further incorporate an iterative strategy, which progressively refines initial features for clearer results. Experiments on two challenging synthetic and real-world datasets demonstrate that our model surpasses state-of-the-art monocular, stereo, and multi-view methods by a large margin of about 38.8%-46.2% with only a single RGB input. Codes and models are publicly available at https://github.com/L-J-Yuan/MODEST.

Abstract:
Guide dog robots offer promising solutions to enhance mobility and safety for visually impaired individuals, addressing the limitations of traditional guide dogs, particularly in perceptual intelligence and communication. With the emergence of Vision-Language Models (VLMs), robots are now capable of generating natural language descriptions of their surroundings, aiding in safer decision-making. However, existing VLMs often struggle to accurately interpret and convey spatial relationships, which is crucial for navigation in complex environments such as street crossings. We introduce the SpaceAware Instruction Tuning (SAIT) dataset and the Space-Aware Benchmark (SA-Bench) to address the limitations of current VLMs in understanding physical environments. Our automated data generation pipeline focuses on the virtual path to the destination in 3D space and the surroundings, enhancing environmental comprehension and enabling VLMs to provide more accurate guidance to visually impaired individuals. We also propose an evaluation protocol to assess VLM effectiveness in delivering walking guidance. Comparative experiments demonstrate that our space-aware instruction-tuned model outperforms state-of-the-art algorithms. We have fully opensourced the SAIT dataset and SA-Bench, along with the related code, at https://github.com/byungokhan/Space-awareVLM.

Abstract:
Accurate localization is essential for enabling modern full self-driving services. These services heavily rely on map-based traffic information to reduce uncertainties in recognizing lane shapes, traffic light locations, and traffic signs. Achieving this level of reliance on map information requires centimeter-level localization accuracy, which is currently only achievable with LiDAR sensors. However, LiDAR is known to be vulnerable to spoofing attacks that emit malicious lasers against LiDAR to overwrite its measurements. Once localization is compromised, the attack could lead the victim off roads or make them ignore traffic lights. Motivated by these serious safety implications, we design SLAMSpoof, the first practical LiDAR spoofing attack on localization systems for self-driving to assess the actual attack significance on autonomous vehicles. SLAMSpoof can effectively find the effective attack location based on our scan matching vulnerability score (SMVS), a point-wise metric representing the potential vulnerability to spoofing attacks. To evaluate the effectiveness of the attack, we conduct real-world experiments on ground vehicles and confirm its high capability in real-world scenarios, inducing position errors of \geq 4.2 meters (more than typical lane width) for all 3 popular LiDAR-based localization algorithms. We finally discuss the potential countermeasures of this attack. Code is available at https://github.com/Keio-CSG/slamspoof.

Abstract:
Dexterous grasping is a fundamental yet challenging skill in robotic manipulation, requiring precise interaction between robotic hands and objects. In this paper, we present \mathcalD(\mathcalR, \mathcalO) Grasp, a novel framework that models the interaction between the robotic hand in its grasping pose and the object, enabling broad generalization across various robot hands and object geometries. Our model takes the robot hand's description and object point cloud as inputs and efficiently predicts kinematically valid and stable grasps, demonstrating strong adaptability to diverse robot embodiments and object geometries. Extensive experiments conducted in both simulated and real-world environments validate the effectiveness of our approach, with significant improvements in success rate, grasp diversity, and inference speed across multiple robotic hands. Our method achieves an average success rate of 87.53% in simulation in less than one second, tested across three different dexterous robotic hands. In real-world experiments using the LeapHand, the method also demonstrates an average success rate of \mathbf8 9%. \mathcalD(\mathcalR, \mathcalO) Grasp provides a robust solution for dexterous grasping in complex and varied environments. The code, appendix, and videos are available on our project website at https://nus-lins-lab.github.io/drograspweb/.

Abstract:
Knowledge of a manipulator's workspace is fundamental for a variety of tasks including robot design, grasp planning and robot base placement. Consequently, workspace representations are well studied in robotics. Two important representations are reachability maps and inverse reachability maps. The former predicts whether a given end-effector pose is reachable from where the robot currently is, and the latter suggests suitable base positions for a desired end-effector pose. Typically, the reachability map is built by discretizing the 6D space containing the robot's workspace and determining, for each cell, whether it is reachable or not. The reachability map is subsequently inverted to build the inverse map. This is a cumbersome process which restricts the applications of such maps. In this work, we exploit commonalities of existing six and seven axis robot arms to reduce the dimension of the discretization from 6D to 4D. We propose Reachability Map 4D (RM4D), a map that only requires a single 4D data structure for both forward and inverse queries. This gives a much more compact map that can be constructed by an order of magnitude faster than existing maps, with no inversion overheads and no loss in accuracy. Finally, we showcase the efficiency gains by applying RM4D for finding suitable base positions in a scenario with 800 target grasps.

Abstract:
Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting the panoptic segmentation and depth maps of unobserved future frames, from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of forecasts in a coherent manner. Furthermore, we present two baselines and propose the novel PDcast architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of PDcast across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at https://pdcast.cs.uni-freiburg.de

Abstract:
Today's tactile sensors have a variety of different designs, making it challenging to develop general-purpose methods for processing touch signals. In this paper, we learn a unified representation that captures the shared information between different tactile sensors. Unlike current approaches that focus on reconstruction or task-specific supervision, we leverage contrastive learning to integrate tactile signals from two different sensors into a shared embedding space, using a dataset in which the same objects are probed with multiple sensors. We apply this approach to paired touch signals from GelSlim and Soft Bubble sensors. We show that our learned features provide strong pretraining for downstream pose estimation and classification tasks. We also show that our embedding enables models trained using one touch sensor to be deployed using another without additional training. Project details can be found at https://www.mmintlab.com/research/cttp/.

Abstract:
Autonomous mobile robots (e.g., warehouse logistics robots) often need to traverse complex, obstacle-rich, and changing environments to reach multiple fixed goals (e.g., ware-house shelves). Traditional motion planners need to calculate the entire multi-goal path from scratch in response to changes in the environment, which results in a large consumption of computing resources. This process is not only time-consuming but also may not meet real-time requirements in application scenarios that require rapid response to environmental changes. In this paper, we provide a novel Multi-Goal Motion Memory technique11https://github.com/yuanjielu-64/MGMM_ICRA2025.git that allows sampling-based motion planners to use previous planning experiences to accelerate future multi-goal planning in changing environments. This algorithm allows robots to use previous planning experiences to accelerate future multi-goal planning in changing environments. Specifically, our approach predicts dynamically feasible trajectories and distances between goal pairs to guide the sampling process to construct a motion map, to inform Traveling Salesman Problem (TSP) solvers to compute a tour, and to efficiently produce motion plans. Experiments conducted with a vehicle and a snake-like robot in obstacle-rich environments show that the proposed Motion Memory technique can substantially accelerate planning speed by up to 90%. Furthermore, the solution quality is comparable to state-of-the-art algorithms and even better in some environments.

Abstract:
This work proposes a semi-elastic optimizationbased LiDAR-inertial state estimation method, which balances the constraints from LiDAR, IMU and consistency according to their unique characteristics, thereby imparts appropriate elasticity for current state to be optimized to the correct value, and ensure the accuracy, consistency, and robustness of state estimation. We incorporate the proposed LiDAR-inertial state estimation method into a self-developed optimizationbased LiDAR-inertial odometry (LIO) framework. Experimental results on four public datasets demonstrate that the proposed method enhances the performance of optimizationbased LiDAR-inertial state estimation. We have released the source code of this work for the development of the community.

Abstract:
The recent introduction of large language models (LLMs) has revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction in domains as varied as manipulation, locomotion, and self-driving vehicles. When viewed as a stand-alone technology, LLMs are known to be vulnerable to jailbreaking attacks, wherein mali-cious prompters elicit harmful text by bypassing LLM safety guardrails. To assess the risks of deploying LLMs in robotics, in this paper, we introduce ROBOPAIR, the first algorithm designed to jailbreak LLM-controlled robots. Unlike existing, textual attacks on LLM chatbots, Robopairelicits harmful physical actions from LLM-controlled robots, a phenomenon we experimentally demonstrate in three scenarios: (i) a white-box setting, wherein the attacker has full access to the NVID IA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-40 planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2robot dog. In each scenario and across three new datasets of harmful robotic actions, we demonstrate that ROBOPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100 % attack success rates. Our results reveal, for the first time, that the risks of jailbroken LLMs extend far beyond text generation, given the distinct possibility that jailbroken robots could cause physical damage in the real world. Indeed, our results on the U nitree G02represent the first successful jailbreak of a deployed commercial robotic system. Addressing this emerging vulnerability is critical for ensuring the safe deployment of LLMs in robotics. Additional media is available at: https://robopair.org.

Abstract:
Partially Observable Markov Decision Processes (POMDPs) provide a structured framework for decision-making under uncertainty, but their application requires efficient belief updates. Sequential Importance Resampling Particle Filters (SIRPF), also known as Bootstrap Particle Filters, are commonly used as belief updaters in large approximate POMDP solvers, but they face challenges such as particle deprivation and high computational costs as the system's state dimension grows. To address these issues, this study introduces Rao-Blackwellized POMDP (RB-POMDP) approximate solvers and outlines generic methods to apply Rao-Blackwellization in both belief updates and online planning. We compare the performance of SIRPF and Rao-Blackwellized Particle Filters (RBPF) in a simulated localization problem where an agent navigates toward a target in a GPS-denied environment using POMCPOW and RB-POMCPOW planners. Our results not only confirm that RBPFs maintain efficient belief approximations over time with fewer particles, but, more surprisingly, RBPFs combined with quadrature-based integration improve planning quality significantly compared to SIRPF-based planning under the same computational limits.

Abstract:
We introduce One-Shot Dual-Arm Imitation Learning (ODIL), which enables dual-arm robots to learn precise and coordinated everyday tasks from just a single demonstration of the task. ODIL uses a new three-stage visual servoing (3-VS) method for precise alignment between the end-effector and target object, after which replay of the demonstration trajectory is sufficient to perform the task. This is achieved without requiring prior task or object knowledge, or additional data collection and training following the single demonstration. Furthermore, we propose a new dual-arm coordination paradigm for learning dual-arm tasks from a single demonstration. ODIL was tested on a real-world dual-arm robot, demonstrating state-of-the-art performance across six precise and coordinated tasks in both 4-DoF and 6-DoF settings, and showing robustness in the presence of distractor objects and partial occlusions. Videos are available at: https://www.robot-learning.uk/one-shot-dual-arm.

Abstract:
Vision-Language MOT is a critical tracking problem that has recently garnered increasing attention. It aims to track objects based on human language commands, displacing the traditional use of templates or pre-set information from training sets in conventional tracking tasks. However, a key challenge remains in understanding why language is used for tracking, hindering further development. In this paper, we introduce Language-Guided MOT, a unified task framework, and LaMOT, a corresponding large-scale benchmark, which encompasses diverse scenarios and language descriptions and comprises 1,660 sequences from 4 different datasets. The purpose of LaMOT is to unify various Vision-Language MOT tasks while providing a standardized evaluation platform. To ensure high-quality annotations, we manually assign appropriate descriptive texts to each target in every video and conduct careful inspection and correction. To our knowledge, LaMOt is the first benchmark dedicated to Language-Guided MOT. Additionally, we propose a simple yet effective tracker, termed LaMOTer. By establishing a unified task framework, providing challenging benchmarks, and offering insights for future algorithm design and evaluation, we expect to contribute to the advancement of research in Vision-Language MOT. We will release the data at https://github.com/Nathan-Li123/LaMOT.

Abstract:
The recently developed Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown encour-aging and impressive results for visual SLAM. However, most representative methods require RGBD sensors and are only available for indoor environments. The robustness of reconstruction in largescale outdoor scenarios remains unexplored. This paper introduces a large-scale 3DGS-based visual SLAM with stereo cameras, termed LSG-SLAM. The proposed LSG-SLAM employs a multi-modality strategy to estimate prior poses under large view changes. In tracking, we introduce feature-alignment warping constraints to alleviate the adverse effects of appearance similarity in rendering losses. For the scalability of large-scale scenarios, we introduce continuous Gaussian Splatting submaps to tackle unbounded scenes with limited memory. Loops are detected between GS sub maps by place recognition and the relative pose between looped keyframes is optimized utilizing rendering and feature warping losses. After the global optimization of camera poses and Gaussian points, a structure refinement module enhances the reconstruction quality. With extensive evaluations on the EuRoc and KITTI datasets, LSG-SLAM achieves superior performance over existing Neural, 3DGS-based, and even traditional approaches. Project page: https://lsg-slam.github.io.

Abstract:
Ultra-wideband (UWB) has shown promising potential in GPS-denied localization thanks to its lightweight and drift-free characteristics, while the accuracy is limited in real scenarios due to its sensitivity to sensor arrangement and non-Gaussian pattern induced by multi-path or multi-signal interference, which commonly occurs in many typical applications like long tunnels. We introduce a novel neural fusion framework for ranging inertial odometry which involves a graph attention UWB network and a recurrent neural inertial network. Our graph net learns scene-relevant ranging patterns and adapts to any number of anchors or tags, realizing accurate positioning without calibration. Additionally, the integration of least squares and the incorporation of nominal frame enhance overall performance and scalability. The effectiveness and robustness of our methods are validated through extensive experiments on both public and self-collected datasets, spanning indoor, outdoor, and tunnel environments. The results demonstrate the superiority of our proposed IR-ULSG in handling challenging conditions, including scenarios outside the convex envelope and cases where only a single anchor is available.

Abstract:
End-to-end autonomous driving systems have recently made rapid progress, thanks to simulators such as CARLA. They can drive without infraction of common driving rules on uncongested roads but are still struggling with dense traffic scenarios. We conjecture that this occurs because it lacks understanding of the dynamics of the surrounding vehicles, caused by the absence of explicit short-term memory within the perception path of end- to-end models. To address this challenge, we revise the perception module to explicitly model temporal information, by extending it with an auxiliary task that is well-known in computer vision research: optical flow. We generate a novel benchmark using the CARLA simulator to train our model, FlowFuser, and prove its superior ability to avoid collisions with other agents on the road.

Abstract:
Video segmentation is essential for advancing robotics and autonomous driving, particularly in open-world settings where continuous perception and object association across video frames are critical. While the Segment Anything Model (SAM) has excelled in static image segmentation, extending its capabilities to video segmentation poses significant challenges. We tackle two major hurdles: a) SAM's embedding limitations in associating objects across frames, and b) granularity inconsistencies in object segmentation. To this end, we introduce VideoSAM, an end-to-end framework designed to address these challenges by improving object tracking and segmentation consistency in dynamic environments. VideoSAM integrates an agglomerated backbone, RADIO, enabling object association through similarity metrics and introduces Cycle-ack-Pairs Propagation with a memory mechanism for stable object tracking. Additionally, we incorporate an autoregressive object-token mechanism within the SAM decoder to maintain consistent granularity across frames. Our method is extensively evaluated on the UVO and BURST benchmarks, and robotic videos from RoboTAP, demonstrating its effectiveness and robustness in real-world scenarios. All codes will be available.

Abstract:
Combining manipulation with the mobility of legged robots is essential for a wide range of robotic applications. However, integrating an arm with a mobile base significantly increases the system's complexity, making precise end-effector control challenging. Existing model-based approaches are often constrained by their modeling assumptions, leading to limited robustness. Meanwhile, recent Reinforcement Learning (RL) implementations restrict the arm's workspace to be in front of the robot or track only the position to obtain decent tracking accuracy. In this work, we address these limitations by introducing a whole-body RL formulation for end-effector pose tracking in a large workspace on rough, unstructured terrains. Our proposed method involves a terrain-aware sampling strategy for the robot's initial configuration and end-effector pose commands, as well as a game-based curriculum to extend the robot's operating range. We validate our approach on the ANYmal quadrupedal robot with a six DoF robotic arm. Through our experiments, we show that the learned controller achieves precise command tracking over a large workspace and adapts across varying terrains such as stairs and slopes. On deployment, it achieves a pose-tracking error of 2.64 cm and 3.64°, outperforming existing competitive baselines. The video of our work is available at: wholebody-pose-tracking.

Abstract:
Odometry is crucial for the navigation of autonomous vehicles in unknown environments. While cameras and LiDARs are commonly used to estimate the ego-motion of a vehicle, these sensors face limitations under bad lighting and severe weather conditions. Automotive radars overcome these challenges, but radar point clouds are generally sparse and noisy, making it difficult to identify useful features within a radar scan. In this paper, we address the problem of ego-motion estimation using a single automotive radar sensor. We propose a simple, yet effective, heuristic-based method to extract the ground plane from single radar scans and perform ground plane matching between consecutive scans. Additionally, we perform a windowed factor-graph optimization of the poses together with the ground plane, improving the accuracy of the pose estimation. We put our work to the test using the 4DRadarDataset. Our findings illustrate the state-of-the-art performance of our odometry approach compared to existing alternatives that use radar point clouds.

Abstract:
Radar has gained much attention in autonomous driving due to its accessibility and robustness. However, its standalone application for depth perception is constrained by issues of sparsity and noise. Radar-camera depth estimation offers a more promising complementary solution. Despite significant progress, current approaches fail to produce satisfactory dense depth maps, due to the unsatisfactory processing of the sparse and noisy radar data. They constrain the regions of interest for radar points in rigid rectangular regions, which may introduce unexpected errors and confusions. To address these issues, we develop a structure-aware strategy for radar depth enhancement, which provides more targeted regions of interest by leveraging the structural priors of RGB images. Furthermore, we design a Multi-Scale Structure Guided Network to enhance radar features and preserve detailed structures, achieving accurate and structure-detailed dense metric depth estimation. Building on these, we propose a structure-aware radar-camera depth estimation framework, named SA-RCD. Extensive experiments demonstrate that our SA-RCD achieves state-of-the-art performance on the nuScenes dataset. Our code will be available at https://github.com/FreyZhangYeh/SA-RCD.

Abstract:
Imitation learning techniques have been shown to be highly effective in real-world control scenarios, such as robotics. However, these approaches not only suffer from compounding error issues but also require human experts to provide complete trajectories. Although there exist interactive methods where an expert oversees the robot and intervenes if needed, these extensions usually only utilize the data collected during intervention periods and ignore the feedback signal hidden in non-intervention timesteps. In this work, we create a model to formulate how the interventions occur in such cases, and show that it is possible to learn a policy with just a handful of expert interventions. Our key insight is that it is possible to get crucial information about the quality of the current state and the optimality of the chosen action from expert feedback, regardless of the presence or the absence of intervention. We evaluate our method on various discrete and continuous simulation environments, a real-world robotic manipulation task, as well as a human subject study. Videos and the code can be found at https://liralab.usc.edu/mile.

Abstract:
Understanding a scene in terms of objects and their properties is fundamental for various vision-based robotic applications, including item picking. To effectively clear a bin, a robot must comprehend objects as graspable entities, often without prior access to models of the target object. This study focuses on open world object segmentation with the additional requirement of assigning identical class labels for repeated instances of the same object. This capability enables item picking tasks with homogeneous bins, filtering out packaging material, and sorting tasks. We propose a novel pipeline for detecting repeated instances of identical objects, building on recent advancements in vision foundation models and exploring approaches for estimating object similarities based on feature embeddings or keypoint correspondence matching. Through a comprehensive experimental evaluation, we establish a new state-of-the-art on ARMBench repeated objects segmentation, a particularly challenging open problem in bin-picking robotics. Additionally, we demonstrate the real-world application of our method integrated into a robot picking cell to showcase its relevance to industrial use cases.

Abstract:
Underwater robotic manipulation faces significant challenges due to complex fluid dynamics and unstructured environments, causing most manipulation systems to rely heavily on human teleoperation. In this paper, we introduce AquaBot, a fully autonomous manipulation system that combines behavior cloning from human demonstrations with self-learning optimization to improve beyond human teleoperation performance. With extensive real-world experiments, we demonstrate AquaBot's versatility across diverse manipulation tasks, including object grasping, trash sorting, and rescue retrieval. Our real-world experiments show that AquaBot's self-optimized policy outperforms a human operator by 41% in speed. AquaBot represents a promising step towards autonomous and self-improving underwater manipulation systems. We will open-source both hardware and software implementation details.

Abstract:
Swarm robotics focuses on designing and coordinating large groups of relatively simple robots to perform tasks in a decentralised and collective manner. The swarm provides a resilient and flexible solution for many applications. However, contemporary swarm robots have a significant power problem in that secondary (i.e. rechargeable) batteries are slow to charge and offer lifetimes of only a few years, increasing maintenance costs and pollution due to battery replacement. We imagine a different future, wherein battery-free robots powered by supercapacitors can be recharged in seconds, offer long-life autonomous operation and can rapidly pass charge between one another using trophallaxis. In pursuit of this vision, we contribute the CapBot, a battery-free swarm robot equipped with Mecanum wheels, a Cortex M4F application processor and Bluetooth Low Energy networking. The CapBot fully recharges in 16 s, offers 51 min of autonomous operation at top speed, and can transfer up to 50 % of its available charge to a peer via trophallaxis in under 20 s. The CapBot is fully open source and all software and hardware source is available online.

Abstract:
Energy-efficient visual-inertial motion tracking on SWAP-constrained edge devices (e.g., drones and AR glasses) is essential but challenging. Our previous work [1] introduced the first-of-its-kind quantized visual-inertial odometry (QVIO), utilizing either raw measurement quantization (zQVIO) or single-bit residual quantization (rQVIO). While QVIO has demonstrated significant data transfer reduction with competitive performance, it has limitations. Specifically, zQVIO directly quantizes raw measurements into multi-bit values, while requiring the ad-hoc inflation of measurement noise to account for quantization errors. On the other hand, rQVIO is limited to single-bit measurement with certain accuracy loss. This work introduces QVIO2 to address these issues. The proposed QVIO2 improves data quantization strategies and derives a Maximum A Posteriori (MAP) quantized estimator that rigorously handles both multi-bit and single-bit, raw and residual quantized measurements in a unified manner. These improvements lead to more communication-efficient and accurate systems. Additionally, we optimize the communication protocol to further reduce data transfer by eliminating unnecessary transmissions. Extensive numerical and experimental results demonstrate reduced communication requirements and improved accuracy. Compared to the previous QVIO system, zQVIO2 achieves the same accuracy with a 30 % reduction in data transfer, while rQVIO2 improves accuracy without increasing data communication. In real-world scenarios, our new zQVIO2 and rQVIO2 have demonstrated nearly no accuracy loss with only 4.6 bits and 3.5 bits of data communication, achieving compression rates of 7 × and 9.1 ×.

Abstract:
In autonomous driving, motion prediction aims at forecasting the future trajectories of nearby agents, helping the ego vehicle to anticipate behaviors and drive safely. A key challenge is generating a diverse set of future predictions, commonly addressed using data-driven models with Multiple Choice Learning (MCL) architectures and Winner-Takes-All (WTA) training objectives. However, these methods face initialization sensitivity and training instabilities. Additionally, to compensate for limited performance, some approaches rely on training with a large set of hypotheses, requiring a post-selection step during inference to significantly reduce the number of predictions. To tackle these issues, we take inspiration from annealed MCL, a recently introduced technique that improves the convergence properties of MCL methods through an annealed Winner-Takes-All loss (aWTA). In this paper, we demonstrate how the aWTA loss can be integrated with state-of-the-art motion forecasting models to enhance their performance using only a minimal set of hypotheses, eliminating the need for the cumbersome post-selection step. Our approach can be easily incorporated into any trajectory prediction model normally trained using WTA and yields significant improvements. To facilitate the application of our approach to future motion forecasting models, the code is made publicly available: https://github.com/valeoai/MF_aWTA.

Abstract:
In this paper, we tackle the novel computer vision problem of depth estimation through a translucent barrier. This is an important problem for robotics when manipulating objects through plastic wrapping, or when predicting the depth of items behind a translucent barrier for manipulation. We propose two approaches for providing depth prediction models the ability to see through translucent barriers: removing translucent barriers through image inpainting before passing to standard depth prediction models as input, and directly training depth models with images with translucent barriers. We show that image inpainting allows standard learned monocular and stereo depth estimation models to achieve 3 cm MAE for predicting depth of shelved items behind plastic, whereas training with real images with translucent barriers allows them to achieve centimeter or sub-centimeter MAE. We demonstrate in real robot experiments that depth-aided space estimation allows the robot to place 46 % additional items into shelves with translucent barriers. This paper also provides a publicly available dataset of objects occluded by translucent barriers in a tabletop environment and a shelf environment which will allow others to contribute to this novel problem that's critical for many robotic manipulation applications including suction gripping and item packing (available at https://sites.google.com/view/vulcan-depth-estimation).

Abstract:
Deep neural networks (DNNs) have gained significant attention in 3D object reconstruction. However, detecting and reconstructing hidden or buried objects underground remains a challenging task. Ground Penetrating Radar (GPR) has emerged as a cost-effective and non-destructive technology for subsurface object detection, including soil structures and pipelines. In this study, we present a deep convolutional neural network-based method for detecting target signals and performing curve parameter regression using multiple B-scans from GPR data. By leveraging the detection and regression outcomes, we further generate fitted curves that represent underground structures. To reconstruct a comprehensive and detailed 3D root structure, we design a shape reconstruction network that takes sparse sliced 3D points as input. The proposed approach is extensively trained and validated using synthetic 3D root datasets and simulated GPR data generated with gprMax. Additionally, the trained model demonstrates strong generalization capabilities when applied to real-world GPR data, ensuring its practical applicability.

Abstract:
In this paper, we present PTZ-Calib, a robust two-stage PTZ camera calibration method, that efficiently and accurately estimates camera parameters for arbitrary viewpoints. Our method includes an offline and an online stage. In the offline stage, we first uniformly select a set of reference images that sufficiently overlap to encompass a complete 360° view. We then utilize the novel PTZ-IBA (PTZ Incremental Bundle Adjustment) algorithm to automatically calibrate the cameras within a local coordinate system. Additionally, for practical application, we can further optimize camera parameters and align them with the geographic coordinate system using extra global reference 3D information. In the online stage, we formulate the calibration of any new viewpoints as a relocalization problem. Our approach balances the accuracy and computational efficiency to meet real-world demands. Extensive evaluations demonstrate our robustness and superior performance over state-of-the-art methods on various real and synthetic datasets. Datasets and source code can be accessed online at https://github.com/gjgjh/PTZ-Calib

Abstract:
This paper introduces LiGSM, a novel LiDARenhanced 3D Gaussian Splatting (3DGS) mapping framework that improves the accuracy and robustness of 3D scene mapping by integrating LiDAR data. LiGSM constructs joint loss from images and LiDAR point clouds to estimate the poses and optimize their extrinsic parameters, enabling dynamic adaptation to variations in sensor alignment. Furthermore, it leverages LiDAR point clouds to initialize 3DGS, providing a denser and more reliable starting points compared to sparse SfM points. In scene rendering, the framework augments standard image-based supervision with depth maps generated from LiDAR projections, ensuring an accurate scene representation in both geometry and photometry. Experiments on public and self-collected datasets demonstrate that LiGSM outperforms comparative methods in pose tracking and scene rendering.

Abstract:
In vision-based robot localization and SLAM, Visual Place Recognition (VPR) is essential. This paper addresses the problem of VPR, which involves accurately recognizing the location corresponding to a given query image. A popular approach to vision-based place recognition relies on low-level visual features. Despite significant progress in recent years, place recognition based on low-level visual features is challenging when there are changes in scene appearance. To address this, end-to-end training approaches have been proposed to overcome the limitations of hand-crafted features. However, these approaches still fail under drastic changes and require large amounts of labeled data to train models, presenting a significant limitation. Methods that leverage high-level semantic information, such as objects or categories, have been proposed to handle variations in appearance. In this paper, we introduce a novel VPR approach that remains robust to scene changes and does not require additional training. Our method constructs semantic image descriptors by extracting pixel-level embeddings using a zero-shot, language-driven semantic segmentation model. We validate our approach in challenging place recognition scenarios using real-world public dataset. The experiments demonstrate that our method outperforms non-learned image representation techniques and off-the-shelf convolutional neural network (CNN) descriptors. Our code is available at https://github.com/woo-soojin/context-based-vlpr.

Abstract:
Room segmentation plays a significant role in scene understanding, semantic mapping, and scene coverage for robots navigating in real-world indoor environments. However, most previous works take a passive segmentation that requires a complete and uncluttered grid map as input, often resulting in lower segmentation accuracy and cannot be deployed in unknown environments. In this paper, we propose an active room segmentation framework that can enable a robot to incrementally and autonomously perform room segmentation in cluttered indoor environments. Our framework consists of three key components: i) a door extraction module where a visual semantic feature, specifically, door, is extracted to better identify rooms in cluttered environments, ii) a within-room exploration module that detects frontiers within the currently exploring room, and iii) a topological module that represents connectivity between rooms and determines next room for exploration. We show through experiments that the proposed method depicts two distinct advantages against existing methods in segmentation accuracy and autonomy. The code is available at https://github.com/FreeformRobotics/Active_room_segmentation.

Abstract:
Lifelong mapping is crucial for the long-term deployment of robots in dynamic environments. In this paper, we present ELite, an ephemerality-aided LiDAR-based lifelong mapping framework which can seamlessly align multiple session data, remove dynamic objects, and update maps in an end-toend fashion. Map elements are typically classified as static or dynamic, but cases like parked cars indicate the need for more detailed categories than binary. Central to our approach is the probabilistic modeling of the world into two-stage ephemerality, which represent the transiency of points in the map within two different time scales. By leveraging the spatiotemporal context encoded in ephemeralities, ELite can accurately infer transient map elements, maintain a reliable up-to-date static map, and improve robustness in aligning the new data in a more finegrained manner. Extensive real-world experiments on long-term datasets demonstrate the robustness and effectiveness of our system. The source code is publicly available for the robotics community: https://github.com/dongjae0107/ELite.

Abstract:
Trajectory prediction, the task of forecasting future agent behavior from past data, is central to safe and efficient autonomous driving. A diverse set of methods (e.g., rule-based or learned with different architectures and datasets) have been proposed, yet it is often the case that the performance of these methods is sensitive to the deployment environment (e.g., how well the design rules model the environment, or how accurately the test data match the training data). Building upon the principled theory of online convex optimization but also going beyond convexity and stationarity, we present a lightweight and model-agnostic method to aggregate different trajectory predictors online. We propose treating each individual trajectory predictor as an “expert” and maintaining a probability vector to mix the outputs of different experts. Then, the key technical approach lies in leveraging online data - the true agent behavior to be revealed at the next timestepto form a convex-or-nonconvex, stationary-or-dynamic loss function whose gradient steers the probability vector towards choosing the best mixture of experts. We instantiate this method to aggregate trajectory predictors trained on different cities in the nuScenes dataset and show that it performs just as well, if not better than, any singular model, even when deployed on the out-of-distribution LYFT dataset.

Abstract:
Radar has become an essential sensor for autonomous navigation, especially in challenging environments where camera and LiDAR sensors fail. 4D single-chip millimeter-wave radar systems, in particular, have drawn increasing attention thanks to their ability to provide spatial and Doppler information with low hardware cost and power consumption. However, most single-chip radar systems using traditional signal processing, such as Fast Fourier Transform, suffer from limited spatial resolution in radar detection, significantly limiting the performance of radar-based odometry and Simultaneous Localization and Mapping (SLAM) systems. In this paper, we develop a novel radar signal processing pipeline that integrates spatial domain beamforming techniques, and extend it to 3D Direction of Arrival estimation. Experiments using public datasets are conducted to evaluate and compare the performance of our proposed signal processing pipeline against traditional methodologies. These tests specifically focus on assessing structural precision across diverse scenes and measuring odometry accuracy in different radar odometry systems. This research demonstrates the feasibility of achieving more accurate radar odometry by simply replacing the standard FFT-based processing with the proposed pipeline. The codes are available at GitHubhttps://github.com/SenseRoboticsLab/DBE-Radar.

Abstract:
Reliable simultaneous localization and mapping (SLAM) algorithms are necessary for safety-critical autonomous navigation. In the communication-constrained multi-agent setting, navigation systems increasingly use point-to-point range sensors as they afford measurements with low bandwidth requirements and known data association. The state estimation problem for these systems takes the form of range-aided (RA) SLAM. However, distributed algorithms for solving the RA-SLAM problem lack formal guarantees on the quality of the returned estimate. To this end, we present the first distributed algorithm for RA-SLAM that can efficiently recover certifiably globally optimal solutions. Our algorithm, distributed certifiably correct RA-SLAM (DCORA), achieves this via the Riemannian Staircase method, where computational procedures developed for distributed certifiably correct pose graph optimization are generalized to the RA-SLAM problem. We demonstrate DCORA's efficacy on real-world multi-agent datasets by achieving absolute trajectory errors comparable to those of a state-of-the-art centralized certifiably correct RA-SLAM algorithm. Additionally, we perform a parametric study on the structure of the RA-SLAM problem using synthetic data, revealing how common parameters affect DCORA's performance.

Abstract:
This paper proposes SOLVR, a unified pipeline for learning based LiDAR-Visual re-localisation which performs place recognition and 6-DoF registration across sensor modalities. We propose a strategy to align the input sensor modalities by leveraging stereo image streams to produce metric depth predictions with pose information, followed by fusing multiple scene views from a local window using a probabilistic occupancy framework to expand the limited field-of-view of the camera. Additionally, SOLVR adopts a flexible definition of what constitutes positive examples for different training losses, allowing us to simultaneously optimise place recognition and registration performance. Furthermore, we replace RANSAC with a registration function that weights a simple least-squares fitting with the estimated inlier likelihood of sparse keypoint correspondences, improving performance in scenarios with a low inlier ratio between the query and retrieved place. Our experiments on the KITTI and KITTI360 datasets show that SOLVR achieves state-of-the-art performance for LiDAR-Visual place recognition and registration, particularly improving registration accuracy over larger distances between the query and retrieved place.

Abstract:
Object navigation in multi-floor environments presents a formidable challenge in robotics, requiring sophisticated spatial reasoning and adaptive exploration strategies. Traditional approaches have primarily focused on single-floor scenarios, overlooking the complexities introduced by multi-floor structures. To address these challenges, we first propose a Multi-floor Navigation Policy (MFNP) and implement it in Zero-Shot object navigation tasks. Our framework comprises three key components: (i) Multi-floor Navigation Policy, which enables an agent to explore across multiple floors; (ii) Multi-modal Large Language Models (MLLMs) for reasoning in the navigation process; and (iii) Inter-Floor Navigation, ensuring efficient floor transitions. We evaluate MFNP on the Habitat-Matterport 3D (HM3D) and Matterport 3D (MP3D) datasets, both include multi-floor scenes. Our experiment results demonstrate that MFNP significantly outperforms all the existing methods in Zero-Shot object navigation, achieving higher success rates and improved exploration efficiency. Ablation studies further highlight the effectiveness of each component in addressing the unique challenges of multi-floor navigation. Meanwhile, we conducted real-world experiments to evaluate the feasibility of our policy. Upon deployment of MFNP, the Unitree quadruped robot demonstrated successful multi-floor navigation and found the target object in a completely unseen environment. By introducing MFNP, we offer a new paradigm for tackling complex, multi-floor environments in object navigation tasks, opening avenues for future research in vision-based navigation in realistic, multi-floor settings.

Abstract:
Accurate tracking of tissues and instruments in videos is crucial for Robotic-Assisted Minimally Invasive Surgery (RAMIS), as it enables the robot to comprehend the surgical scene with precise locations and interactions of tissues and tools. Traditional keypoint-based sparse tracking is limited by featured points, while flow-based dense two-view matching suffers from long-term drifts. Recently, the Tracking Any Point (TAP) algorithm was proposed to overcome these limitations and achieve dense accurate long-term tracking. However, its efficacy in surgical scenarios remains untested, largely due to the lack of a comprehensive surgical tracking dataset for evaluation. To address this gap, we introduce a new annotated surgical tracking dataset for benchmarking tracking methods for surgical scenarios, comprising real-world surgical videos with complex tissue and instrument motions. We extensively evaluate state-of-the-art (SOTA) TAP-based algorithms on this dataset and reveal their limitations in challenging surgical scenarios, including fast instrument motion, severe occlusions, and motion blur, etc. Furthermore, we propose a new tracking method, namely SurgMotion, to solve the challenges and further improve the tracking performance. Our proposed method outperforms most TAP-based algorithms in surgical instruments tracking, and especially demonstrates significant improvements over baselines in challenging medical videos. Our code and dataset are available at https://github.com/zhanbh1019/SurgicalMotion.

Abstract:
We present a novel second-order trajectory optimization algorithm based on Stein Variational Newton's Method and Maximum Entropy Differential Dynamic Programming. The proposed algorithm, called Stein Variational Differential Dynamic Programming, is a kernel-based extension of Maximum Entropy Differential Dynamic Programming that combines the best of the two worlds of sampling-based and gradient-based optimization. The resulting algorithm avoids known drawbacks of gradient-based dynamic optimization in terms of getting stuck at local minima, while it overcomes limitations of sampling-based stochastic optimization in terms of introducing undesirable stochasticity when applied in online fashion. To test the efficacy of the proposed algorithm, experiments are conducted in Model Predictive Control mode. The experiments include comparisons with unimodal and multimodal Maximum Entropy Differential Dynamic Programming as well as Model Predictive Path Integral Control and its multimodal and Stein Variational extensions. The results demonstrate the superior performance of the proposed algorithms and confirm the hypothesis that there is a middle ground between sampling-and gradient-based optimization that is indeed beneficial for dynamic optimization.

Abstract:
Gesture-based communication is a standard underwater communication strategy that is taught to divers as part of their regular diver training and it would seem a natural mechanism to leverage for diver to robot communication underwater. Enabling an unmanned underwater vehicle (UUV) to understand such sequences would involve having the robot learn the large set of gestures that divers use and the way they are combined. As perfect transcription of gestures is unlikely, the communication process also requires an error-correcting framework to ensure that communication is clear and correct. Here we describe an interactive process that provides this infrastructure. A weakly supervised transfer learning approach is used to recognize standard SCUBA gestures in individual video frames and within a Sim2Real process to train a LSTM to recognize gesture sequences. This process is placed within a per-gesture and per-sequence interaction process to assist and confirm the recognition of individual gestures and to confirm entire gesture sequences. Individual aspects of this process and complete end-to-end operation are demonstrated using an unmanned underwater vehicle.

Abstract:
Interactive segmentation has an important role in facilitating the annotation process of future LiDAR datasets. Existing approaches sequentially segment individual objects at each LiDAR scan, repeating the process throughout the entire sequence, which is redundant and ineffective. In this work, we propose interactive 4D segmentation, a new paradigm that allows segmenting multiple objects on multiple LiDAR scans simultaneously, and Interactive4D, the first interactive 4D segmentation model that segments multiple objects on superimposed consecutive LiDAR scans in a single iteration by utilizing the sequential nature of LiDAR data. While performing interactive segmentation, our model leverages the entire spacetime volume, leading to more efficient segmentation. Operating on the 4D volume, it directly provides consistent instance IDs over time and also simplifies tracking annotations. Moreover, we show that click simulations are crucial for successful model training on LiDAR point clouds. To this end, we design a click simulation strategy that is better suited for the characteristics of LiDAR data. To demonstrate its accuracy and effectiveness, we evaluate Interactive4D on multiple LiDAR datasets, where Interactive4D achieves a new state-of-the-art by a large margin. We publicly release the code and models at https://vision.rwth-aachen.de/Interactive4D.

Abstract:
This paper explores decision-making processes in robotic systems tasked with reconstructing scalar fields through sensing in uncertain environments. Each robot must handle noisy perception and operate within specific environmental and physical constraints. The complexity increases in multiagent scenarios, where robots must not only plan their actions but also anticipate the movements and strategies of other agents. Effective coordination is crucial to prevent collisions and minimize redundant tasks. To address this challenge, we propose an online, distributed multi-robot sampling algorithm that combines Monte Carlo Tree Search (MCTS) with Gaussian regression. In this approach, each robot iteratively selects its next sampling point while exchanging limited information with other robots and predicting their future actions. Predictions about other robots future actions are computed with a MCTS that is recomputed at each iteration to incorporate all information collected up to that point. We evaluate the performance of our method across diverse environments and team sizes, comparing it to algorithmic alternatives.

Abstract:
Recently, gravity has been highlighted as a crucial constraint for state estimation to alleviate potential vertical drift. Existing online gravity estimation methods rely on pose estimation combined with IMU measurements, which is considered best practice when direct velocity measurements are unavailable. However, with radar sensors providing direct velocity data-a measurement not yet utilized for gravity estimation-we found a significant opportunity to improve gravity estimation accuracy substantially. GaRLIO, the proposed gravity-enhanced Radar-LiDAR-Inertial Odometry, can robustly predict gravity to reduce vertical drift while simultaneously enhancing state estimation performance using pointwise velocity measurements. Furthermore, GaRLIO ensures robustness in dynamic environments by utilizing radar to remove dynamic objects from LiDAR point clouds. Our method is validated through experiments in various environments prone to vertical drift, demonstrating superior performance compared to traditional LiDAR-Inertial Odometry methods. We make our source code publicly available to encourage further research and development. https://github.com/ChiyunNoh/GaRLIO

Affiliations: Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Li Auto Inc., Shanghai, China; College of Land and Environment, Shenyang Agricultural University, Shenyang, China; School of Transportation, Southeast University, Nanjing, China; College of Libera Arts, University of Minnesota, Twin Cities, Minneapolis, MN, USA; Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China

Abstract:
Advancements in intelligent technologies have significantly improved navigation in complex traffic environments by enhancing environment perception and trajectory prediction for automated vehicles. However, current research often overlooks the joint reasoning of scenario agents and lacks explainability in trajectory prediction models, limiting their practical use in real-world situations. To address this, we introduce the Explainable Conditional Diffusion-based Multimodal Trajectory Prediction (DMTP) model, which is designed to elucidate the environmental factors influencing predictions and reveal the underlying mechanisms. Our model integrates a modified conditional diffusion approach to capture multimodal trajectory patterns and employs a revised Shapley Value model to assess the significance of global and scenario-specific features. Experiments using the Waymo Open Motion Dataset demonstrate that our explainable model excels in identifying critical inputs and significantly outperforms baseline models in accuracy. Moreover, the factors identified align with the human driving experience, underscoring the model's effectiveness in learning accurate predictions. Code is available in our open-source repository: https://github.com/ocean-luna/Explainable-Prediction.

Abstract:
Neural Radiance Fields (NeRFs) have become a leading technique for novel view synthesis, with promising applications in robotics. However, due to shape-radiance ambiguity, NeRFs often require additional depth inputs for regularization in outdoor scenarios. LiDAR provides accurate depth measurements, but current methods typically combine only a few frames, resulting in sparse depth maps and discrepancies with camera images. The asynchronous nature of LiDAR, where each point is captured at a different timestamp, introduces depth inaccuracies when treated as simultaneous. These errors, along with inherent LiDAR noise, create inconsistencies that hinder reconstruction accuracy. To address these challenges, we propose a continuous-time framework for joint Camera-LiDAR optimization, enabling more consistent radiance field reconstruction and improving both view synthesis and geometric accuracy. To address these issues, we introduce a continuoustime framework for joint Camera-LiDAR optimization, aiming to consistently reconstruct the radiance field for better view synthesis and geometric accuracy.

Abstract:
Legged robots with advanced manipulation capabilities have the potential to significantly improve household duties and urban maintenance. Despite considerable progress in developing robust locomotion and precise manipulation methods, seamlessly integrating these into cohesive whole-body control for real-world applications remains challenging. In this paper, we present a modular framework for robust and generalizable whole-body loco-manipulation controller based on a single arm-mounted camera. By using reinforcement learning (RL), we enable a robust low-level policy for command execution over 5 dimensions (5D) and a grasp-aware high-level policy guided by a novel metric, Generalized Oriented Reachability Map (GORM). The proposed system achieves state-of-the-art one-time grasping accuracy of 89% in real world, including challenging tasks such as grasping transparent objects. Through extensive simulations and real-world experiments, we demonstrate that our system can effectively manage a large workspace, from floor level to above body height, and perform diverse whole-body loco-manipulation tasks. See our robot at work: quadwbg.github.io.

Abstract:
Finding an high-quality solution for the tabletop object rearrangement planning is a challenging problem. Compared to determining a goal arrangement [1], rearrangement planning is challenging due to the dependencies between objects and the buffer capacity available to hold objects. Although [3] has proposed an A based searching strategy with lazy evaluation for the high-quality solution, it is not scalable, with the success rate decreasing as the number of objects increases. To overcome this limitation, we propose an enhanced A-based algorithm that improves state representation and employs incremental goal attempts with lazy evaluation at each iteration. This approach aims to enhance scalability while maintaining solution quality. Our evaluation demonstrates that our algorithm can provide superior solutions compared to [3], in a shorter time, for both stationary and mobile robots.

Abstract:
Robotic search and rescue, exploration, and inspection require trajectory planning across a variety of domains. A popular approach to trajectory planning for these types of missions is ergodic search, which biases a trajectory to spend time in parts of the exploration domain that are believed to contain more information. Most prior work on ergodic search has been limited to searching simple surfaces, like a 2D Euclidean plane or a sphere, as they rely on projecting functions defined on the exploration domain onto analytically obtained Fourier basis functions. In this paper, we extend ergodic search to any surface that can be approximated by a triangle mesh. The basis functions are approximated through finite element methods on a triangle mesh of the domain. We formally prove that this approximation converges to the continuous case as the mesh approximation converges to the true domain. We demonstrate that on domains where analytical basis functions are available (plane, sphere), the proposed method obtains equivalent results, and while on other domains (torus, bunny, wind turbine), the approach is versatile enough to still search effectively. Lastly, we also compare with an existing ergodic search technique that can handle complex domains and show that our method results in a higher quality exploration.

Abstract:
Imitation learning has been proven effective in mimicking demonstrations across various robotic manipulation tasks. However, to develop robust policies, current imitation methods, such as diffusion policy, require training on extensive demonstrations, making data collection labor-intensive. In contrast, model-based planning with dynamics models can effectively cover a sufficient range of configurations using only off-policy data. Yet, without the guidance of expert demonstrations, many tasks are difficult and time-consuming to plan using the dynamics models. Therefore, we take the best of both model learning and imitation learning, and propose neural dynamics augmented imitation learning that covers a large scene configurations with few-shot demonstrations. This method trains a robust diffusion policy in a local support region using few-shot demonstrations and rearranges objects outside this region into it using offline-trained neural dynamics models. Extensive experiments across various tasks in both simulations and real-world scenarios, including granular manipulation, contact-rich task and multi-object interaction task, have demonstrated that trained with only 1 to 30 demonstrations, our proposed method can robustly cover a significantly larger area than the policy trained purely from the demonstrations. Our project page is available at: https://dynamics-dp.github.io.

Abstract:
This paper introduces affordance-based explanations of robot navigational decisions. The rationale behind affordance-based explanations draws on the theory of affordances, a principle rooted in ecological psychology that describes potential actions the objects in the environment offer to the robot. We demonstrate how affordances can be incorporated into visual and textual explanations for common robot navigation and path-planning scenarios. Furthermore, we formalize and categorize the concept of affordance-based explanations and connect it to existing explanation types in robotics. We present the results of a user study that shows participants to be, on average, highly satisfied with visual-textual, i.e., multimodal, affordance-based explanations of robot navigation. Furthermore, we investigate the complexity of different types of textual affordance-based explanations. Our research contributes to the expanding domain of explainable robotics, focusing on explaining robot actions in navigation.

Abstract:
When reward functions are hand-designed, deep reinforcement learning algorithms often suffer from reward misspecification, causing them to learn suboptimal policies in terms of the intended task objectives. In the single-agent case, inverse reinforcement learning (IRL) techniques attempt to address this issue by inferring the reward function from expert demonstrations. However, in multi-agent problems, misalignment between the learned and true objectives is exacerbated due to increased environment non-stationarity and variance that scales with multiple agents. As such, in multi-agent general-sum games, multi-agent IRL algorithms have difficulty balancing cooperative and competitive objectives. To address these issues, we propose Multi-Agent Marginal Q-Learning from Demon-strations (MAMQL), a novel sample-efficient framework for multi-agent IRL. For each agent, MAMQL learns a critic marginalized over the other agents' policies, allowing for a well-motivated use of Boltzmann policies in the multi-agent context. We identify a connection between optimal marginalized critics and single-agent soft-Q IRL, allowing us to apply a direct, simple optimization criterion from the single-agent domain. Across our experiments on three different simulated domains, MAMQL significantly outperforms previous multi-agent methods in average reward, sample efficiency, and reward recovery by often more than 2-5x. We make our code available at https://sites.google.com/view/mamq/.

Abstract:
Traditionally, unmanned aerial vehicles (UAVs) rely on CMOS-based cameras to collect images about the world below. One of the most successful applications of UAVs is to generate orthomosaics or orthomaps, in which a series of images are integrated to develop a larger map. However, using CMOS-based cameras with global or rolling shutters means that orthomaps are vulnerable to challenging light conditions, motion blur, and high-speed motion of independently moving objects (IMOs) under the camera. Event cameras are less sensitive to these issues, as their pixels trigger asynchronously on brightness changes. This work introduces the first orthomosaic approach using event cameras. We focus on addressing high-dynamic range and low-light problems in orthomosaics. In contrast to existing methods relying only on CMOS cameras, our approach enables map generation even in challenging light conditions, including direct sunlight and after sunset. The source code for EvMAPPER, the high-altitude hardware, and the dataset collected in this paper are available open source11https://evmapper.fcladera.com.

Abstract:
This work presents a novel data-driven path planning algorithm named Instruction-Guided Probabilistic Roadmap (IG-PRM). Despite the recent development and widespread use of mobile robot navigation, the safe and effective travels of mobile robots still require significant engineering effort to take into account the constraints of robots and their tasks. With IG-PRM, we aim to address this problem by allowing robot operators to specify such constraints through natural language instructions, such as “aim for wider paths” or “mind small gaps”. The key idea is to convert such instructions into embedding vectors using large-language models (LLMs) and use the vectors as a condition to predict instruction-guided cost maps from occupancy maps. By constructing a roadmap based on the predicted costs, we can find instruction-guided paths via the standard shortest path search. Experimental results demonstrate the effectiveness of our approach on both synthetic and real-world indoor navigation environments.

Abstract:
Neural motion planners can increase motion planning quality and, by reducing collision detection computations, improve runtime. However, when profiled on an accelerator-rich hardware system, neural planning contributes to more than 50% of the runtime, and 33% of the computation energy consumption, motivating the design of compute- and energy-efficient neural planners. In this work, we propose a neural planner using Binary Encoded Labels (BEL), where a set of binary classifiers are used instead of a typical regression network. Compared to conventional regression-based neural planners, the proposed BEL neural planner reduces neural planning (inference) computation and collision detection checks while maintaining equal or higher motion planning success rate across various motion planning benchmarks. This computation reduction can improve the computation energy efficiency of neural planning by 1.4 ×-21.4 ×. Finally, we demonstrate the trade-offs between collision detection and neural planning computation to maximize energy efficiency for different hardware configurations.

Abstract:
Accurate prediction of pedestrian trajectories is crucial as autonomous vehicles become more prevalent on roads. The dynamic nature of urban environments and the less predictable behavior of pedestrians present significant challenges in developing reliable prediction models. Earlier methods relying on recurrent neural networks (RNNs) and long-shortterm memory (LSTM) networks have shown promise, but often fail to fully take advantage of the rich visual and contextual information available in real-world scenarios. Recent advances in vision-language models (VLMs) offer new opportunities to improve pedestrian trajectory prediction by incorporating multimodal reasoning capabilities. This paper introduces a novel approach that uses a powerful pre-trained VLM to improve the estimation of pedestrian trajectories. Specifically, we first enable learning of semantically useful scene context and high-level reasoning features via vision-language model finetuning on specific prompts using road scenes with pedestrians. Next, with the learned VLM features and the pedestrian's past trajectory history, we predict future trajectories using an encoder-decoder head. Through experiments with first-person datasets JAAD and PIE, we show that utilizing visual-linguistic semantics via a pretrained vision-language model outperforms previous methods in both deterministic and stochastic trajectory prediction setups.

Abstract:
3D multi-object tracking plays a critical role in autonomous driving by enabling the real-time monitoring and prediction of multiple objects' movements. Traditional 3D tracking systems are typically constrained by predefined object categories, limiting their adaptability to novel, unseen objects in dynamic environments. To address this limitation, we introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories. We formulate the problem of open-vocabulary 3D tracking and introduce dataset splits designed to represent various open-vocabulary scenarios. We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes. Our method effectively reduces the performance gap between tracking known and novel objects through strategic adaptation. Experimental results demonstrate the robustness and adaptability of our method in diverse outdoor driving scenarios. To the best of our knowledge, this work is the first to address open-vocabulary 3D tracking, presenting a significant advancement for autonomous systems in real-world settings. Code, trained models, and dataset splits are available at https://github.com/ayesha-ishaq/Open3DTrack.

Abstract:
The reward function is an essential component in robot learning. Reward directly affects the sample and computational complexity of learning, and the quality of a solution. The design of informative rewards requires domain knowledge, which is not always available. We use the properties of the dynamics to produce system-appropriate reward without adding external assumptions. Specifically, we explore an approach to utilize the Lyapunov exponents of the system dynamics to generate a system-immanent reward. We demonstrate that the Sum of the Positive Lyapunov Exponents (SuPLE) is a strong candidate for the design of such a reward. We develop a computational framework for the derivation of this reward, and demonstrate its effectiveness on classical benchmarks for sample-based stabilization of various dynamical systems. It eliminates the need to start the training trajectories at arbitrary states, also known as auxiliary exploration. While the latter is a common practice in simulated robot learning, it is unpractical to consider to use it in real robotic systems, since they typically start from natural rest states such as a pendulum at the bottom, a robot on the ground, etc. and can not be easily initialized at arbitrary states. Comparing the performance of SuPLE to commonly-used reward functions, we observe that the latter fail to find a solution without auxiliary exploration, even for the task of swinging up the double pendulum and keeping it stable at the upright position, a prototypical scenario for multi-linked robots. SuPLE-induced rewards for robot learning offer a novel route for effective robot learning in typical as opposed to highly specialized or fine-tuned scenarios. Our code is publicly available for reproducibility and further research.

Abstract:
We are motivated by the problem of autonomous vehicle performance validation. A key challenge is that an autonomous vehicle requires testing in every kind of driving scenario it could encounter, including rare events, to provide a strong case for safety and show there is no edge-case pathological behavior. Autonomous vehicle companies rely on potentially millions of miles driven in realistic simulation to expose the driving stack to enough miles to estimate rates and severity of collisions. To address scalability and coverage, we propose the use of a behavior foundation model, specifically a masked autoencoder (MAE), trained to reconstruct driving scenarios. We leverage the foundation model in two complementary ways: we (i) use the learned embedding space to group qualitatively similar scenarios together and (ii) fine-tune the model to label scenario difficulty based on the likelihood of a collision upon simulation. We use the difficulty scoring as importance weighting for the groups of scenarios. The result is an approach which can more rapidly estimate the rates and severity of collisions by prioritizing hard scenarios while ensuring exposure to every kind of driving scenario.

Abstract:
This paper focuses on the motion planning problem for serial articulated robots with revolute joints under kinematic constraints. Many motion planners leverage iterative local optimization methods but are often trapped in local minima due to non-convexity of the problem. A key reason for the non-convexity is the trigonometric term when parameterizing the kinematics using joint angles. Recent distance-based formulations can eliminate these trigonometric terms by formulating the kinematics based on distances, and has shown superior performance against classic joint angle based formulations in domains like inverse kinematics (IK). However, distance-based kinematics formulations have not yet been studied for motion planning, and naively applying them for motion planning may lead to poor computational efficiency. In particular, IK seeks one configuration while motion planning seeks a sequence of configurations, which greatly increases the scale of the underlying optimization problem. This paper proposes Propagative Distance Optimization for Motion Planning (PDOMP), which addresses the challenge by (i) introducing a new compact representation that reduces the number of variables in the distance-based formulation, and (ii) leveraging the chain structure to efficiently compute forward kinematics and Jacobians of the robot among waypoints along a path. Test results show that PDOMP runs up to 10 times faster than the sampling-based and angle-based-optimization baseline methods.

Abstract:
The automated generation of diversified training scenarios has been an important ingredient in many complex learning tasks, especially in real-world application domains such as autonomous driving, where auto-curriculum generation is considered vital for obtaining robust and general policies. However, crafting traffic scenarios with multiple, heterogeneous agents is typically considered a tedious and time-consuming task, especially in more complex simulation environments. To this end, we introduce MATS-Gym, a multi-agent training framework for autonomous driving that uses partial-scenario specifications to generate traffic scenarios with a variable number of agents which are executed in CARLA, a high-fidelity driving simulator. MATS-Gym reconciles scenario execution engines, such as Scenic and ScenarioRunner, with established multi-agent training frameworks where the interaction between the environment and the agents is modeled as a partially observable stochastic game. Furthermore, we integrate MATSGym with techniques from unsupervised environment design to automate the generation of adaptive auto-curricula, which is the first application of such algorithms to the domain of autonomous driving. The code is available at https://github.com/AutonomousDrivingExaminer/mats-gym.

Abstract:
Motion sensing and tracking with IMU data is essential for spatial intelligence, which however is challenging due to the presence of time-varying stochastic bias. IMU bias is affected by various factors such as temperature and vibration, making it highly complex and difficult to model analytically. Recent data-driven approaches using deep learning have shown promise in predicting bias from IMU readings. However, these methods often treat the task as a regression problem, overlooking the stochatic nature of bias. In contrast, we model bias, conditioned on IMU readings, as a probabilistic distribution and design a conditional diffusion model to approximate this distribution. Through this approach, we achieve improved performance and make predictions that align more closely with the known behavior of bias.

Abstract:
Robots need to know where they are in the world to operate effectively without human support. One common first step for precise robot localization is visual place recognition. It is a challenging problem, especially when the output is required in an online fashion, and the current state-of-the-art approaches that tackle it usually require either large amounts of labeled training data or rely on parameters that need to be tuned manually, often per dataset. One such parameter often used for sequence-based place recognition is the image similarity threshold that allows to differentiate between pairs of images that represent the same place even in the presence of severe environmental and structural changes, and those that represent different places even if they share a similar appearance. Currently, selecting this threshold is a manual procedure and requires human expertise. We propose an automatic similarity threshold selection technique and integrate it into a complete sequence-based place recognition system. The experiments on a broad range of real-world and simulated data show that our approach is capable of matching image sequences under various illumination, viewpoint and underlying structural changes, runs online, and requires no manual parameter tuning while yielding performance comparable to a manual, dataset-specific parameter tuning. Thus, this paper substantially increases the ease of use of visual place recognition in real-world settings.

Abstract:
Accurate 3d reconstruction capturing the fine details of an object's shape is essential for tasks such as automated assembly, inspection, and quality control. While monocular cameras provide broad visual structure but often miss critical surface details and depth accuracy in underexposed or occluded environments. Tactile sensors offer precise, localized depth information, capturing fine textures, yet exploring varied curvature surfaces with only tactile input remains challenging. To address this, the paper proposes a blind surface exploration method for convex objects using a set of sequential controllers to efficiently guide the manipulator's interaction with surfaces featuring sharp edge changes. This approach ensures precise tactile exploration, leading to highly detailed surface reconstruction. With the controller employed, the algorithm was able to move along the surface while maintaining contact along normal and reconstruct the object with IoU as high as \mathbf9 1 % for objects with sharp edges.

Abstract:
Miniaturizing legged robot platforms is challenging due to hardware limitations that constrain the number, power density, and precision of actuators at that size. By leveraging design principles of quasi-passive walking robots at any scale, stable locomotion and steering can be achieved with simple mechanisms and open-loop control. Here, we present the design and control of “Zippy”, the smallest self-contained bipedal walking robot at only 3.6 cm tall. Zippy has rounded feet, a single motor without feedback control, and is capable of turning, skipping, and ascending steps. At its fastest pace, the robot achieves a forward walking speed of 25 cm/s, which is 10 leg lengths per second, the fastest biped robot of any size by that metric. This work explores the design and performance of the robot and compares it to similar dynamic walking robots at larger scales.

Abstract:
The term safety in robotics is often understood as a synonym for avoidance. Although this perspective has led to progress in path planning and reactive control, a generalization of this perspective is necessary to include task semantics relevant to contact-rich manipulation tasks, especially during teleoperation and to ensure the safety of learned policies. We introduce the semantics-aware distance function and a corresponding computational method based on the Kelvin Transformation. This allows us to compute smooth distance approximations in an unbounded domain by instead solving a Laplace equation in a bounded domain. The semantics-aware distance generalizes signed distance functions by allowing the zero level set to lie inside of the object in regions where contact is allowed, effectively incorporating task semantics, such as object affordances, in an adaptive implicit representation of safe sets. In numerical experiments we show the computational viability of our method for real applications and visualize the computed function on a wrench with various semantic regions.

Abstract:
In their seminal work, Gauci et al. (2014) studied the fundamental task of aggregation, wherein multiple robots need to gather without an a priori agreed-upon meeting location, using extremely limited hardware. That paper considered differential-drive robots that are memoryless and unable to compute. Moreover, the robots cannot communicate with one another and are only equipped with a simple sensor that determines whether another robot is directly in front of them. Despite those severe limitations, Gauci et al. introduced a controller and proved mathematically that it aggregates a system of two robots for any initial state. Unfortunately, for larger systems, the same controller aggregates empirically in many cases but not all. Thus, the question of whether there exists a controller that aggregates for any number of robots remains open. In this paper, we show that no such controller exists by investigating the geometric structure of controllers. In addition, we disprove the aggregation proof of the aforementioned paper for two robots and present an alternative controller alongside a simple and rigorous aggregation proof.

Abstract:
Recently, large language models (LLMs) have shown strong potential in facilitating human-robotic interaction and collaboration. However, existing LLM-based systems often overlook the misalignment between human and robot perceptions, which hinders their effective communication and real-world robot deployment. To address this issue, we introduce SYNERGAI, a unified system designed to achieve both perceptual alignment and human-robot collaboration. At its core, SYNERGAI employs 3D Scene Graph (3DSG) as its explicit and innate representation. This enables the system to leverage LLM to break down complex tasks and allocate appropriate tools in intermediate steps to extract relevant information from the 3DSG, modify its structure, or generate responses. Importantly, SYNERGAI incorporates an automatic mechanism that enables perceptual misalignment correction with users by updating its 3DSG with online interaction. SYNERGAI achieves comparable performance with the data-driven models in ScanQA in a zero-shot manner. Through comprehensive experiments across 10 real-world scenes, SYNERGAI demonstrates its effectiveness in establishing common ground with humans, realizing a success rate of 61.9 % in alignment tasks. It also significantly improves the success rate from 3.7% to 45.68 % on novel tasks by transferring the knowledge acquired during alignment.

Abstract:
This paper presents DynORecon, a Dynamic Object Reconstruction system that leverages the information provided by Dynamic SLAM to simultaneously generate a volumetric map of observed moving entities while estimating free space to support navigation. By capitalising on the motion estimations provided by Dynamic SLAM, DynORecon continuously refines the representation of dynamic objects to eliminate residual artefacts from past observations and incrementally reconstructs each object, seamlessly integrating new observations to capture previously unseen structures. Our system is highly efficient (~20 FPS) and produces accurate (~10 cm) object reconstructions using simulated and real-world outdoor datasets.

Abstract:
Hamilton-Jacobi (HJ) reachability analysis is a powerful framework for ensuring safety and performance in autonomous systems. However, existing methods typically rely on a white-box dynamics model of the system, limiting their applicability in many practical robotics scenarios where only a black-box model of the system is available. In this work, we propose a novel reachability method to compute reachable sets and safe controllers for black-box dynamical systems. Our approach efficiently approximates the Hamiltonian function using samples from the black-box dynamics. This Hamiltonian is then used to solve the HJ Partial Differential Equation (PDE), providing the reachable set of the system. The proposed method can be applied to general nonlinear systems and can be seamlessly integrated with existing reachability toolboxes for white-box systems to extend their use to black-box systems. Through simulation studies on a black-box slip-wheel car and a quadruped robot, we demonstrate the effectiveness of our approach in accurately obtaining the reachable sets for blackbox dynamical systems.

Abstract:
Lane topology extraction involves detecting lanes and traffic elements and determining their relationships, a key perception task for mapless autonomous driving. This task requires complex reasoning, such as determining whether it is possible to turn left into a specific lane. To address this challenge, we introduce neuro-symbolic methods powered by vision-language foundation models (VLMs). Existing approaches have notable limitations: (1) Dense visual prompting with VLMs can achieve strong performance but is costly in terms of both financial resources and carbon footprint, making it impractical for robotics applications. (2) Neuro-symbolic reasoning methods for 3D scene understanding fail to integrate visual inputs when synthesizing programs, making them ineffective in handling complex corner cases. To this end, we propose a fast-slow neuro-symbolic lane topology extraction algorithm, named Chameleon, which alternates between a fast system that directly reasons over detected instances using synthesized programs and a slow system that utilizes a VLM with a chain-of-thought design to handle corner cases. Chameleon leverages the strengths of both approaches, providing an affordable solution while maintaining high performance. We evaluate the method on the OpenLane-V2 dataset, showing consistent improvements across various baseline detectors. Our code, data, and models are publicly available at https://github.com/XR-Lee/neural-symbolic

Abstract:
Learning collaborative behaviors is essential for multi-agent systems. Traditionally, multi-agent reinforcement learning solves this implicitly through a joint reward and centralized observations, assuming collaborative behavior will emerge. Other studies propose to learn from demonstrations of a group of collaborative experts. Instead, we propose an efficient and explicit way of learning collaborative behaviors in multi-agent systems by leveraging expertise from only a single human. Our insight is that humans can naturally take on various roles in a team. We show that agents can effectively learn to collaborate by allowing a human operator to dynamically switch between controlling agents for a short period and incorporating a human-like theory-of-mind model of teammates. Our experiments showed that our method improves the success rate of a challenging collaborative hide-and-seek task by up to 58% with only 40 minutes of single-human guidance. We further demonstrate our findings transfer to the real world by conducting multi-robot experiments.

Abstract:
Scene Change Detection is a challenging task in computer vision and robotics that aims to identify differences between two images of the same scene captured at different times. Traditional change detection methods rely on training models that take these image pairs as input and estimate the changes, which requires large amounts of annotated data, a costly and time-consuming process. To overcome this, we propose ZeroSCD, a zero-shot scene change detection framework that eliminates the need for training. ZeroSCD leverages pre-existing models for place recognition and semantic segmentation, utilizing their features and outputs to perform change detection. In this framework, features extracted from the place recognition model are used to estimate correspondences and detect changes between the two images. These are then combined with segmentation results from the semantic segmentation model to precisely delineate the boundaries of the detected changes. Extensive experiments on benchmark datasets demonstrate that ZeroSCD outperforms several state-of-the-art methods in change detection accuracy, despite not being trained on any of the benchmark datasets, proving its effectiveness and adaptability across different scenarios.

Abstract:
Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. In this work, we present COME-robot, the first closed-loop robotic system utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot incorporates two key innovative modules: (i) a multi-level open-vocabulary perception and situated reasoning module that enables effective exploration of the 3D environment and target object identification using commonsense knowledge and situated information, and (ii) an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution success, and traces failure causes across different modules for robust failure recovery. Through comprehensive experiments involving 8 challenging real-world mobile and tabletop manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (～ 35 %) compared to state-of-the-art methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.

Abstract:
Reinforcement Learning (RL) has the potential to enable robots to learn from their own actions in the real world. Unfortunately, RL can be prohibitively expensive, in terms of on-robot runtime, due to inefficient exploration when learning from a sparse reward signal. Designing dense reward functions is labour-intensive and requires domain expertise. In our work, we propose Goal-Contrastive Rewards (GCR), a dense reward function learning method that can be trained on passive video demonstrations. By using videos without actions, our method is easier to scale, as we can use arbitrary videos. GCR combines two loss functions, an implicit value loss function that models how the reward increases when traversing a successful trajectory, and a goal-contrastive loss that discriminates between successful and failed trajectories. We perform experiments in simulated manipulation environments across RoboMimic and MimicGen tasks, as well as in the real world using a Franka arm and a Spot quadruped. We find that GCR leads to a more-sample efficient RL, enabling model-free RL to solve about twice as many tasks as our baseline reward learning methods. We also demonstrate positive cross-embodiment transfer from videos of people and of other robots performing a task. Website: https://gcr-robot.github.io/.

Abstract:
Comprehensive and consistent dynamic scene understanding from camera input is essential for advanced autonomous systems. Traditional camera-based perception tasks like 3D object tracking and semantic occupancy prediction lack either spatial comprehensiveness or temporal consistency. In this work, we introduce a brand-new task, Camera-based 4D Panoptic Occupancy Tracking, which simultaneously addresses panoptic occupancy segmentation and object tracking from camera-only input. Furthermore, we propose TrackOcc, a cutting-edge approach that processes image inputs in a streaming, end-to-end manner with 4D panoptic queries to address the proposed task. Leveraging the localization-aware loss, TrackOcc enhances the accuracy of 4D panoptic occupancy tracking without bells and whistles. Experimental results demonstrate that our method achieves state-of-the-art performance on the Waymo dataset. The source code will be released at https://github.com/Tsinghua-MARS-Lab/TrackOcc.

Abstract:
Range-only Simultaneous Localisation and Mapping (RO-SLAM) is of interest due to its practical applications in ultra-wideband (UWB) and Bluetooth Low Energy (BLE) localisation in terrestrial and aerial applications and acoustic beacon localisation in submarine applications. In this work, we consider a mobile robot equipped with an inertial measurement unit (IMU) and a range sensor that measures distances to a collection of fixed landmarks. We derive an equivariant filter (EqF) for the RO-SLAM problem based on a symmetry Lie group that is compatible with the range measurements. The proposed filter does not require bootstrapping or initialisation of landmark positions, and demonstrates robustness to the noprior situation. The filter is demonstrated on a real-world dataset, and it is shown to significantly outperform a state-of-the-art EKF alternative in terms of both accuracy and robustness.

Abstract:
Autonomous robots can survey and monitor large environments. However, these robots often have limited compu-tational and power resources, making it crucial to develop an ef-ficient and adaptive informative path planning (IPP) algorithm. Such an algorithm must quickly adapt to environmental data to maximize the information collected while accommodating path constraints, such as distance budgets and boundary limitations. Current approaches to this problem often rely on maximizing mutual information using methods such as greedy algorithms, Bayesian optimization, and genetic algorithms. These methods can be slow and do not scale well to large or 3D environments. We present an adaptive IPP approach that is fully differentiable, significantly faster than previous methods, and scalable to 3D spaces. Our approach also supports continuous sensing robots, which collect data continuously along the entire path, by leveraging streaming sparse Gaussian processes. Benchmark results on two real-world datasets demonstrate that our approach yields solutions that are on par with or better than baseline methods while being up to two orders of magnitude faster. Additionally, we showcase our adaptive IPP approach in a 3D space using a system-on-chip embedded computer with minimal computational resources. Our code is available in the SGP- Tools Python library with a companion ROS 2 package for deployment on ArduPilot-based robots.

Abstract:
Dynamic in-hand manipulation remains challenging for soft robotic systems, which have demonstrated advantages in safe, compliant interactions but struggle with highspeed dynamic tasks. In this work, we present SWIFT, a system for learning dynamic tasks using a soft and compliant robotic hand. Unlike previous works that rely on simulation, quasistatic actions, and precise object models, SWIFT learns to spin a pen through trial and error using only real-world data and without requiring explicit knowledge of the pen's physical attributes. With self-labeled trials sampled from the real world, SWIFT discovers the set of pen grasping and spinning primitive parameters that enables a soft hand to spin a pen reliably. After 130 sampled actions per object, SWIFT achieves 10/10 success rate across three pens with different weights and weight distributions, demonstrating generalizability and robustness to changes in object properties. The results highlight the potential for soft robotic end-effectors to perform dynamic tasks. We also demonstrate generalization to different shapes and weights, such as a brush and a screwdriver, with 10/10 and 5/10 success rates, respectively. Videos, data, and code are available at https://soft-spin.github.io.

Abstract:
In this paper, we leverage self-supervised vision transformer models and their emergent semantic abilities to improve the generalization abilities of imitation learning policies. We introduce DVK, an imitation learning algorithm that leverages rich pre-trained Visual Transformer patch-level embeddings to obtain better generalization when learning through demonstrations. Our learner sees the world by clustering appearance features into groups associated with semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types. We demonstrate how this representation enables generalized behaviour by evaluating imitation learning across a diverse dataset of object manipulation tasks. To facilitate further study of generalization in Imitation Learning, all of our code for the method and evaluation, as well as the dataset, is made available.

Abstract:
While humans can successfully navigate using abstractions, ignoring details that are irrelevant to the task at hand, most of the existing approaches in robotics require detailed environment representations which consume a significant amount of sensing, computing, and storage; these issues become particularly important in resource-constrained settings with limited power budgets. Deep learning methods can learn from prior experience to abstract knowledge from novel environments, and use it to more efficiently execute tasks such as frontier exploration, object search, or scene understanding. We propose BoxMap, a Detection-Transformer-based architecture that takes advantage of the structure of the sensed partial environment to update a topological graph of the environment as a set of semantic entities (rooms and doors) and their relations (connectivity). The predictions from low-level measurements can be leveraged to achieve high-level goals with lower computational costs than methods based on detailed representations. As an example application, we consider a robot equipped with a 2-D laser scanner tasked with exploring a residential building. Our BoxMap representation scales quadratically with the number of rooms (with a small constant), resulting in significant savings over a full geometric map. Moreover, our high-level topological representation results in 30.9 % shorter trajectories in the exploration task with respect to a standard method. Code is available at: bit.ly/3F6w2Yl.

Abstract:
Vehicle-to-vehicle (V2V) based cooperative perception enhances autonomous driving by overcoming single-agent perception limitations such as occlusions, without relying on extensive infrastructure. However, most existing methods have two key limitations. They treat cooperative perception in isolation, with little consideration for downstream tasks such as planning, leading to poor coordination and inefficient planning decisions. They also assume perception model homogeneity across all vehicles, which can be impractical among vehicles from different manufacturers. To bridge such gaps, we propose Scout, an early-fusion framework for planning-oriented cooperative perception among vehicles of heterogeneous models. Specifically, we formalize a notion of \Delta \theta-Risk Increment Distribution (RID) to capture the distribution of the risk increment by incomplete perception to the current trajectory plan, and define a Priority Index (PI) metric for prioritizing cooperative perception on riskier regions. We develop algorithms to estimate \Delta \theta-RID and PI at run-time with theoretical bounds. Empirical results demonstrate that Scout surpasses state-of-the-art methods and strong baselines on challenging benchmarks, achieving higher success rates with only 3-10% of their communication volume.

Abstract:
We develop a coverage approach for heterogeneous agents that leverages the different sensing and motion capabilities of a team. Coverage performance is measured using ergodicity, which when optimized balances exploitation versus exploration, where areas of interest are indicated with an information metric. Prior work uses spectral decomposition of a spatial map of information to guide a set of heterogeneous agents, each with different sensor and motion models, to optimize coverage. This work leverages wavelet transforms to decompose the information map rather than the Fourier transform typically applied to ergodic search and demonstrates the importance of selecting a suitable wavelet family to use, based on the information map being explored. Further a sequence of wavelets is used for decomposition to overcome dependency on selecting one suitable wavelet family. Our experimental results show that using wavelet families well-suited to the specific information map for information map decomposition leads to, on average, 43% improvement over a baseline method in terms of a standard coverage metric (ergodicity), while using a wellsequenced set of wavelets for decomposition leads to a 65% improvement in coverage performance across multiple types of information maps.

Abstract:
Forecasting the future trajectories of surrounding agents is crucial for autonomous vehicles to ensure safe, efficient, and comfortable route planning. While model ensembling has improved prediction accuracy in various fields, its application in trajectory prediction is limited due to the multi-modal nature of predictions. In this paper, we propose a novel sampling method applicable to trajectory prediction based on the predictions of multiple models. We first show that conventional sampling based on predicted probabilities can degrade performance due to missing alignment between models. To address this problem, we introduce a new method that generates optimal trajectories from a set of neural networks, framing it as a risk minimization problem with a variable loss function. By using state-of-the-art models as base learners, our approach constructs diverse and effective ensembles for optimal trajectory sampling. Extensive experiments on the nuScenes prediction dataset demonstrate that our method surpasses current state-of-the-art techniques, achieving top ranks on the leaderboard. We also provide a comprehensive empirical study on ensembling strategies, offering insights into their effectiveness. Our findings highlight the potential of advanced ensembling techniques in trajectory prediction, significantly improving predictive performance and paving the way for more reliable predicted trajectories.

Abstract:
Our work investigates how social robots can efficiently collaborate with human users in a user-aware manner, minimising the generated frustration in human colleagues, thus enhancing their experience. As part of this, we develop a useraware framework for human-robot collaborative learning. We model users' frustration during human-robot interactions based on recent interactions inspired by Psychological principles and develop different frustration-aware interactive preference learning and decision-making models using multi-armed bandit and knapsack methods. Evaluating our approach, 1) we conducted simulated experiments on realistic human-behaviour datasets and 2) a user-study in which participants worked with a TIAGo Steel humanoid robot on a collaboration task using frustration- aware and non frustration-aware (Upper Confidence Bounds and Instruction-based) models. We demonstrate that when collaborating with the frustration-aware robot, users completed the collaboration task 9.04% faster and using 20.54% less number of verbal interactions, with user questionnaire responses reporting less frustration experienced compared to the baseline approaches. Additionally, we create a multimodal dataset containing over 6 hours of human-robot interactions displaying various explicit and implicit user responses.

Abstract:
Most recent successes in robot reinforcement learning involve learning a specialized single-task agent. However, robots capable of performing multiple tasks can be much more valuable in real-world applications. Multi-task reinforcement learning can be very challenging due to the increased sample complexity and the potentially conflicting task objectives. Previous work on this topic is dominated by model-free approaches. The latter can be very sample inefficient even when learning specialized single-task agents. In this work, we focus on model-based multi-task reinforcement learning. We propose a method for learning multi-task visual world models, leveraging pre-trained language models to extract semantically meaningful task representations. These representations are used by the world model and policy to reason about task similarity in dynamics and behavior. Our results highlight the benefits of using language-driven task representations for world models and a clear advantage of model-based multi-task learning over the more common model-free paradigm.

Abstract:
In this paper, we present an approach for controlling a team of agile fixed-wing aerial vehicles in close proximity to one another. Our approach relies on recedinghorizon nonlinear model predictive control (NMPC) to plan maneuvers across an expanded flight envelope to enable interagent collision avoidance. To facilitate robust collision avoidance and characterize the likelihood of inter-agent collisions, we compute a statistical bound on the probability of the system leaving a tube around the planned nominal trajectory. Finally, we propose a metric for evaluating highly dynamic swarms and use this metric to evaluate our approach. We successfully demonstrated our approach through both simulation and hardware experiments, and to our knowledge, this the first time close-quarters swarming has been achieved with physical aerobatic fixed-wing vehicles.

Abstract:
In this work, we address the deployment problem for a team of mobile guards that tries to maintain a line-ofsight with an unpredictable mobile intruder. First, we present a computationally efficient strategy for generating a set of points, called kernel points, that covers the entire polygon. We then introduce a polygon partitioning technique based on the location of the kernel points. Next, we propose control laws for a free guard to track an intruder in general polygonal environments based on the analysis of a pursuit-evasion game around a single corner [1]. Finally, we present several variations of the proposed control laws that include capture and search, and illustrate the improvement in the overall visual footprint of the team of mobile guards based on extensive simulations.

Abstract:
Mobile robots on construction sites require accurate pose estimation to perform autonomous surveying and inspection missions. Localization in construction sites is a particularly challenging problem due to the presence of repetitive features such as flat plastered walls and perceptual aliasing due to apartments with similar layouts inter and intra floors. In this paper, we focus on the global re-positioning of a robot with respect to an accurate scanned mesh of the building solely using LiDAR data. In our approach, a neural network is trained on synthetic LiDAR point clouds generated by simulating a LiDAR in an accurate real-life large-scale mesh. We train a diffusion model with a PointNet++ backbone, which allows us to model multiple position candidates from a single LiDAR point cloud. The resulting model can successfully predict the global position of LiDAR in confined and complex sites despite the adverse effects of perceptual aliasing. The learned distribution of potential global positions can provide multi-modal position distribution. We evaluate our approach across five real-world datasets and show the place recognition accuracy of 77 % (\pm 2 ~\mathrmm) on average while outperforming baselines at a factor of 2 in mean error.

Abstract:
Microrobots and other miniature robots are able to access millimeter-sized spaces and thus have the potential to solve many challenging problems in healthcare. However, clinical adoption of these robots is rare as these systems are often difficult to scale up. One such issue arises from the actuation systems used to remotely control magnetic microrobots, which tend to be bulky and obstruct the surgeons' workspaces. They also do not guarantee wide ranges of magnetic fields and forces in a large patient-sized workspace. In this paper, we present the design of a permanent magnet-based actuation system that fits within a 40 cm cube of space under an operating table. We also formulate a new set function maximization-based approach for efficiently designing E-optimal magnet arrangements with off-the-shelf convex solvers. Our optimization method is evaluated with synthetic data and a proof-of-concept of the system is simulated.

Abstract:
Semi-supervised 3D object detection is a common strategy employed to circumvent the challenge of manually labeling large-scale autonomous driving perception datasets. Pseudo-labeling approaches to semi-supervised learning adopt a teacher-student framework in which machine-generated pseudo-labels on a large unlabeled dataset are used in combination with a small manually-labeled dataset for training. In this work, we address the problem of improving pseudo-label quality through leveraging long- term temporal information captured in driving scenes. More specifically, we leverage pre-trained motion-forecasting models to generate object trajectories on pseudo-labeled data to further enhance the student model training. Our approach improves pseudo-label quality in two distinct manners: first, we suppress false positive pseudo-labels through establishing consistency across multiple frames of motion forecasting outputs. Second, we compensate for false negative detections by directly inserting predicted object tracks into the pseudo-labeled scene. Experiments on the nuScenes dataset demonstrate the effectiveness of our approach, improving the performance of standard semi-supervised approaches in a variety of settings.

Abstract:
Sustaining the gait locomotion in an adversarial environment requires the robot to react to novel experiences adaptively. In Free Energy Principle (FEP), the behavioral reaction is driven by the discrepancy between observation and prediction. Although, for legged robot gait locomotion, the prediction of gait dynamics is challenging as the consequences non-linearly depend on the activity history, the animal gait is robust, adapting to severe motion disruptions seemingly instantly. In biomimetic robotics, the Central Pattern Generator (CPG) relaxes the general dynamics of body-environment interaction to the stable and repetitive dynamics of gait. Based on these observations, we propose self-learning of the gait dynamics model and FEP framework that infers state estimation and gait control. The proposed method is experimentally evaluated on a real hexapod walking robot with 18 controllable degrees of freedom. The robot learns the gait dynamics model indoors and then deploys it in outdoor navigation under various adversarial scenarios. Results show that the developed interpretable gait controller exhibits complex and real-time adaptive behavior when it encounters unknown situations.

Abstract:
Pre-training techniques play a crucial role in deep learning, enhancing models' performance across a variety of tasks. By initially training on large datasets and subsequently fine-tuning on task-specific data, pre-training provides a solid foundation for models, improving generalization abilities and accelerating convergence rates. This approach has seen significant success in the fields of natural language processing and computer vision. However, traditional pre-training methods necessitate large datasets and substantial computational resources, and they can only learn shared features through prolonged training and struggle to capture deeper, task-specific features. In this paper, we propose a task-oriented pre-training method that begins with generating redundant segmentation proposals using the Segment Anything (SAM) model. We then introduce a Specific Category Enhancement Fine-tuning (SCEF) strategy for fine-tuning the Contrastive Language-Image Pre-training (CLIP) model to select proposals most closely related to the drivable area from those generated by SAM. This approach can generate a lot of coarse training data for pre-training models, which are further fine-tuned using manually annotated data, thereby improving model's performance. Comprehensive experiments conducted on the KITTI road dataset demonstrate that our task-oriented pre-training method achieves an all-around performance improvement compared to models without pre-training (as shown in Fig. 1). Moreover, our pre-training method not only surpasses traditional pre-training approach but also achieves the best performance compared to state-of-the-art self-training methods. The open-source project can be found at https://sites.google.com/view/task-oriented-pre-training.

Abstract:
Distributed loitering synchronization is the process whereby a group of fixed-wing Unmanned Aerial Vehicles (UAVs) align with each other while they follow a circular path in the air. This process is essential to establish proper initial conditions for missions in the real world. We evaluate the performance of three synchronization algorithms using a setup of continuously moving fixed-wing drones randomly placed around a loitering circle. We consider the algorithm based on distributed consensus as a baseline. We propose two methods: the Minimum Of Shortest Arc (MOSA) algorithm that outperforms the baseline in this setup and Firefly multi-Pulse Synchronization (FPS), which is inspired by firefly synchronization. The latter method requires 10 times less communication while maintaining a performance comparable to the baseline. These algorithms were first tested in a simple simulation, then a more realistic simulation environment using Gazebo in which fixed-wing dynamics are considered. The proposed algorithms are rigorously tested in simulation through multiple trials involving a group of 10 UAVs, confirming the effectiveness of our approaches. The results were then validated in real flights using 3 fixed-wing drones.

Abstract:
Trajectory planning in robotics aims to generate collision-free pose sequences that can be reliably executed. Recently, vision-to-planning systems have gained increasing attention for their efficiency and ability to interpret and adapt to surrounding environments. However, traditional modular systems suffer from increased latency and error propagation, while purely data-driven approaches often overlook the robot's kinematic constraints. This oversight leads to discrepancies between planned trajectories and those that are executable. To address these challenges, we propose iKap, a novel vision-toplanning system that integrates the robot's kinematic model directly into the learning pipeline. iKap employs a self-supervised learning approach and incorporates the state transition model within a differentiable bi-level optimization framework. This integration ensures the network learns collision-free waypoints while satisfying kinematic constraints, enabling gradient backpropagation for end-to-end training. Our experimental results demonstrate that iKap achieves higher success rates and reduced latency compared to the state-of-the-art methods. Besides the complete system, iKap offers a visual-to-planning network that seamlessly works with various controllers, providing a robust solution for robots navigating complex environments.

Abstract:
Robotic surgery has reached a high level of maturity and has become an integral part of standard surgical care. However, existing surgeon consoles are bulky, take up valuable space in the operating room, make surgical team coordination challenging, and their proprietary nature makes it difficult to take advantage of recent technological advances, especially in virtual and augmented reality. One potential area for further improvement is the integration of modern sensory gloves into robotic platforms, allowing surgeons to control robotic arms intuitively with their hand movements. We propose one such system that combines an HTC Vive tracker, a Manus Meta Prime 3 XR sensory glove, and SCOPEYE wireless smart glasses. The system controls one arm of a da Vinci surgical robot. In addition to moving the arm, the surgeon can use fingers to control the end-effector of the surgical instrument. Hand gestures are used to implement clutching and similar functions. In particular, we introduce clutching of the instrument orientation, a functionality unavailable in the da Vinci system. The vibrotactile elements of the glove are used to provide feedback to the user when gesture commands are invoked. A qualitative and quantitative evaluation has been conducted that compares the current device with the dVRK console. The system is shown to have excellent tracking accuracy, and the new interface allows surgeons to perform common surgical training tasks with minimal practice efficiently.

Affiliations: Department of Automation, Key Laboratory of Sys-tem Control and Information Processing, Ministry of Education of China, Shanghai Jiao Tong University, Shanghai, China; Department of Civil and Environmental Engineering, National University of Singapore, Singapore, Singapore; Department of Automation, Shanghai Jiao Tong University, the School of Mechatronic Engineering and Automation, Shanghai University, and the School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai, China

Abstract:
The rapid advancement of artificial intelligence has significantly enhanced the intelligence of autonomous vehicles (AVs). However, owing to the complexity of AV behavior and the high dimensionality of driving environments, the objective and practical quantitative evaluation of AV intelligence remains a significant and unresolved challenge. This paper proposes a robust training-based comprehensive evaluation (RTCE) system specifically designed to assess the intelligence of AVs in the time dimension. Beginning with a foundation model, the first generation of AVs is developed by training in the initial naturalistic traffic scenarios. To effectively test the intelligence of the AVs, we propose an adversarial trajectory optimization technique to generate challenging, critical test scenarios that evaluate the learning capabilities of AVs in complex environments. Through robust training in these complex scenarios, the second generation of AVs is obtained. To objectively and effectively quantify the intelligence of AVs, we further propose a comprehensive evaluation metric system encompassing five dimensions and 14 evaluation metrics. The intelligence score of each AV is computed using the objective multi-criteria decision-making approach. The proposed intelligence evaluation method is validated using various self-evolution autonomous driving algorithms. The results demonstrate that the RTCE method can quantitatively and effectively test the intelligence of AVs in a multi-dimensional and automated manner. Furthermore, the proposed method is flexible and generalizable, making it adaptable to different testing platforms and autonomous driving algorithms.

Abstract:
The deployment of autonomous vehicles controlled by machine learning techniques requires extensive testing in diverse real-world environments, robust handling of edge cases and out-of-distribution scenarios, and comprehensive safety validation to ensure that these systems can navigate safely and effectively under unpredictable conditions. Addressing Out-OfDistribution (OOD) driving scenarios is essential for enhancing safety, as OOD scenarios help validate the reliability of the models within the vehicle's autonomy stack. However, generating OOD scenarios is challenging due to their long-tailed distribution and rarity in urban driving datasets. Recently, Large Language Models (LLMs) have shown promise in autonomous driving, particularly for their zero-shot generalization and common-sense reasoning capabilities. In this paper, we leverage these LLM strengths to introduce a framework for generating diverse OOD driving scenarios. Our approach uses LLMs to construct a branching tree, where each branch represents a unique OOD scenario. These scenarios are then simulated in the CARLA simulator using an automated framework that aligns scene augmentation with the corresponding textual descriptions. We evaluate our framework through extensive simulations, and assess its performance via a diversity metric that measures the richness of the scenarios. Additionally, we introduce a new “OOD-ness” metric, which quantifies how much the generated scenarios deviate from typical urban driving conditions. Furthermore, we explore the capacity of modern Vision-Language Models (VLMs) to interpret and safely navigate through the simulated OOD scenarios. Our findings offer valuable insights into the reliability of language models in addressing OOD scenarios within the context of urban driving.

Abstract:
Coordinated multi-robot navigation is an essential ability for a team of robots operating in diverse environments. Robot teams often need to maintain specific formations, such as wedge formations, to enhance visibility, positioning, and efficiency during fast movement. However, complex environments such as narrow corridors challenge rigid team formations, which makes effective formation control difficult in real-world environments. To address this challenge, we introduce a novel Adaptive Formation with Oscillation Reduction (AFOR) approach to improve coordinated multi-robot navigation. We develop AFOR under the theoretical framework of hierarchical learning and integrate a spring-damper model with hierarchical learning to enable both team coordination and individual robot control. At the upper level, a graph neural network facilitates formation adaptation and information sharing among the robots. At the lower level, reinforcement learning enables each robot to navigate and avoid obstacles while maintaining the formations. We conducted extensive experiments using Gazebo in the Robot Operating System (ROS), a high-fidelity Unity3D simulator with ROS, and real robot teams. Results demonstrate that AFOR enables smooth navigation with formation adaptation in complex scenarios and outperforms previous methods. More details of this work are provided on the project website: https://hcrlab.gitlab.io/project/afor.

Abstract:
Robust and efficient local feature matching plays a crucial role in applications such as SLAM and visual localization for robotics. Despite great progress, it is still very challenging to extract robust and discriminative visual features in scenarios with drastic lighting changes, low texture areas, or repetitive patterns. In this paper, we propose a new lightweight network called LiftFeat, which lifts the robustness of raw descriptor by aggregating 3D geometric feature. Specifically, we first adopt a pre-trained monocular depth estimation model to generate pseudo surface normal label, supervising the extraction of 3D geometric feature in terms of predicted surface normal. We then design a 3D geometry-aware feature lifting module to fuse surface normal feature with raw 2D descriptor feature. Integrating such 3D geometric feature enhances the discriminative ability of 2D feature description in extreme conditions. Extensive experimental results on relative pose estimation, homography estimation, and visual localization tasks, demonstrate that our LiftFeat outperforms some lightweight state-of-the-art methods. Code will be released at: https://github.com/lyp-deeplearning/LiftFeat.

Abstract:
Trajectory prediction facilitates effective planning and decision-making, while constrained trajectory prediction integrates regulation into prediction. Recent advances in constrained trajectory prediction focus on structured constraints by constructing optimization objectives. However, handling unstructured constraints is challenging due to the lack of differentiable formal definitions. To address this, we propose a novel method for constrained trajectory prediction using a conditional generative paradigm, named Controllable Trajectory Diffusion (CTD). The key idea is that any trajectory corresponds to a degree of conformity to a constraint. By quantifying this degree and treating it as a condition, a model can implicitly learn to predict trajectories under unstructured constraints. CTD employs a pre-trained scoring model to predict the degree of conformity (i.e., a score), and uses this score as a condition for a conditional diffusion model to generate trajectories. Experimental results demonstrate that CTD achieves high accuracy on the ETH/UCY and SDD benchmarks. Qualitative analysis confirms that CTD ensures adherence to unstructured constraints and can predict trajectories that satisfy combinatorial constraints.

Abstract:
Aerial humanoid robots can enhance the efficiency and safety of rescue operations in disaster scenarios. The control of such complex machines presents many challenges, for instance, the control of the different locomotion strategies and the stabilization of the transition maneuvers. In this article, we present an online nonlinear Model Predictive Controller and the relative prediction model to stabilize walking and flying trajectories. The controller uses a reduced model to generate feasible base link references, thrust profiles, and contact forces while dealing with different locomotion strategies and transition maneuvers. The control algorithm is tested in a simulated environment using our aerial humanoid robot iRonCub under the effect of external disturbances. The proposed control strategy demonstrates to effectively stabilize the desired trajectories while keeping the problem still treatable online.

Abstract:
Ground terrain perception has become the primary visual task for the robust navigation of intelligent systems in unstructured outdoor environments. However, complex ter-rain poses a significant challenge to vision-based perception. This work introduces a novel estimation task using RGB images to facilitate low-cost terrain perception in extracting surface roughness information. The proposed task presents both semantic-aware and edge-aware roughness descriptors at the pixel level instead of a single value for a given image. To promote the research on the proposed novel terrain roughness estimation task, we introduce a multimodal synthetic dataset for terrain perception in outdoor scenes, containing multiple terrain categories, diverse viewpoints, different lighting and weather conditions, as well as semantic and roughness annotations. Additionally, inspired by computer graphics, we introduce TRENet, a roughness estimation architecture to model the intrinsic correlation of depth-normal-roughness. We also perform ablation studies on the effect of each component and diverse types of inputs. Extensive evaluations and comparisons demonstrate that our method can effectively predict pixel-wise terrain surface roughness with high accuracy.

Abstract:
We propose a new method for improving zero-shot ObjectNav that aims to utilize potentially available environmental percepts for navigational assistance. Our approach takes into account that the ground agent may have limited and sometimes obstructed view. Our formulation encourages Generative Communication (GC) between an assistive overhead agent with a global view containing the target object and the ground agent with an obfuscated view; both equipped with Vision-Language Models (VLMs) for vision-to-language translation. In this assisted setup, the embodied agents communicate environmental information before the ground agent executes actions towards a target. Despite the overhead agent having a global view with the target, we note a drop in performance (−13% in OSR and −13% in SPL) of a fully cooperative assistance scheme over an unassisted baseline. In contrast, a selective assistance scheme where the ground agent retains its independent exploratory behaviour shows a 10% OSR and 7.65% SPL improvement. To explain navigation performance, we analyze the GC for unique traits, quantifying the presence of hallucination and cooperation. Specifically, we identify the novel linguistic trait of preemptive hallucination in our embodied setting, where the overhead agent assumes that the ground agent has executed an action in the dialogue when it is yet to move, and note its strong correlation with navigation performance. We conduct real-world experiments and present some qualitative examples where we mitigate hallucinations via prompt finetuning to improve ObjectNav performance.

Abstract:
With the increasing use of wireless technologies in robotics for communication, sensing, and localization, the potential benefits of how robotics can complement and enhance wireless systems remain underexplored. This paper explores a novel application of the existing inflatable robots for wireless communication systems by forming a shape-programming, reflective waveguide that enhances the received signal quality for wireless devices. Our primary target is enhancing Low-Power Wide-Area Networks (LP-WANs) - where 10-year batterypowered client devices (e.g. energy meters or smart home sensors) connect to cellular-like base stations to deliver data. Devices in these networks often experience significant seasonal variability in battery life - even simple obstructions between the device and base station (e.g. due to construction) can shave off years of battery life. We propose MetaMorph, a programmable robotic reflector attached to base stations that enhances signal quality from client devices by enhancing received signal energy with controlled reflections. We investigate the design of the reflector, and our experiments show the ability to improve the signal quality for LP-WAN (LoRa) communication systems demonstrating signal quality and battery-benefits. To our best knowledge, MetaMorph is the first paper to explore how flexible robotics can serve as virtuous reflectors for wireless communication systems.

Abstract:
Clothes manipulation is a critical capability for household robots; yet, existing methods are often confined to specific tasks, such as folding or flattening, due to the complex high-dimensional geometry of deformable fabric. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP) for general-purpose clothes manipulation, which enables the robot to perform diverse manipulation tasks over different types of clothes. The key idea of CLASP is semantic keypoints-e.g., “right shoulder”, “left sleeve”, etc.-a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be effectively extracted from depth images and are sufficient to represent a broad range of clothes manipulation policies. CLASP leverages semantic keypoints to bridge LLM-powered task planning and low-level action execution in a two-level hierarchy. Extensive simulation experiments show that CLASP outperforms baseline methods across diverse clothes types in both seen and unseen tasks. Further, experiments with a Kinova dual-arm system on four distinct tasks-folding, flattening, hanging, and placing-confirm CLASP's performance on a real robot.

Abstract:
We propose MFSeg, an efficient multi-frame 3D semantic segmentation framework. By aggregating point cloud sequences at the feature level and regularizing the feature extraction and aggregation process, MFSeg reduces computational overhead while maintaining high accuracy. Moreover, by employing a lightweight MLP-based point decoder, our method eliminates the need to upsample redundant points from past frames. Experiments on the nuScenes and Waymo datasets show that MFSeg outperforms existing methods, demonstrating its effectiveness and efficiency.

Abstract:
This study focuses on the robotic capability of ungrasping, or releasing, an object in a grasp from the gripper to the robot's environment. The presented technique enables the delicate release of a grasped object using non-static contacts, allowing for rolling and/or sliding. This dexterous manipulation capability is particularly relevant when ungrasping thin or slender objects, as will be demonstrated with real examples. We initially discuss the establishment of three-dimensional stability during ungrasping manipulation, ensuring robustness. Subsequently, we present a planning and control solution for three-dimensional ungrasping, building upon our previous planar version. A series of experiments across various test scenarios, ranging from precision placement to puzzle tiling, showcase the viability and effectiveness of our approach.

Abstract:
Accurate localization of graspable regions within a single object point cloud is critical to enable task-based robot grasps. State-of-the-art task-based robot grasp synthesis methods fit over-approximated 3D bounding boxes that, in some cases, fail to isolate graspable regions even if they exist. While deep learning or geometrical shape decomposition methods can offer improved approximations, they lack guarantees for the graspability of segmented regions, require prior knowledge of the object, and/or demand large annotated datasets for fine-tuning. In this paper, we overcome these limitations to introduce ITSI (Iterative Slicing). ITSI is a complete, taskoriented grasp synthesis approach that functions independently of object-specific knowledge. ITSI effectively segments multiple graspable regions that conform to the constraints of robot grippers, thereby enabling compatibility with any object a robot seeks to grasp and any robot gripper size. Our extensive realworld and simulation experiments on diverse object datasets demonstrate how ITSI dramatically increases the number of discoverable robot grasps by up to 44 % when compared to the state-of-the-art. We also expand ITSI's capabilities beyond task-based robot grasp synthesis to highlight its performance in human affordance segmentation, where our performance is comparable to fully supervised deep-learning based methods (in fact, we outperform them by 1 %).

Abstract:
We would like a robot to navigate to a goal location while minimizing state uncertainty. To aid the robot in this endeavor, maps provide a prior belief over the location of objects and regions of interest. To localize itself within the map, a robot identifies mapped landmarks using its sensors. However, as the time between map creation and robot deployment increases, portions of the map can become stale, and landmarks, once believed to be permanent, may disappear. We refer to the propensity of a landmark to disappear as landmark evanescence. Reasoning about landmark evanescence during path planning, and the associated impact on localization accuracy, requires analyzing the presence or absence of each landmark, leading to an exponential number of possible outcomes of a given motion plan. To address this complexity, we develop BRULE, an extension of the Belief Roadmap. During planning, we replace the belief over future robot poses with a Gaussian mixture which is able to capture the effects of landmark evanescence. Furthermore, we show that belief updates can be made efficient, and that maintaining a random subset of mixture components is sufficient to find high quality solutions. We demonstrate performance in simulated and real-world experiments. Software is available at https://bit.ly/BRULE.

Abstract:
Achieving human-like dexterity is a longstanding challenge in robotics, in part due to the complexity of planning and control for contact-rich systems. In reinforcement learning (RL), one popular approach has been to use massively-parallelized, domain-randomized simulations to learn a policy offline over a vast array of contact conditions, allowing robust sim-to-real transfer. Inspired by recent advances in real-time parallel simulation, this work considers instead the viability of online planning methods for contact-rich manipulation by studying the well-known in-hand cube reorientation task. We propose a simple architecture that employs a sampling-based predictive controller and vision-based pose estimator to search for contact-rich control actions online. We conduct thorough experiments to assess the real-world performance of our method, architectural design choices, and key factors for robustness, demonstrating that our simple sampling-based approach achieves performance comparable to prior RL-based works. Supplemental material: https://caltech-amber.github.io/drop.

Abstract:
Learning-based methods for constructing control barrier functions (CBFs) are gaining popularity for ensuring safe robot control. A major limitation of existing methods is their reliance on extensive sampling over the state space or online system interaction in simulation. In this work we propose a novel framework for learning neural CBFs through a fixed, sparsely-labeled dataset collected prior to training. Our approach introduces new annotation techniques based on out-of-distribution analysis, enabling efficient knowledge propagation from the limited labeled data to the unlabeled data. We also eliminate the dependency on a high-performance expert controller, and allow multiple sub-optimal policies or even manual control during data collection. We evaluate the proposed method on real-world platforms. With limited amount of offline data, it achieves state-of-the-art performance for dynamic obstacle avoidance, demonstrating statistically safer and less conservative maneuvers compared to existing methods.

Abstract:
We present a method for synthesizing navigation policies for multi-robot teams that interpret and follow natural language instructions. We condition these policies on embeddings from pretrained Large Language Models (LLMs), and train them via offline reinforcement learning with as little as 20 minutes of randomly-collected real-world data. Experiments on a team of five real robots show that these policies generalize well to unseen commands, indicating an understanding of the LLM latent space. Our method requires no simulators or environment models, and produces low-latency control policies that can be deployed directly to real robots without finetuning. We provide videos of our experiments at https://sites.google.com/view/llm-marl.

Abstract:
Visual localization refers to the process of determining camera poses and orientation within a known scene representation. This task is often complicated by factors such as changes in illumination and variations in viewing angles. In this paper, we propose HGSLoc, a novel lightweight plug-and-play pose optimization framework, which integrates 3D reconstruction with a heuristic refinement strategy to achieve higher pose estimation accuracy. Specifically, we introduce an explicit geometric map for 3D representation and high-fidelity rendering, allowing the generation of high-quality synthesized views to support accurate visual localization. Our method demonstrates higher localization accuracy compared to NeRFbased neural rendering localization approaches. We introduce a heuristic refinement strategy, its efficient optimization capability can quickly locate the target node, while we set the steplevel optimization step to enhance the pose accuracy in the scenarios with small errors. With carefully designed heuristic functions, it offers efficient optimization capabilities, enabling rapid error reduction in rough localization estimations. Our method mitigates the dependence on complex neural network models while demonstrating improved robustness against noise and higher localization accuracy in challenging environments, as compared to neural network joint optimization strategies. The optimization framework proposed in this paper introduces novel approaches to visual localization by integrating the advantages of 3D reconstruction and the heuristic refinement strategy, which demonstrates strong performance across multiple benchmark datasets, including 7Scenes and Deep Blending dataset. The implementation of our method has been released at https://github.com/anchang699/HGSLoc.

Abstract:
Collision detection is a critical functionality for robotics. The degree to which objects collide cannot be represented as a continuously differentiable function for any shapes other than spheres. This paper proposes a framework for handling collision detection between polyhedral shapes. We frame the signed distance between two polyhedral bodies as the optimal value of a convex optimization, and consider constraining the signed distance in a bilevel optimization problem. To avoid relying on specialized bilevel solvers, our method exploits the fact that the signed distance is the minimal point of a convex region related to the two bodies. Our method enumerates the values obtained at all extreme points of this region and lists them as constraints in the higher-level problem. We compare our formulation to existing methods in terms of reliability and speed when solved using the same mixed complementarity problem solver. We demonstrate that our approach more reliably solves difficult collision detection problems with multiple obstacles than other methods, and is faster than existing methods in some cases.

Abstract:
Diffusion models have been successfully applied to robotics problems such as manipulation and vehicle path planning. In this work, we explore their application to end-to-end navigation - including both perception and planning - by considering the problem of jointly performing global localization and path planning in known but arbitrary 2D environments. In particular, we introduce a diffusion model which produces collision-free paths in a global reference frame given an egocentric LIDAR scan, an arbitrary map, and a desired goal position. To this end, we implement diffusion in the space of paths in \textSE(2), and describe how to condition the denoising process on both obstacles and sensor observations. In our evaluation, we show that the proposed conditioning techniques enable generalization to realistic maps of considerably different appearance than the training environment, demonstrate our model's ability to accurately describe ambiguous solutions, and run extensive simulation experiments showcasing our model's use as a real-time, end-to-end localization and planning stack.

Abstract:
In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named DiT-Block Policy, that significantly outperforms the state of the art in solving long-horizon (1500+ \texttime-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that leverage the efficiency of generative diffusion modeling with the scalability of large scale transformer architectures. Code, robot dataset, and videos are available at: https://dit-policy.github.io

Abstract:
Generative policies trained with human demonstrations can autonomously accomplish multimodal, longhorizon tasks. However, during inference, humans are often removed from the policy execution loop, limiting the ability to guide a pre-trained policy towards a specific sub-goal or trajectory shape among multiple predictions. Naive human intervention may inadvertently exacerbate distribution shift, leading to constraint violations or execution failures. To better align policy output with human intent without inducing out-of-distribution errors, we propose an Inference-Time Policy Steering (ITPS) framework that leverages human interactions to bias the generative sampling process, rather than finetuning the policy on interaction data. We evaluate ITPS across three simulated and real-world benchmarks, testing three forms of human interaction and associated alignment distance metrics. Among six sampling strategies, our proposed stochastic sampling with diffusion policy achieves the best trade-off between alignment and distribution shift. Videos are available at https://yanweiw.github.io/itps/.

Abstract:
In both animals and robots, locomotion capabilities are determined by the physical structure of the system. The majority of legged animals and robots are bilaterally symmetric, which facilitates locomotion with consistent headings and obstacle traversal, but leads to constraints in their turning ability. On the other hand, radially symmetric animals have demonstrated rapid turning abilities enabled by their omnidirectional body plans. Radially symmetric tripedal robots are able to turn instantaneously, but are commonly constrained by needing to change direction with every step, resulting in inefficient and less stable locomotion. Inspired by the radial symmetry and maneuverability of brittle stars and octopuses, we introduce a novel design for a tripedal robot that has both frictional and rolling contacts. Additionally, a freely rotating central sphere provides an added contact point so the robot can retain a stable tripod base of support while lifting and pushing with any one of its legs. The SKating, OmniOriented, Tripedal Robot (SKOOTR) is more versatile and stable than existing tripedal robots. It is capable of multiple forward gaits, multiple turning maneuvers, obstacle traversal, and stair climbing. SKOOTR has been designed to facilitate customization for diverse applications: it is fully open-source, is constructed with 3D printed or off-the-shelf parts, and costs approximately 500 USD to build. A project page with CAD files, assembly guide, and links to the github repository is posted at https://www.embirlab.com/skootr.

Abstract:
This paper introduces MotionGlot, a model that can generate motion across multiple embodiments with different action dimensions, such as quadruped robots and human bodies. By leveraging the well-established training procedures commonly used in large language models (LLMs), we introduce an instruction-tuning template specifically designed for motionrelated tasks. Our approach demonstrates that the principles underlying LLM training can be successfully adapted to learn a wide range of motion generation tasks across multiple embodiments with different action dimensions. We demonstrate the various abilities of MotionGlot on a set of 6 tasks and report an average improvement of \mathbf3 5. 3 % across tasks. Additionally, we contribute two new datasets: (1) a dataset of expert-controlled quadruped locomotion with approximately 48,000 trajectories paired with direction-based text annotations, and (2) a dataset of over 23,000 situational text prompts for human motion generation tasks. Finally, we conduct hardware experiments to validate the capabilities of our system in real-world applications.

Abstract:
For human operators to effectively task teams of robots, it is critical that they maintain situational awareness about the status of those robots. However, maintaining this situational awareness becomes particularly difficult when there are dynamic changes not only in the members of the robot team, but also in the capabilities of those robots. Prior work has shown that situational awareness can be supported through interfaces that effectively visualize task-relevant information. As such, in this work, we introduce a Capability-Level System for Tracking Robots (CLSTR), a new visualization for supporting operators to maintain an appropriate level of situational awareness over the capabilities of a dynamic robot team. In evaluating CLSTR through an online human-subject study (\mathbfn \boldsymbol= \mathbf1 2 3), we found that a combination of different visual elements within an interface like the use of icons to summarize robot capabilities and animations to indicate team changes can help operators maintain awareness over robot teams.

Abstract:
Cloud robotics enables robots to offload computationally intensive tasks to cloud servers for performance, cost, and ease of management. However, the network and cloud computing infrastructure are not designed for reliable timing guarantees, due to fluctuating Quality-of-Service (QoS). In this work, we formulate an impossibility triangle theorem for: Latency reliability, Singleton server, and Commodity hardware. The LSC theorem suggests that providing replicated servers with uncorrelated failures can exponentially reduce the probability of missing a deadline. We present FogROS2-Probabilistic Latency Reliability (PLR) that uses multiple independent network interfaces to send requests to replicated cloud servers and uses the first response back. We design routing mechanisms to discover, connect, and route through non-default network interfaces on robots. FogROS2-PLR optimizes the selection of interfaces to servers to minimize the probability of missing a deadline. We conduct a cloud-connected driving experiment with two 5 G service providers, demonstrating FogROS2-PLR effectively provides smooth service quality even if one of the service providers experiences low coverage and base station handover. We use 99 Percentile (P99) latency to evaluate anomalous long-tail latency behavior. In one experiment, FogROS2-PLR improves P99 latency by up to 3.7 x compared to using one service provider. We deploy FogROS2-PLR on a physical Stretch 3 robot performing an indoor human-tracking task. Even in a fully covered \textWi-\textFi and 5 G environment, FogROS2-PLR improves the responsiveness of the robot reducing mean latency by 36% and P99 latency by 33%. Code and supplementary can be found on website 11https://github.com/data-capsule/rt-fogros2.

Abstract:
Achieving human-level performance on real world tasks is a north star for the robotics community. We present the first learned robot agent that reaches amateur humanlevel performance in competitive table tennis. Table tennis is a physically demanding sport that takes humans years to master. We contribute (1) a hierarchical and modular policy architecture consisting of (i) low level controllers with their skill descriptors that model their capabilities and (ii) a high level controller that chooses the low level skills, (2) techniques for enabling zero-shot sim-to-real and curriculum building, including an iterative approach (train in sim, deploy in real), and (3) real time adaptation to unseen opponents. Policy performance was assessed through 29 robot vs. human matches of which the robot won 45 % (13/29). All humans were unseen players and their skill level varied from beginner to tournament level. Whilst the robot lost all matches vs. the most advanced players it won 100 % matches vs. beginners and 55 % matches vs. intermediate players, demonstrating solidly amateur humanlevel performance. Videos of the matches can be viewed here1.See sites https://google.com/view/competitive-robot-table-tennis.

Abstract:
Recent results suggest that very large datasets of teleoperated robot demonstrations can be used to train transformer-based models that have the potential to generalize to new scenes, robots, and tasks. However, curating, distributing, and loading large datasets of robot trajectories, which typically consist of video, textual, and numerical modalities - including streams from multiple cameras - remains challenging. We propose Robo-DM, an efficient open-source cloud-based data management toolkit for collecting, sharing, and learning with robot data. With Robo-DM, robot datasets are stored in a self-contained format with Extensible Binary Meta Language (EBML). Robo-DM can significantly reduce the size of robot trajectory data, transfer costs, and data load time during training. Compared to the RLDS format used in OXE datasets, Robo-DM's compression saves space by up to 70x (lossy) and 3.5x (lossless). Robo-DM also accelerates data retrieval by load-balancing video decoding with memory-mapped decoding caches. Compared to LeRobot, a framework that also uses lossy video compression, Robo-DM is up to 50x faster when decoding sequentially. We physically evaluate a model trained by Robo-DM with lossy compression, a pick-and-place task, and In-Context Robot Transformer. Robo-DM uses 75x compression of the original dataset and does not suffer reduction in downstream task accuracy. Code and evaluation scripts can be found on website https://github.com/BerkeleyAutomation/fog_x.

Abstract:
Tactile sensors play a crucial role in enabling robots to interact effectively and safely with objects in everyday tasks. In particular, visuotactile sensors have seen increasing usage in two and three-fingered grippers due to their high-quality feedback. However, a significant gap remains in the development of sensors suitable for humanoid robots, especially five-fingered dexterous hands. One reason is because of the challenges in designing and manufacturing sensors that are compact in size. In this paper, we propose HumanFT, a multimodal visuotactile sensor that replicates the shape and functionality of a human fingertip. To bridge the gap between human and robotic tactile sensing, our sensor features real-time force measurements, high-frequency vibration detection, and overtemperature alerts. To achieve this, we developed a suite of fabrication techniques for a new type of elastomer optimized for force propagation and temperature sensing. Besides, our sensor integrates circuits capable of sensing pressure and vibration. These capabilities have been validated through experiments. The proposed design is simple and cost-effective to fabricate. We believe HumanFT can enhance humanoid robots' perception by capturing and interpreting multimodal tactile information.

Abstract:
This paper presents a novel autonomous drone-based smoke plume tracking system capable of navigating and tracking plumes in highly unsteady atmospheric conditions. The system integrates advanced hardware and software and a comprehensive simulation environment to ensure robust performance in controlled and real-world settings. The quadrotor, equipped with a high-resolution imaging system and an advanced onboard computing unit, performs precise maneuvers while accurately detecting and tracking dynamic smoke plumes under fluctuating conditions. Our software implements a two-phase flight operation: descending into the smoke plume upon detection and continuously monitoring the smoke's movement during in-plume tracking. Leveraging Proportional Integral-Derivative (PID) control and a Proximal Policy Optimization (PPO) based Deep Reinforcement Learning (DRL) controller enables adaptation to plume dynamics. Unreal Engine simulation evaluates performance under various smoke-wind scenarios, from steady flow to complex, unsteady fluctuations, showing that while the PID controller performs adequately in simpler scenarios, the DRL-based controller excels in more challenging environments. Field tests corroborate these findings. This system opens new possibilities for drone-based monitoring in areas like wildfire management and air quality assessment. The successful integration of DRL for real-time decision-making advances autonomous drone control for dynamic environments.

Abstract:
In medical image segmentation tasks, diffusion models have exhibited significant potential. However, mainstream diffusion models show drawbacks including multiple sampling times and slow prediction results. Recently, as a standalone generative network, consistency models have resolved the existing issue. Compared to diffusion models, consistency models can lower the sampling times to once, not only achieving similar generative effects but also significantly accelerating training and prediction. However, they are not suitable for image segmentation tasks. Meanwhile, their application in the medical imaging field has not yet been investigated. Therefore, this study employs the consistency model to perform medical image segmentation tasks, designing multi-scale feature signal supervision modes and loss function guidance to realize model convergence. Experiments have demonstrated that the CTS model is capable of obtaining better medical image segmentation results with a single sampling during the test phase.

Abstract:
Accurate inertial parameter identification is crucial for the simulation and control of robots encountering intermittent contacts with the environment. Classically, robots' inertial parameters are obtained from CAD models that are not precise (and sometimes not available, e.g., Spot from Boston Dynamics), hence requiring identification. To do that, existing methods require access to contact force measurement, a modality not present in modern quadruped and humanoid robots. This paper presents an alternative technique that utilizes joint current/torque measurements -a standard sensing modality in modern robots- to identify inertial parameters without requiring direct contact force measurements. By projecting the whole-body dynamics into the null space of contact constraints, we eliminate the dependency on contact forces and reformulate the identification problem as a linear matrix inequality that can handle physical and geometrical constraints. We compare our proposed method against a common black-box identification method using a deep neural network and show that incorporating physical consistency significantly improves the sample efficiency and generalizability of the model. Finally, we validate our method on the Spot quadruped robot across various locomotion tasks, showcasing its accuracy and generalizability in real-world scenarios over different gaits.

Abstract:
Events cameras are ideal sensors for enabling robots to detect and track objects in highly dynamic environments due to their low latency output, high temporal resolution, and high dynamic range. In this paper, we present the Asynchronous Event Multi-Object Tracking (AEMOT) algorithm for detecting and tracking multiple objects by processing individual raw events asynchronously. AEMOT detects salient event blob features by identifying regions of consistent optical flow using a novel Field of Active Flow Directions built from the Surface of Active Events. Detected features are tracked as candidate objects using the recently proposed Asynchronous Event Blob (AEB) tracker in order to construct small intensity patches of each candidate object. A novel learnt validation stage promotes or discards candidate objects based on classification of their intensity patches, with promoted objects having their position, velocity, size, and orientation estimated at their event rate. We evaluate AEMOT on a new Bee Swarm Dataset, where it tracks dozens of small bees with precision and recall performance exceeding that of alternative event-based detection and tracking algorithms by over 37%. Source code and the labelled event Bee Swarm Dataset will be open sourced. 11https://github.com/angus-apps/AEMOT

Abstract:
The development of prosthetic feet that closely replicate the natural biomechanics of the human foot remains a significant challenge in prosthetics engineering. This paper presents the design and testing of a novel agonist-antagonist architecture for the ankle joint of a passive prosthetic foot featuring an adaptive sole. The ankle mechanism, inspired by the dynamics of the human leg-ankle-foot complex, utilizes compliant elements in an agonist-antagonist configuration to passively achieve an ankle torque close to that of a sound ankle without the need for external actuation. Concurrently, the adaptive sole adjusts its shape in response to different terrains, potentially improving stability and comfort for the user. The theoretical model underlying the proposed design is presented, followed by a preliminary validation through simulations. Finally, a prototype based on the new architecture is tested by a healthy subject using customized walking boots, demonstrating its potential to improve the functional performance of prosthetic feet in diverse environments.

Abstract:
In recent years, imitation learning from large-scale human demonstrations has emerged as a promising paradigm for training robot policies. However, the burden of collecting large quantities of human demonstrations is significant in terms of collection time and the need for access to expert operators. We introduce a new data collection paradigm, RoboCrowd, which distributes the workload by utilizing crowdsourcing principles and incentive design. RoboCrowd helps enable scalable data collection and facilitates more efficient learning of robot policies. We build RoboCrowd on top of ALOHA [1]—a bimanual platform that supports data collection via puppeteering—to explore the design space for crowdsourcing in-person demonstrations in a public environment. We propose three classes of incentive mechanisms to appeal to users' varying sources of motivation for interacting with the system: material rewards, intrinsic interest, and social comparison. We instantiate these incentives through tasks that include physical rewards, engaging or challenging manipulations, as well as gamification elements such as a leaderboard. We conduct a large-scale, two-week field experiment in which the platform is situated in a university café. We observe significant engagement with the system—over 200 individuals independently volunteered to provide a total of over 800 interaction episodes. Our findings validate the proposed incentives as mechanisms for shaping users' data quantity and quality. Further, we demonstrate that the crowdsourced data can serve as useful pre-training data for policies fine-tuned on expert demonstrations—boosting performance up to 20 % compared to when this data is not available. These results suggest the potential for RoboCrowd to reduce the burden of robot data collection by carefully implementing crowdsourcing and incentive design principles. Videos are available at https://robocrowd.github.io.

Abstract:
Personal mobile robotic assistants are expected to find wide applications in industry and healthcare. For example, people with limited mobility can benefit from robots helping with daily tasks, or construction workers can have robots perform precision monitoring tasks on-site. However, manually steering a robot while in motion requires significant concentration from the operator, especially in tight or crowded spaces. This reduces walking speed, and the constant need for vigilance increases fatigue and, thus, the risk of accidents. This work presents a virtual leash with which a robot can naturally follow an operator. We use a sensor fusion based on a custom-built RF transponder, RGB cameras, and a LiDAR. In addition, we customize a local avoidance planner for legged platforms, which enables us to navigate dynamic and narrow environments. We successfully validate on the ANYmal platform [1] the robustness and performance of our entire pipeline in real-world experiments. The video is available at: obstacle-avoidant-leader-following.

Abstract:
This article focuses on predicting how an object can be transformed to a semantically meaningful pose relative to another object, given only one or few examples. Current pose correspondence methods rely on vast 3D object datasets and do not actively consider semantic information, which limits the objects to which they can be applied. We present a novel method for learning cross-object pose correspondence. The proposed method detects interacting object parts, performs one-shot part correspondence, and uses geometric and visual-semantic features. Given one example of two objects posed relative to each other, the model can learn how to transfer the demonstrated relations to unseen object instances. Supplementary details can be found at https://sites.google.com/view/semantic-pose-correspondence

Abstract:
Lifelong imitation learning for manipulation tasks poses significant challenges due to distribution shifts that occur in incremental learning steps. Existing methods often rely on unsupervised skill discovery to construct an ever-growing skill library or distillation from multiple policies, which can lead to scalability issues as diverse manipulation tasks are continually introduced and may fail to ensure a consistent latent space throughout the learning process, leading to catastrophic forgetting of previously learned skills. In this paper, we introduce M2Distill, a multimodal distillation-based method for lifelong imitation learning focusing on preserving consistent latent space across vision, language, and action distributions throughout the learning process. By regulating the shifts in latent representations across different modalities from previous to current steps, and reducing discrepancies in Gaussian Mixture Model (GMM) policies between consecutive learning steps, we ensure that the learned policy retains its ability to perform previously learned tasks while seamlessly integrating new skills. Evaluations on the LIBERO lifelong imitation learning benchmark suites, including LIBERO-OBJECT, LIBERO-GOAL, and LIBERO-SPATIAL, demonstrate that our method consistently outperforms prior state-of-the-art methods across all evaluated metrics.

Abstract:
3D dense captioning aims to describe the crucial regions in 3D visual scenes in the form of natural language. Recent prevailing approaches achieve promising results by leveraging complicated structures incorporated with large-scale models, which necessitate abundant parameters and pose challenges regarding its practical applications. Besides, with limited training data, 3D dense captioners are often susceptible to overfitting, directly degrading caption generation performance. Drawing inspiration from the recent advancements in knowledge distillation, we propose a novel approach termed Prototypical Momentum Distillation (PMD) to prompt the model to generate more detailed captions. PMD incorporates Momentum Distillation (MD) with an Uncertainty-aware Prototype-anchored Clustering (UPC) strategy to transfer knowledge by considering the uncertainty of the teacher knowledge. Specifically, we employ the original captioner as the student model and maintain an Exponential Moving Average (EMA) copy of the captioner as the teacher model to impart knowledge as the auxiliary supervision of the student. To abate the misleading caused by uncertain knowledge, we present an Uncertainty-aware Prototype-anchored Clustering (UPC) strategy to cluster the distilled knowledge according to its confidence. We then transfer the rearranged knowledge from the teacher to guide the training route of the student. We conduct extensive experiments and ablation studies on two widely used benchmark datasets, ScanRefer and Nr3D. Experimental results demonstrate that PMD outperforms all state-of-the-art approaches on the benchmarks with MLE training, highlighting its effectiveness.

Abstract:
Underwater object-level mapping requires incorporating visual foundation models to handle the uncommon and often previously unseen object classes encountered in marine scenarios. In this work, a metric of semantic uncertainty for open-set object detections produced by visual foundation models is calculated and then incorporated into an object-level uncertainty tracking framework. Object-level uncertainties and geometric relationships between objects are used to enable robust object-level loop closure detection for unknown object classes. The above loop closure detection problem is formulated as a graph matching problem. While graph matching, in general, is NP-Complete, a solver for an equivalent formulation of the proposed graph matching problem as a graph editing problem is tested on multiple challenging underwater scenes. Results for this solver as well as three other solvers demonstrate that the proposed methods are feasible for real-time use in marine environments for the robust, open-set, multi-object, semantic-uncertainty-aware loop closure detection. Further experimental results on the KITTI dataset demonstrate that the method generalizes to large-scale terrestrial scenes.

Abstract:
The development of novel autonomous underwater gliders has been hindered by limited shape diversity, primarily due to the reliance on traditional design tools that depend heavily on manual trial and error. Building an automated design framework is challenging due to the complexities of representing glider shapes and the high computational costs associated with modeling complex solid-fluid interactions. In this work, we introduce an AI-enhanced automated computational framework designed to overcome these limitations by enabling the creation of underwater robots with non-trivial hull shapes. Our approach involves an algorithm that cooptimizes both shape and control signals, utilizing a reducedorder geometry representation and a differentiable neural-network-based fluid surrogate model. This end-to-end design workflow facilitates rapid iteration and evaluation of hydrodynamic performance, leading to the discovery of optimal and complex hull shapes across various control settings. We validate our method through wind tunnel experiments and swimming pool gliding tests, demonstrating that our computationally designed gliders surpass manually designed counterparts in terms of energy efficiency. By addressing challenges in efficient shape representation and neural fluid surrogate models, our work paves the way for the development of highly efficient underwater gliders, with implications for long-range ocean exploration and environmental monitoring.

Abstract:
Exoskeleton controllers have recently employed machine learning (ML) techniques to provide appropriate assistance throughout the terrains of the real world. One successful approach has been to learn a mapping between an exoskeleton wearer's kinematic measurements and a gait state vector that encodes how the wearer is currently walking (i.e. gait phase, speed), and then dynamically update the assistance based on the gait state. However, these methods require paired datasets of input kinematics to output gait states, which usually involves manual, time-consuming labeling of data from participants wearing specific exoskeletons and thus limits the scalability of these ML methods. A prior solution to this challenge—leveraging large pre-labeled datasets of normative human walking—introduces another problem, in that networks trained on these datasets learn only normative locomotion patterns, and thus may deteriorate when the data are changed by wearing the exoskeleton itself. In this context, we present an unsupervised-learning-based approach to both bypass the requirement of labeled data for gait state prediction and address the difficulty of domain adaptation from normative to exoskeleton-assisted walking. We validate our method in a set of walking simulations that featured exoskeleton data from 14 participants. This model showed significant improvements in state estimation relative to a model trained solely on pre-labeled normative walking, while also not requiring ground truth labels. This work presents a foundation that demonstrates labeled, device-specific data may not be required for predicting walking behavior in real time.

Abstract:
Observing that the key for robotic action planning is to understand the target-object motion when its associated part is manipulated by the end effector, we propose to generate the 3D object-part scene flow and extract its transformations to solve the action trajectories for diverse embodiments. The advantage of our approach is that it derives the robot action explicitly from object motion prediction, yielding a more robust policy by understanding the object motions. Also, beyond policies trained on embodiment-centric data, our method is embodiment-agnostic, generalizable across diverse embodiments, and being able to learn from human demonstrations. Our method comprises three components: an object-part predictor to locate the part for the end effector to manipulate, an RGBD video generator to predict future RGBD videos, and a trajectory planner to extract embodiment-agnostic transformation sequences and solve the trajectory for diverse embodiments. Trained on videos even without trajectory data, our method still outperforms existing works significantly by 27.7% and 26.2% on the prevailing virtual environments MetaWorld and Franka-Kitchen, respectively. Furthermore, we conducted real-world experiments, showing that our policy, trained only with human demonstration, can be deployed to various embodiments.

Abstract:
Recent advancements in robotics, control, and machine learning have facilitated progress in the challenging area of object manipulation. These advancements include, among others, the use of deep neural networks to represent dynamics that are partially observed by robot sensors, as well as effective control using sparse control signals. In this work, we explore a more general problem: the manipulation of acoustic waves, which are partially observed by a robot capable of influencing the waves through spatially sparse actuators. This problem holds great potential for the design of new artificial materials, ultrasonic cutting tools, energy harvesting, and other applications. We develop an efficient data-driven method for robot learning that is applicable to either focusing scattered acoustic energy in a designated region or suppressing it, depending on the desired task. The proposed method is better in terms of a solution quality and computational complexity as compared to a state-of-the-art learning based method for manipulation of dynamical systems governed by partial differential equations. Furthermore our proposed method is competitive with a classical semi-analytical method in acoustics research on the demonstrated tasks. We have made the project code publicly available, along with a web page featuring video demonstrations: https://gladisor.github.io/waves/.

Abstract:
Pose estimation is a crucial problem in simultaneous localization and mapping (SLAM). However, developing a robust and consistent state estimator remains a significant challenge, as the traditional extended Kalman filter (EKF) struggles to handle the model nonlinearity, especially for inertial measurement unit (IMU) and light detection and ranging (LiDAR). To provide a consistent and efficient solution of pose estimation, we propose Eq-LIO, a robust state estimator for tightly coupled LIO systems based on an equivariant filter (EqF). Compared with the invariant Kalman filter based on the SE2 (3) group structure, the EqF uses the symmetry of the semi-direct product group to couple the system state including IMU bias, navigation state, and LiDAR extrinsic calibration state, thereby suppressing linearization error and improving the behavior of the estimator in the event of unexpected state changes. The proposed Eq-LIO owns natural consistency and higher robustness, which is theoretically proven with mathematical derivation and experimentally verified through a series of tests on both public and private datasets.

Abstract:
Self-mixing interferometry (SMI) has been lauded for its sensitivity in detecting microvibrations, while requiring no physical contact with its target. In robotics, microvibrations have traditionally been interpreted as a marker for object slip, and recently as a salient indicator of extrinsic contact. We present the first-ever robotic fingertip making use of SMI for slip and extrinsic contact sensing. The design is validated through measurement of controlled vibration sources, both before and after encasing the readout circuit in its fingertip package. Then, the SMI fingertip is compared to acoustic sensing through four experiments. The results are distilled into a technology decision map. SMI was found to be more sensitive to subtle slip events and significantly more resilient against ambient noise. We conclude that the integration of SMI in robotic fingertips offers a new, promising branch of tactile sensing in robotics. Design and data files are available at https://github.com/RemkoPr/icra2025-SMI-tactile-sensing.

Abstract:
Current autonomous driving technology typically relies on high-precision (HD) maps to ensure safe, reliable, and accurate navigation in urban environments. While these maps provide essential road information, their creation and maintenance are costly, limiting their widespread application. To mitigate this reliance, we propose a novel system, Language-Assisted Continuous Navigation in Structured Spaces (LACNS). LACNS facilitates autonomous driving without the need for HD maps by integrating vehicle-centric local perception with real-time language instructions from map software or human navigators. LACNS begins by generating a BEV map using the vehicle's front-facing camera. Simultaneously, a pretrained Visual Language Model (VLM) detects intersections from the camera images, assigning a score to each. Road elements are then extracted from the BEV map and combined with the intersection scores to identify potential navigation frontiers. Language instructions, processed by a pretrained Large Language Model(LLM), are used to select the most suitable frontier. Finally, the chosen frontier and BEV map are employed to plan a safe route and control the vehicle's movement. We evaluated LACNS using the Carla simulator to validate its navigation capabilities in continuous spaces. Initial experiments involved navigating through four intersections with varying directional instructions, where LACNS demonstrated high and consistent success rates across multiple trials. Further simulations in real-time navigation scenarios revealed that LACNS consistently maintained a high success rate across three progressively challenging routes. These results highlight the effectiveness of our novel autonomous driving navigation method without HD maps.

Abstract:
Safe navigation of self-driving cars and robots requires a precise understanding of their environment. Training data for perception systems cannot cover the wide variety of objects that may appear during deployment. Thus, reliable identification of unknown objects, such as wild animals and untypical obstacles, is critical due to their potential to cause serious accidents. Significant progress in semantic segmentation of anomalies has been facilitated by the availability of out-of-distribution (OOD) benchmarks. However, a comprehensive understanding of scene dynamics requires the segmentation of individual objects, and thus the segmentation of instances is essential. Development in this area has been lagging, largely due to the lack of dedicated benchmarks. The situation is similar in object detection. While there is interest in detecting and potentially tracking every anomalous object, the availability of dedicated benchmarks is clearly limited. To address this gap, this work extends some commonly used anomaly segmentation benchmarks to include the instance segmentation and object detection tasks. Our evaluation of anomaly instance segmentation and object detection methods shows that both of these challenges remain unsolved problems. We provide a competition and benchmark website under https://vision.rwth-aachen.de/oodis.

Abstract:
This paper studies high-speed online planning in dynamic environments. The problem requires finding time-optimal trajectories that conform to system dynamics, meeting computational constraints for real-time adaptation, and accounting for uncertainty from environmental changes. To address these challenges, we propose a sampling-based online planning algorithm that leverages neural network inference to replace time-consuming nonlinear trajectory optimization, enabling rapid exploration of multiple trajectory options under uncertainty. The proposed method is applied to the drone interception problem, where a defense drone must intercept a target while avoiding collisions and handling imperfect target predictions. The algorithm efficiently generates trajectories toward multiple potential target drone positions in parallel. It then assesses trajectory reachability by comparing traversal times with the target drone's predicted arrival time, ultimately selecting the minimum-time reachable trajectory. Through extensive validation in both simulated and real-world environments, we demonstrate our method's capability for high-rate online planning and its adaptability to unpredictable movements in unstructured settings.

Abstract:
Centipede-like robots offer an effective and robust solution to navigation over complex terrain with minimal sensing. However, when climbing over obstacles, such multi-legged robots often elevate their center-of-mass into unstable configurations, where even moderate terrain uncertainty can cause tipping. Robust mechanisms for such elongate multi-legged robots to self-right remain unstudied. Here, we use a comparative biological and robophysical approach to investigate self-righting strategies. We first released S. polymorpha upside down from a 10 cm height and recorded their self-righting behaviors using top and side view high-speed cameras. Using kinematic analysis, we hypothesize that these behaviors can be prescribed by two traveling waves superimposed in the body's lateral and vertical planes, respectively. We tested our hypothesis on an elongate robot with static (non-actuated) limbs, and we successfully reconstructed these self-righting behaviors. We further evaluated how wave parameters affect self-righting effectiveness. We identified two key wave parameters: the spatial frequency, which characterizes the sequence of body-rolling, and the wave amplitude, which characterizes body curvature. By empirically obtaining a behavior diagram of spatial frequency and amplitude, we identify effective and versatile self-righting strategies for general elongate multi-legged robots, which greatly enhances these robots' mobility and robustness in practical applications such as agricultural terrain inspection and search-and-rescue.

Abstract:
This paper delves into the fusion of quantum computing and robotics, focusing on motion planning in cluttered environments. Traditional algorithms struggle with complex problems where many constraints need to be satisfied. Hence, optimization-based approaches such as Constrained Quadratic Models (CQM) have become increasingly popular. Our work presents a 3D tracking algorithm based on CQM uniquely adapted for quantum computers to address computational challenges. With their parallel processing capabilities, Quantum computers offer a groundbreaking approach to optimizing complex problems. We formulate the CQM problem for efficient resolution on the D-Wave quantum computer, showcasing its superiority over classical counterparts. Our application centers on real-time planning in a target-chaser tracking scenario, highlighting the quantum advantage in handling the computation complexity of constrained problems. This paper bridges the quantum-robotics gap and sets the stage for future research in quantum-enhanced robotic motion planning.

Abstract:
Swarmalators move as a function of their pairwise phase interactions, and control their phase as a function of their relative position or motion to other agents. This enables dual sync and swarm behaviors that mimic those exhibited by diverse natural and artificial swarms; these behaviors have almost entirely been explored only through computational simulations. Here, we realize through a 15-robot collective many of the predicted swarmalator behaviors when agents are chiral and non-chiral, when there is frequency coupling, and when the natural frequency distribution is homogeneous and heterogeneous. This work presents an experimental platform that can realize many theoretically predicted collective behaviors, it sheds light on the differences between the simulations and experiments, and it will serve in future studies to realize swarmalator and active matter collective behaviors.

Abstract:
Pneumatic Artificial Muscles (PAMs) are complex nonlinear systems characterized by hysteresis, making them challenging to model with classical system identification methods. While deep learning has emerged as a powerful tool for modeling nonlinear systems from data, purely neural networkbased models often lack interpretability and are prone to overfitting. To address these challenges, this study explores several hybrid approaches that combine analytical models with neural networks to model PAM behavior more effectively. The results demonstrate that hybrid models significantly outperform both purely analytical and black-box neural network models, particularly in terms of generalization and dynamic accuracy. Among the approaches, the Physics-Informed Neural Network (PINN) unsupervised model shows the most robust performance, capturing complex PAM dynamics while maintaining computational efficiency. These findings suggest that hybrid modeling is a promising and scalable solution for accurately representing the intricate behavior of PAMs.

Abstract:
Traditional visible light cameras are prone to performance degradation under varying weather and lighting conditions. To address this challenge, we introduce an eventbased camera and propose a novel hierarchical spatiotemporal fusion approach for event-visible object detection. Our method enhances detection performance by integrating data from both event-based and visible light cameras. We have designed three key modules: The Gated Event Accumulation Representation module (GEAR), the Temporal Feature Selection module (TFS), and the Adaptive Fusion module (AF). GEAR and TFS enhance temporal feature fusion at both image and feature levels, while AF effectively integrates multi-modal features with low computational complexity. Our approach has been trained and validated on the publicly available DSEC-Detection dataset, achieving mAP50 and mAP50-95 scores of 67.2% and 45.6%, respectively, demonstrating superior detection performance and validating the effectiveness of the proposed method.

Abstract:
Building models responsive to input prompts represents a transformative shift in machine learning. This paradigm holds significant potential for robotics problems, such as targeted manipulation amidst clutter. In this work, we present a novel approach to combine promptable foundation models with reinforcement learning (RL), enabling robots to perform dexterous manipulation tasks in a prompt-responsive manner. Existing methods struggle to link high-level commands with fine-grained dexterous control. We address this gap with a memory-augmented student-teacher learning framework. We use the Segment-Anything 2 (SAM2) model as a perception backbone to infer an object of interest from user prompts. While detections are imperfect, their temporal sequence provides rich information for implicit state estimation by memory-augmented models. Our approach successfully learns prompt-responsive policies, demonstrated in picking objects from cluttered scenes. Videos and code are available at https://memory-student-teacher.github.io

Abstract:
Neural Radiance Fields (NeRF) presented a novel way to represent scenes, allowing for high-quality 3D reconstruction from 2D images. Following its remarkable achievements, global localization within NeRF maps is an essential task for enabling a wide range of applications. Recently, Loc-NeRF demonstrated a localization approach that combines traditional Monte Carlo Localization with NeRF, showing promising results for using NeRF as an environment map. However, despite its advancements, Loc-NeRF encounters the challenge of a time-intensive ray rendering process, which can be a significant limitation in practical applications. To address this issue, we introduce Fast Loc-NeRF, which enhances efficiency and accuracy in NeRF map-based global localization. We propose a particle rejection weighting strategy that estimates the uncertainty of particles by leveraging NeRF's inherent characteristics and incorporates them into the particle weighting process to reject abnormal particles. Additionally, Fast Loc-NeRF employs a coarse-to-fine approach, matching rendered pixels and observed images across multiple resolutions from low to high. As a result, it speeds up the costly particle update process while enhancing precise localization results. Our Fast Loc-NeRF establishes new state-of-the-art localization performance on several bench-marks, demonstrating both its accuracy and efficiency. The code is available at this url.

Abstract:
Perception using whisker-inspired tactile sensors currently faces a major challenge: the lack of active control in robots based on direct contact information from the whisker. To accurately reconstruct object contours, it is crucial for the whisker sensor to continuously follow and maintain an appropriate relative touch pose on the surface. This is especially important for localization based on tip contact, which has a low tolerance for sharp surfaces and must avoid slipping into tangential contact. In this paper, we first construct a magnetically transduced whisker sensor featuring a compact and robust suspension system composed of three flexible spiral arms. We develop a method that leverages a characterized whisker deflection profile to directly extract the tip contact position using gradient descent, with a Bayesian filter applied to reduce fluctuations. We then propose an active motion control policy to maintain the optimal relative pose of the whisker sensor against the object surface. A B-Spline curve is employed to predict the local surface curvature and determine the sensor orientation. Results demonstrate that our algorithm can effectively track objects and reconstruct contours with sub-millimeter accuracy. Finally, we validate the method in simulations and real-world experiments where a robot arm drives the whisker sensor to follow the surfaces of three different objects.

Abstract:
The use of robots in high-risk and extreme environments is crucial for tasks that are dangerous or inaccessible to humans and require high precision. Particularly in scenarios where the cost of failure is high, remote human teleoperation can be the preferred method of robot control due to the adaptability and high-level decision making of humans. Teleoperation brings many challenges including lack of accurate prior knowledge about the environment, limited views of the environment by on-board sensors, and especially inconsistent latency. 7-DOF (degrees of freedom) manipulators provide redundancy which can be utilized for increased flexibility in manipulation, and may be preferred to 6-DOF manipulators in many scenarios. The redundancy, however, must be considered by the teleoperation system. We present an extension to an existing Interactive Planning and Supervised Execution (IPSE) system that enables full teleoperation of a 7-DOF robot by encoding the redundant degree of freedom with a Shoulder-Elbow-Wrist (SEW) angle, which is user-manipulable via an SEW angle graph. Additionally, we introduce a novel user interface feature that encodes robot state information into a 2D image which is displayed directly on the SEW angle graph. We conduct a user-study which demonstrates that the addition of this SEW graph significantly reduces task completion time.

Abstract:
Conventional robot programming methods are complex and time-consuming for users. In recent years, alternative approaches such as mixed reality have been explored to address these challenges and optimize robot programming. While the findings of the mixed reality robot programming methods are convincing, most existing methods rely on gesture interaction for robot programming. Since controller-based interactions have proven to be more reliable, this paper examines three controller-based programming methods within a mixed reality scenario: 1) Classical Jogging, where the user positions the robot's end effector using the controller's thumbsticks, 2) Direct Control, where the controller's position and orientation directly corresponds to the end effector's, and 3) Gripper Control, where the controller is enhanced with a 3D-printed gripper attachment to grasp and release objects. A within-subjects study (n = 30) was conducted to compare these methods. The findings indicate that the Gripper Control condition outperforms the others in terms of task completion time, user experience, mental demand, and task performance, while also being the preferred method. Therefore, it demonstrates promising potential as an effective and efficient approach for future robot programming. Video available at https://youtu.be/83kWr8zUFIQ.

Abstract:
To support sustainable infrastructure on the Moon, NASA is developing the In-Situ Resource Utilization (ISRU) Pilot Excavator (IPEx) to extract and transport lunar regolith for processing and construction. During its mission, IPEx will execute various driving patterns, primarily cycling between excavation and unloading sites, with additional ma-neuvers such as circular traverses around the lander and raster scans for environmental mapping. In this work, dynamic move-ment primitives (DMPs) are used to represent these patterns. We augment the DMPs with a vision-based real-time obstacle avoidance system to navigate surface hazards, such as rocks, encountered during traversal. Our approach is evaluated in a high-fidelity simulation replicating the challenging environment of the lunar south pole to demonstrate IPEx's ability to adapt to surface hazards while fulfilling its operational tasks.

Abstract:
Vision-and-Language Navigation (VLN) empowers agents to associate time-sequenced visual observations with corresponding instructions to make sequential decisions. However, dealing with visually diverse scenes or transitioning from simulated environments to real-world deployment is still challenging. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a lowheight field of view, proposing a Ground-level Viewpoint Navigation (GVNav) approach to mitigate this issue. This work represents the first attempt to highlight the generalization gap in VLN across varying heights of visual observation in realistic robot deployments. Our approach leverages weighted historical observations as enriched spatiotemporal contexts for instruction following, effectively managing feature collisions within cells by assigning appropriate weights to identical features across different viewpoints. This enables low-height robots to overcome challenges such as visual obstructions and perceptual mismatches. Additionally, we transfer the connectivity graph from the HM3D and Gibson datasets as an extra resource to enhance spatial priors and a more comprehensive representation of real-world scenarios, leading to improved performance and generalizability of the waypoint predictor in real-world environments. Extensive experiments demonstrate that our Groundlevel Viewpoint Navigation (GVnav) approach significantly improves performance in both simulated environments and real-world deployments with quadruped robots.

Abstract:
We present an image blending pipeline, IBURD, that creates realistic synthetic images to assist in the training of deep detectors for use on underwater autonomous vehicles (AUVs) for marine debris detection tasks. Specifically, IBURD generates both images of underwater debris and their pixel-level annotations, using source images of debris objects, their annotations, and target background images of marine environments. With Poisson editing and style transfer techniques, IBURD is even able to robustly blend transparent objects into arbitrary backgrounds and automatically adjust the style of blended images using the blurriness metric of target background images. These generated images of marine debris in actual underwater backgrounds address the data scarcity and data variety problems faced by deep-learned vision algorithms in challenging underwater conditions, and can enable the use of AUVs for environmental cleanup missions. Both quantitative and robotic evaluations of IBURD demonstrate the efficacy of the proposed approach for robotic detection of marine debris.

Abstract:
In biomechanics and robotics, elasticity plays a crucial role in enhancing locomotion efficiency and stability. Traditional approaches in legged robots often employ series elastic actuators (SEA) with discrete rigid components, which, while effective, add weight and complexity. This paper presents an innovative alternative by integrating continuously compliant structures into the lower legs of a bipedal robot, fundamentally transforming the SEA concept. Our approach replaces traditional rigid segments with lightweight, deformable materials, reducing overall mass and simplifying the actuation design. This novel design introduces unique challenges in modeling, sensing, and control, due to the infinite dimensionality of continuously compliant elements. We address these challenges through effective approximations and control strategies. The paper details the design and modeling of the compliant leg structure, presents low-level force and kinematics controllers, and introduces a high-level posture controller with a gait scheduler. Experimental results demonstrate successful bipedal walking using this new design.

Abstract:
AI-controlled robotic systems can introduce significant risks to both humans and the environment. Traditional reliability assessment methods fall short in addressing the complexities of these systems, particularly when dealing with black-box or dynamically changing control policies. The traditional approaches are applied manually and do not consider frequent software updates. In this paper, we present RelAIBotiX, a new methodology that enables dynamic and continuous reliability assessment, specifically tailored for robotic systems controlled by AI algorithms. RelAIBotiX combines four methods: (i) Skill Detection that automatically identifies executed skills using deep learning techniques, (ii) Behavioral Analysis that creates an operational profile of the robotic system containing information about the skill execution sequence, active components for each skill, and their utilization intensity that influence their failure rate, (iii) Reliability Model Generation that automatically transforms the operational profile and reliability data of robotic hardware components into quantitative hybrid reliability models, and (iv) Reliability Model Solver for the numerical evaluation of the generated reliability models. Our evaluation included computing the reliability of the system, the probability of failure of individual skills, and component sensitivity analysis. We validated the applicability of the proposed framework across five simulative and real-world setups.

Abstract:
In general, humans would grasp an object differently for different tasks, e.g., “grasping the handle of a knife to cut” vs. “grasping the blade to hand over”. In the field of robotic grasp pose detection research, some existing works consider this task-oriented grasping and made some progress, but they are generally constrained by low-DoF gripper type or non-cluttered setting, which is not applicable for human assistance in real life. With an aim to get more general and practical grasp models, in this paper, we investigate the problem named Task-Oriented 6-DoF Grasp Pose Detection in Clutters (TO6DGC), which extends the task-oriented problem to a more general 6-DOF Grasp Pose Detection in Cluttered (multi-object) scenario. To this end, we construct a large-scale 6-DoF task-oriented grasping dataset, 6-DoF Task Grasp (6DTG), which features 4391 cluttered scenes with over 2 million 6-DoF grasp poses. Each grasp is annotated with a specific task, involving 6 tasks and 198 objects in total. Moreover, we propose One-Stage TaskGrasp (OSTG), a strong baseline to address the TO6DGC problem. Our OSTG adopts a task-oriented point selection strategy to detect where to grasp, and a task-oriented grasp generation module to decide how to grasp given a specific task. To evaluate the effectiveness of OSTG, extensive experiments are conducted on 6DTG. The results show that our method outperforms various baselines on multiple metrics. Real robot experiments also verify that our OSTG has a better perception of the task-oriented grasp points and 6-DoF grasp poses.

Abstract:
Robot design has traditionally been costly and labor-intensive. Despite advancements in automated processes, it remains challenging to navigate a vast design space while producing physically manufacturable robots. We introduce Text2Robot, a framework that converts user text specifications and performance preferences into physical quadrupedal robots. Within minutes, Text2Robot can use text-to-3D models to provide strong initializations of diverse morphologies. Within a day, our geometric processing algorithms and body-control co-optimization produce a walking robot by explicitly considering real-world electronics and manufacturability. Text2Robot enables rapid prototyping and opens new opportunities for robot design with generative models. Our website is at http://generalroboticslab.com/Text2Robot/.

Abstract:
CAD models are widely used in industry and are essential for robotic automation processes. However, these models are rarely considered in novel AI-based approaches, such as the automatic synthesis of robot programs, as there are no readily available methods that would allow CAD models to be incorporated for the analysis, interpretation, or extraction of information. To address these limitations, we propose QueryCAD, the first system designed for CAD question answering, enabling the extraction of precise information from CAD models using natural language queries. QueryCAD incorporates SegCAD, an open-vocabulary instance segmentation model we developed to identify and select specific parts of the CAD model based on part descriptions. We further propose a CAD question answering benchmark to evaluate QueryCAD and establish a foundation for future research. Lastly, we integrate QueryCAD within an automatic robot program synthesis framework, validating its ability to enhance deep-learning solutions for robotics by enabling them to process CAD models. https://claudius-kienle.github.com/querycad.

Abstract:
In-context imitation learning is the capability to perform novel tasks when prompted with task demonstration examples. In-Context Robot Transformer (ICRT) is a causal transformer that performs autoregressive prediction on sensorimotor trajectories, which include images, proprioceptive states, and actions. This approach supports flexible and training-free execution of new tasks at test time. Experiments with a Franka Emika robot demonstrate that ICRT can adapt to new environment configurations that differ from both the prompt and the training data. In a multi-task environment setup, ICRT significantly outperforms current state-of-the-art robot foundation models on generalization to unseen tasks. Code, data, and appendix are available on https://icrt.dev.

Abstract:
End-to-end autonomous driving with vision-only is not only more cost-effective compared to LiDAR-vision fusion but also more reliable than traditional methods. To achieve a economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework, which generates 3D occupancy labels using a self-supervised gaussian-based Img2Occ Module, then encodes the labels by AM-VAE, and uses world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying AM-VAE to encode air and non-air separately, RenderWorld achieves more fine-grained scene element representation, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning from autoregressive world model.

Abstract:
The efficient planning of stacking boxes, especially in the online setting where the sequence of item arrivals is unpredictable, remains a critical challenge in modern warehouse and logistics management. Existing solutions often address box size variations, but overlook their intrinsic and physical properties, such as density and rigidity, which are crucial for real-world applications. We use reinforcement learning (RL) to solve this problem by employing action space masking to direct the RL policy toward valid actions. Unlike previous methods that rely on heuristic stability assessments which are difficult to assess in physical scenarios, our framework utilizes online learning to dynamically train the action space mask, eliminating the need for manual heuristic design. Extensive experiments demonstrate that our proposed method outperforms existing state-of-the-arts. Furthermore, we deploy our learned task planner in a real-world robotic palletizer, validating its practical applicability in operational settings. The code is available at https://github.com/tianqi-zh/palletization.

Abstract:
Compared to conventional 3D radar, the 4D imaging radar provides additional height data and finer resolution measurements. Moreover, compared to LiDAR sensors, 4D imaging radar is more cost-effective and offers enhanced durability against challenging weather conditions. Despite these advantages, radar-based localization systems face several challenges, including limited resolution, leading to scattered object recognition and less precise localization. Additionally, existing methods that form submaps from filtered results can accumulate errors, leading to blurred submaps and reducing the accuracy of the SLAM and odometry. To address these challenges, this paper introduces Radar4VoxMap, a novel approach designed to enhance radar-only odometry. The method includes an RCS-weighted voxel distribution map that improves registration accuracy. Furthermore, fixed-lag optimization with the graph is used to optimize both the submap and pose, effectively reducing cumulative errors. The proposed method has shown strong performance on open datasets. The code is available at: https://github.com/ailab-hanyang/Radar4VoxMap

Abstract:
Astract-Recent advances in foundation models are significantly expanding the capabilities of AI models. As part of this progress, this paper introduces a robot design framework that uses a diffusion model approach for generating 3D mesh structures. Specifically, we focus on generating directly fabri-cable robot structures that require no post-processing guided by human-imposed design constraints. Our approach can find the optimal design of the robot by optimizing or composing embedding vectors of the model. The efficacy of the framework is validated through an application to design, fabricate, and evaluate a jumping robot. Our solution is an optimized jumping robot with a 41% increase in jump height compared to the state-of-the-art design. Additionally, when the robot is augmented with an optimized foot, it can land reliably with a success ratio of 88% in contrast to the 4% success ratio of the base robot.

Abstract:
This paper addresses the problem of optimizing communicated information among heterogeneous, resourceaware robot teams to facilitate their navigation. In such operations, a mobile robot compresses its local map to assist another robot in reaching a target within an uncharted environment. The primary challenge lies in ensuring that the map compression step balances network load while transmitting only the most essential information for effective navigation. We propose a communication framework that sequentially selects the optimal map compression in a task-driven, communicationaware manner. It introduces a decoder capable of iterative map estimation, handling noise through Kalman filter techniques. The computational speed of our decoder allows for a larger compression template set compared to previous methods, and enables applications in more challenging environments. Specifically, our simulations demonstrate a remarkable 98% reduction in communicated information, compared to a framework that transmits the raw data, on a large Mars inclination map and an Earth map, all while maintaining similar planning costs. Furthermore, our method significantly reduces computational time compared to the state-of-the-art approach.

Abstract:
We propose GAGrasp, a novel framework for dexterous grasp generation that leverages geometric algebra representations to enforce equivariance to S E(3) transformations. By encoding the S E(3) symmetry constraint directly into the architecture, our method improves data and parameter efficiency while enabling robust grasp generation across diverse object poses. Additionally, we incorporate a differentiable physics-informed refinement layer, which ensures that generated grasps are physically plausible and stable. Extensive experiments demonstrate the model's superior performance in generalization, stability, and adaptability compared to existing methods. Additional details at gagrasp.github.io

Abstract:
Existing one-stream trackers have attracted widespread attention. However, they are not applicable in real-time aerial robot tracking systems due to substantial computational overhead, especially when dynamic templates are introduced. To address this issue, we propose a novel Dynamic Compact Consensus Tracker (DC2T), constructed by stacking blocks that each consists of a Compact Token Encoder (CTE) and Dynamic Consensus Attention (DCA). Unlike traditional methods that convert images into a large number of tokens, the CTE, inspired by “superpixel”, extracts a compact set of representative tokens from both initial and dynamic templates, eliminating the need for a large token set. This strategic reduction in the number of compact tokens markedly decreases the computational load of CTE, enhancing the efficiency of subsequent attention operations. To achieve linear complexity of the DCA, compact dynamic template tokens (as keys) are requeried by search tokens (as queries) to perform dynamic consensus on the aggregated tokens (as values). This arrangement seamlessly incorporates dynamic spatio-temporal features into the DCA while avoiding the computational burden typically associated with dynamic templates. With the aim of further enhancing the system's responsiveness and accuracy, a direct control network is crafted to seamlessly incorporate the prediction of high-level control values into the tracking network, ensuring a cohesive and efficient interaction with the controller. Comprehensive experiments and real-world evaluations have proven DC2T's superior performance, accompanied by a significant reduction in FLOPs. Furthermore, we have conducted experiments that demonstrate the tracker's ability to integrate seamlessly with other technologies such as SLAM and detection, enabling precise tracking of arbitrary objects. The tracker code will be released in the github. com/xiaolousun/ refine-pytracking.

Abstract:
We introduce a novel method for handling endpoint constraints in constrained differential dynamic programming (DDP). Unlike existing approaches, our method guarantees quadratic convergence and is exact, effectively managing rank deficiencies in both endpoint and stagewise equality constraints. It is applicable to both forward and inverse dynamics formulations, making it particularly well-suited for model predictive control (MPC) applications and for accelerating optimal control (OC) solvers. We demonstrate the efficacy of our approach across a broad range of robotics problems and provide a userfriendly open-source implementation within Crocoddyl.

Abstract:
Robots often need to convey information to human users. For example, robots can leverage visual, auditory, and haptic interfaces to display their intent or express their internal state. In some scenarios there are socially agreed upon conventions for what these signals mean: e.g., a red light indicates an autonomous car is slowing down. But as robots develop new capabilities and seek to convey more complex data, the meaning behind their signals is not always mutually understood: one user might think a flashing light indicates the autonomous car is an aggressive driver, while another user might think the same signal means the autonomous car is defensive. In this paper we enable robots to adapt their interfaces to the current user so that the human's personalized interpretation is aligned with the robot's meaning. We start with an information theoretic end-to-end approach, which automatically tunes the interface policy to optimize the correlation between human and robot. But to ensure that this learning policy is intuitive - and to accelerate how quickly the interface adapts to the human - we recognize that humans have priors over how interfaces should function. For instance, humans expect interface signals to be proportional and convex. Our approach biases the robot's interface towards these priors, resulting in signals that are adapted to the current user while still following social expectations. Our simulations and user study results across 15 participants suggest that these priors improve robot-to-human communication. See videos here: https://youtu.be/mO_bz5updDc

Abstract:
In unknown environments, navigating a robot by a given image to a specific location or instance is critical and challenging. The existing end-to-end approaches require simultaneous implicit learning of multiple subtasks, and modular approaches depend on metric information. Both approaches face high computational demands, often leading to difficulties in real-time updates and limited generalization, making them challenging to implement on resource-constrained devices. To address these challenges, we propose Dual Graph Navigation (DGN), a knowledge-driven, lightweight image instance navigation framework. DGN builds an External Knowledge Graph (EKG) from small-scale datasets to capture prior object correlations, efficiently guiding target exploration. During exploration, DGN builds an Internal Knowledge Graph (IKG) using an instance-aware module, which records explored objects based on reachability relationships rather than precise metric information. The IKG dynamically updates the EKG, enhancing the robot's adaptability to the current environment. Together, they realize topological perception and reduce computational overhead. Furthermore, unlike approaches characterized by over-dependence between components, DGN employs a plug-and-play modular design that allows independent training and flexible replacement of functional modules, effectively enhancing generalization performance while reducing training and deployment costs. Experiments illustrate that DGN generalizes well in different simulation environments (AI2-THOR, Habitat), achieving state-of-the-art performance on the ProcTHOR-10K dataset. It is compatible with three distinct real-world robot platforms, including edge computing devices without CUDA support. It exhibits a decision-making speed of 3.8 to 5.5 times over baseline methods. Further details can be found on the project page: https://dogplanningloyo.github.io/DGN/.

Abstract:
General-purpose navigation in challenging environments remains a significant problem in robotics, with current state-of-the-art approaches facing myriad limitations. Classical approaches struggle with cluttered settings and require extensive tuning, while learning-based methods face difficulties generalizing to out-of-distribution environments. This paper introduces X-Mobility, an end-to-end generalizable navigation model that overcomes existing challenges by leveraging three key ideas. First, X-Mobility employs an auto-regressive world modeling architecture with a latent state space to capture world dynamics. Second, a diverse set of multi-head decoders enables the model to learn a rich state representation that correlates strongly with effective navigation skills. Third, by decoupling world modeling from action policy, our architecture can train effectively on a variety of data sources, both with and without expert policies-off-policy data allows the model to learn world dynamics, while on-policy data with supervisory control enables optimal action policy learning. Through extensive experiments, we demonstrate that X-Mobility not only generalizes effectively but also surpasses current state-of-the-art navigation approaches. Additionally, X-Mobility also achieves zero-shot Sim2Real transferability and shows strong potential for crossembodiment generalization. Project page: https://nvlabs.github.io/X-MOBILITY.

Abstract:
Bimanual manipulation is challenging due to precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by presenting a benchmark for bimanual manipulation. A key functionality is the ability to autonomously generate training data without the necessity of human demonstrations to the robot. We open-source our code and benchmark, which comprises 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To initiate the benchmark, we extended multiple state-of-the-art techniques to the domain of bimanual manipulation. The project website with code is available at: http://bimanual.github.io.

Abstract:
The advent of simulation engines has revolutionized learning and operational efficiency for robots, offering cost-effective and swift pipelines. However, the lack of a universal simulation platform tailored for chemical scenarios impedes progress in robotic manipulation and visualization of reaction processes. Addressing this void, we present Chemistry3D, an innovative toolkit that integrates extensive chemical and robotic knowledge. Chemistry3D not only enables robots to perform chemical experiments but also provides real-time visualization of temperature, color, and pH changes during reactions. Built on the NVIDIA Omniverse platform, Chemistry3D offers interfaces for robot operation, visual inspection, and liquid flow control, facilitating the simulation of special objects such as liquids and transparent entities. Leveraging this toolkit, we have devised RL tasks, object detection, and robot operation scenarios. Additionally, to discern disparities between the rendering engine and the real world, we conducted transparent object detection experiments using Sim2Real, validating the toolkit's exceptional simulation performance. The source code is available at https://github.com/huangyan28/Chemistry3D, and a related tutorial can be found at https://www.omni-chemistry.com.

Abstract:
We introduce LUMOS, a language-conditioned multi-task imitation learning framework for robotics. LUMOS learns skills by practicing them over many long-horizon rollouts in the latent space of a learned world model and transfers these skills zero-shot to a real robot. By learning on-policy in the latent space of the learned world model, our algorithm mitigates policy-induced distribution shift which most offline imitation learning methods suffer from. LUMOS learns from unstructured play data with fewer than 1 % hindsight language annotations but is steerable with language commands at test time. We achieve this coherent long-horizon performance by combining latent planning with both image-and language-based hindsight goal relabeling during training, and by optimizing an intrinsic reward defined in the latent space of the world model over multiple time steps, effectively reducing covariate shift. In experiments on the difficult long-horizon CALVIN benchmark, LUMOS outperforms prior learning-based methods with com-parable approaches on chained multi-task evaluations. To the best of our knowledge, we are the first to learn a language-conditioned continuous visuomotor control for a real-world robot within an offline world model. Videos, dataset and code are available at http://lumos.cs.uni-freiburg.de.

Abstract:
3D Gaussian Splatting (3DGS) allows flexible adjustments to scene representation, enabling continuous optimization of scene quality during dense visual simultaneous localization and mapping (SLAM) in static environments. However, 3DGS faces challenges in handling environmental disturbances from dynamic objects with irregular movement, leading to degradation in both camera tracking accuracy and map reconstruction quality. To address this challenge, we develop an RGB-D dense SLAM which is called Gaussian Splatting SLAM in Dynamic Environments (Gassidy). This approach calculates Gaussians to generate rendering loss flows for each environmental component based on a designed photometricgeometric loss function. To distinguish and filter environmental disturbances, we iteratively analyze rendering loss flows to detect features characterized by changes in loss values between dynamic objects and static components. This process ensures a clean environment for accurate scene reconstruction. Compared to state-of-the-art SLAM methods, experimental results on open datasets show that Gassidy improves camera tracking precision by up to 97.9 % and enhances map quality by up to 6 %. Video of experiments is available here: https://www.wixsite.com.com/wen-Gassidy.

Abstract:
Robots operating in an open world will encounter novel objects with unknown physical properties, such as mass, friction, or size. These robots will need to sense these properties through interaction prior to performing downstream tasks with the objects. We propose a method that autonomously learns tactile exploration policies by developing a generative world model that is leveraged to 1) estimate the object's physical parameters using a differentiable Bayesian filtering algorithm and 2) develop an exploration policy using an information-gathering model predictive controller. We evaluate our method on three simulated tasks where the goal is to estimate a desired object property (mass, height or toppling height) through physical interaction. We find that our method is able to discover policies that efficiently gather information about the desired property in an intuitive manner. Finally, we validate our method on a real robot system for the height estimation task, where our method is able to successfully learn and execute an information-gathering policy from scratch.

Abstract:
Neuromorphic Computing (NC) and Spiking Neural Networks (SNNs) in particular are often viewed as the next generation of Neural Networks (NNs). NC is a novel bio-inspired paradigm for energy efficient neural computation, often relying on SNNs in which neurons communicate via spikes in a sparse, event-based manner. This communication via spikes can be exploited by neuromorphic hardware implementations very effectively and results in a drastic reductions of power consumption and latency in contrast to regular GPU-based NNs. In recent years, neuromorphic hardware has become more accessible, and the support of learning frameworks has improved. However, available hardware is partially still experimental, and it is not transparent what these solutions are effectively capable of, how they integrate into real-world robotics applications, and how they realistically benefit energy efficiency and latency. In this work, we provide the robotics research community with an overview of what is possible with SNNs on neuromorphic hardware focusing on real-time processing. We introduce a benchmark of three popular neuromorphic hardware devices for the task of event-based object detection. Moreover, we show that an SNN on a neuromorphic hardware is able to run in a challenging table tennis robot setup in real-time.

Abstract:
LiDAR-based Vehicle-to-Everything (V2X) cooperative perception has demonstrated its impact on the safety and effectiveness of autonomous driving. Since current cooperative perception algorithms are trained and tested on the same dataset, the generalization ability of cooperative perception systems remains underexplored. This paper is the first work to study the Domain Generalization problem of LiDAR-based V2X cooperative perception (V2X-DG) for 3D detection based on four widely-used open source datasets: OPV2V, V2XSet, V2V4Real and DAIR-V2X. Our research seeks to sustain high performance not only within the source domain but also across other unseen domains, achieved solely through training on source domain. To this end, we propose Cooperative Mixup Augmentation based Generalization (CMAG) to improve the model generalization capability by simulating the unseen cooperation, which is designed compactly for the domain gaps in cooperative perception. Furthermore, we propose a constraint for the regularization of the robust generalized feature representation learning: Cooperation Feature Consistency (CFC), which aligns the intermediately fused features of the generalized cooperation by CMAG and the early fused features of the original cooperation in source domain. Extensive experiments demonstrate that our approach achieves significant performance gains when generalizing to other unseen datasets while it also maintains strong performance on the source dataset.

Abstract:
In this paper, we address the problem of reducing the computational burden of Model Predictive Control (MPC) for real-time robotic applications. We propose TransformerMPC, a method that enhances the computational efficiency of MPC algorithms by leveraging the attention mechanism in transformers for both online constraint removal and better warm start initialization. Specifically, TransformerMPC accelerates the computation of optimal control inputs by selecting only the active constraints to be included in the MPC problem, while simultaneously providing a warm start to the optimization process. This approach ensures that the original constraints are satisfied at optimality. TransformerMPC is designed to be seamlessly integrated with any solver, irrespective of its implementation. To guarantee constraint satisfaction after removing inactive constraints, we perform an offline verification to ensure that the optimal control inputs generated by the solver meet all constraints. The effectiveness of TransformerMPC is demonstrated through extensive numerical simulations on complex robotic systems, achieving up to 35 × improvement in runtime without any loss in performance.

Abstract:
We study the problem of sampling robot trajectories and introduce the notion of C-Uniformity. As opposed to the standard method of uniformly sampling control inputs (which lead to biased samples of the configuration space), C-Uniform trajectories are generated by control actions which lead to uniform sampling of the configuration space. After presenting an intuitive closed-form solution to generate C-Uniform trajectories for the 1D random-walker, we present a network-flow based optimization method to precompute action probabilities which lead to C-Uniform trajectories for general robot systems. We apply the notion of C-Uniformity to the design of Model Predictive Path Integral controllers. Through simulation experiments, we show that using C-Uniform trajectories significantly improvs the performance of MPPI-style controllers, achieving up to 40 % coverage performance gain compared to the best baseline. We demonstrate the practical applicability of our method with an implementation on a 1 / 10th scale racer.

Abstract:
Image and video generative models that are pretrained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate sub-goals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photo-realistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively “glue together” language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. We find in extensive experiments in both simulated and real environments that GHIL-Glue achieves a 25% improvement across several hierarchical models that leverage generative subgoals, achieving a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments. Code, model checkpoints, videos, and supplementary materials can be found at https://ghil-glue.github.io.

Abstract:
Metric depth estimation from visual sensors is crucial for robots to perceive, navigate, and interact with their environment. Traditional range imaging setups, such as stereo or structured light cameras, face hassles including calibration, occlusions, and hardware demands, with accuracy limited by the baseline between cameras. Single- and multi-view monocular depth offers a more compact alternative, but is constrained by the unobservability of the metric scale. Light field imaging provides a promising solution for estimating metric depth by using a unique lens configuration through a single device. However, its application to single-view dense metric depth is under-addressed mainly due to the technology's high cost, the lack of public benchmarks, and proprietary geometrical models and software. Our work explores the potential of focused plenoptic cameras for dense metric depth. We propose a novel pipeline that predicts metric depth from a single plenoptic camera shot by first generating a sparse metric point cloud using a neural network, which is then used to scale and align a dense relative depth map regressed by a foundation depth model, resulting in a dense metric depth. To validate it, we curated the Light Field & Stereo Image Dataset11Dataset available at https://zenodo.org/records/14224205. (LFS) of real-world light field images with stereo depth labels, filling a current gap in existing resources. Experimental results show that our pipeline produces accurate metric depth predictions, laying a solid groundwork for future research in this field.22Work partially supported by the DLR Impulse Project SaiNSOR.

Abstract:
Autonomous driving currently lacks robust evidence of energy efficiency when using energy-model-agnostic trajectory planning. To address this, we explore how differential energy models can be effectively utilized under varying driving conditions to enhance energy efficiency. Furthermore, we propose an online nonlinear programming approach that optimizes polynomial trajectories generated by the Frenet polynomial method while incorporating traffic trajectory data and road slope predictions. Through case studies, quantitative analyses, and ablation studies conducted on both sedan and truck models, we demonstrate the effectiveness of the proposed method. [Video]

Abstract:
In contrast to quadruped robots that can navigate diverse terrains using a “blind” policy, humanoid robots require accurate perception for stable locomotion due to their high degrees of freedom and inherently unstable morphology. However, incorporating perceptual signals often introduces additional disturbances to the system, potentially reducing its robustness, generalizability, and efficiency. This paper presents the Perceptive Internal Model (PIM), which relies on onboard, continuously updated elevation maps centered around the robot to perceive its surroundings. We train the policy using ground-truth obstacle heights surrounding the robot in simulation, optimizing it based on the Hybrid Internal Model (HIM), and perform inference with heights sampled from the constructed elevation map. Unlike previous methods that directly encode depth maps or raw point clouds, our approach allows the robot to perceive the terrain beneath its feet clearly and is less affected by camera movement or noise. Furthermore, since depth map rendering is not required in simulation, our method introduces minimal additional computational costs and can train the policy in 3 hours on an RTX 4090 GPU. We verify the effectiveness of our method across various humanoid robots, various indoor and outdoor terrains, stairs, and various sensor configurations. Our method can enable a humanoid robot to continuously climb stairs and has the potential to serve as a foundational algorithm for the development of future humanoid control methods.

Abstract:
‘In-the-wild’ mobile manipulation aims to deploy robots in diverse real-world environments, which requires the robot to (1) have skills that generalize across object configurations; (2) be capable of long-horizon task execution in diverse environments; and (3) perform complex manipulation beyond pick-and-place. Quadruped robots with manipulators hold promise for extending the workspace and enabling robust locomotion, but existing results do not investigate such a capability. This paper proposes WildLMa with three components to address these issues: (1) adaptation of learned low-level controller for VR-enabled whole-body teleoperation and traversability; (2) WildLMa-Skill - a library of generalizable visuomotor skills acquired via imitation learning or heuristics and (3) WildLMa-Planner - an interface of learned skills that allow LLM planners to coordinate skills for long-horizon tasks. We demonstrate the importance of high-quality training data by achieving higher grasping success rate over existing RL baselines using only tens of demonstrations. WildLMa exploits CLIP for language-conditioned imitation learning that empirically generalizes to objects unseen in training demonstrations. Besides extensive quantitative evaluation, we qualitatively demonstrate practical robot applications, such as cleaning up trash in university hallways or outdoor terrains, operating articulated objects, and rearranging items on a bookshelf.

Abstract:
Building LiDAR generative models holds promise as powerful data priors for restoration, scene manipulation, and scalable simulation in autonomous mobile robots. In recent years, approaches using diffusion models have emerged, significantly improving training stability and generation quality. Despite their success, diffusion models require numerous iterations of running neural networks to generate high-quality samples, making the increasing computational cost a potential barrier for robotics applications. To address this challenge, this paper presents R2Flow, a fast and high-fidelity generative model for LiDAR data. Our method is based on rectified flows that learn straight trajectories, simulating data generation with significantly fewer sampling steps compared to diffusion models. We also propose an efficient Transformer-based model architecture for processing the image representation of LiDAR range and reflectance measurements. Our experiments on unconditional LiDAR data generation using the KITTI-360 dataset demonstrate the effectiveness of our approach in terms of both efficiency and quality.

Abstract:
Stereo matching is vital in 3D computer vision, with most algorithms assuming symmetric visual properties between binocular visions. However, the rise of asymmetric multi-camera systems (e.g., tele-wide cameras) challenges this assumption and complicates stereo matching. Visual asymmetry disrupts stereo matching by affecting the crucial cost volume computation. To address this, we explore the matching cost distribution of two established cost volume construction methods in asymmetric stereo. We find that each cost volume experiences distinct information distortion, indicating that both should be comprehensively utilized to solve the issue. Based on this, we propose the two-phase Iterative Volume Fusion network for Asymmetric Stereo matching (IVF-AStereo). Initially, the aggregated concatenation volume refines the correlation volume. Subsequently, both volumes are fused to enhance fine details. Our method excels in asymmetric scenarios and shows robust performance against significant visual asymmetry. Extensive comparative experiments on benchmark datasets, along with ablation studies, confirm the effectiveness of our approach in asymmetric stereo with resolution and color degradation.

Abstract:
The joint optimization of Neural Radiance Fields (NeRF) and camera trajectories has been widely applied in SLAM tasks due to its superior dense mapping quality and consistency. NeRF-based SLAM learns camera poses using constraints by implicit map representation. A widely observed phenomenon that results from the constraints of this form is jerky and physically unrealistic estimated camera motion, which in turn affects the map quality. To address this deficiency of current NeRF-based SLAM, we propose in this paper TS-SLAM (TS for Trajectory Smoothness). It introduces smoothness constraints on camera trajectories by representing them with uniform cubic B-splines with continuous acceleration that guarantees smooth camera motion. Benefiting from the differentiability and local control properties of B-splines, TS-SLAM can incrementally learn the control points end-to-end using a sliding window paradigm. Additionally, we regularize camera trajectories by exploiting the dynamics prior to further smooth trajectories. Experimental results demonstrate that TS-SLAM achieves superior trajectory accuracy and improves mapping quality versus NeRF-based SLAM that does not employ the above smoothness constraints.

Abstract:
Terrain awareness is an essential milestone to enable truly autonomous off-road navigation. Accurately predicting terrain characteristics allows optimizing a vehicle's path against potential hazards. Recent methods use deep neural networks to predict terrain properties in a self-supervised manner, relying on proprioception as a training signal. However, onboard cameras are inherently limited by their point-ofview relative to the ground, suffering from occlusions and vanishing pixel density with distance. This paper introduces a novel approach for self-supervised terrain characterization using an aerial perspective from a hovering drone. We capture terrain-aligned images while sampling the environment with a ground vehicle, effectively training a simple predictor for vibrations, bumpiness, and energy consumption. Our dataset includes 2.8 km of off-road data collected in forest environment, comprising 13484 ground-based images and 12935 aerial images. Our findings show that drone imagery improves terrain property prediction by 21.37% on the whole dataset and 37.35% in high vegetation, compared to ground robot images. We conduct ablation studies to identify the main causes of these performance improvements. We also demonstrate the realworld applicability of our approach by scouting an unseen area with a drone, planning and executing an optimized path on the ground.

Abstract:
This paper explores the feasibility of using magnetoresponsive silicone as the primary mechanism for generating vibrotactile feedback in haptic interfaces. The distinctive feature of this research lies in the integration of magnetoresponsive silicone, a flexible material that responds to electromagnetic fields to produce localized vibrations. Preliminary experiments evaluate the performance of these actuators, focusing on their ability to produce controlled vibrations across a range of frequencies and amplitudes relevant to human tactile perception. Building on this foundation, we introduce the VibroFlex Pad, a haptic interface featuring a magnetoresponsive silicone sheet and an array of electromagnets. The VibroFlex Pad demonstrates its versatility in generating varied tactile effects and simulating dynamic wavelike movements across its surface. To assess the VibroFlex Pad's effectiveness, a user study was conducted, separately evaluating tactile accuracy, overall performance, and user comfort. The findings suggest that the VibroFlex Pad offers reliable and precise vibrotactile feedback, highlighting its potential to enhance wearable haptic technologies and improve the user experience in a variety of applications.

Abstract:
This paper explores automated robotic construction of clocháin2, a type of corbelled rock shelter, traditionally crafted by skilled workers. While robots have been employed for simple dry-stacking tasks in the past, such as construction of stone walls or vertical stone towers, the question of whether robots possess the capacity to construct more functional structures remains unanswered. This study presents a significant step forward in robotic dry-stacking of functional structures: the assembly of natural stones into freestanding clocháin structures. We also present a set of stackability measures to aid stone selection, which significantly improves the stability of the planned structures. Our sequential filtering approach, originally designed for planning stone walls, plays a foundational role in achieving stable clochán construction. Experimental results validate the effectiveness of the stackability measures and demonstrate the physical execution of dry-stacking clocháin, The progress demonstrated in this paper opens the door to robotic construction of a wide range of utility structures in unstructured environments.

Abstract:
Legged locomotion over various terrains is challenging and requires precise perception of the robot and its surroundings from both proprioception and vision. However, learning directly from high-dimensional visual input is often data-inefficient and intricate. To address this issue, traditional methods attempt to learn a teacher policy with access to privileged information first and then learn a student policy to imitate the teacher's behavior with visual input. Despite some progress, this imitation framework prevents the student policy from achieving optimal performance due to the information gap between inputs. Furthermore, the learning process is unnatural since animals intuitively learn to traverse different terrains based on their understanding of the world without privileged knowledge. Inspired by this natural ability, we propose a simple yet effective method, World Model-based Perception (WMP), which builds a world model of the environment and learns a policy based on the world model. We illustrate that though completely trained in simulation, the world model can make accurate predictions of real-world trajectories, thus providing informative signals for the policy controller. Extensive simulated and real-world experiments demonstrate that WMP outperforms state-of-the-art baselines in traversability and robustness. Videos and Code are available at: https://wmp-loco.github.io/.

Abstract:
Research in visual perception has shown that the human visual system utilizes high-level feedback information to guide lower-level processing, enabling adaptation to signals of varying characteristics. Inspired by this, we propose the Feedback multi-Level feature Extractor (Flex) to dynamically adjust feature selection in object detection based on image-wise and instance-level feedback information. This is particularly beneficial for applications such as aerial object detection, UAV-based target recognition and autonomous vehicle navigation, where global image quality issues like sensor degradation, foggy, or rainy conditions can impact detection performance. Flex adapts to variations in image quality, refining the feature extraction process to improve robustness against these challenges. Experimental results demonstrate that Flex consistently enhances a range of state-of-the-art methods on challenging aerial object detection datasets, including DOTA-v1.0, DOTA-v1.5, and HRSC2016. Furthermore, additional experiments on MS COCO confirm the module's effectiveness in general object detection tasks. Our quantitative and qualitative analyses reveal that the improvements are strongly correlated with image quality, aligning with our original motivation to address global image quality issues in real-world scenarios.

Abstract:
In order to obtain a good tactile sensing, traditional dexterous hands always enable all the sensing units installed on them all the time, even if just a few sensor units are actually used, which make the tactile sensing system resource-wasting and energy consuming. In order to reduce their complexities by placing the tactile sensing units only at critical locations, this work proposes an embodied tactile dexterous hand (ET-Hand) and a novel multimodal sensor placement framework that learns multiple tasks to generate optimal placement proposal. Furthermore, our ET-Hand can dynamically adjust the perceived tactile sensor positions, types and numbers during robotic manipulation, providing novel tools and methods for investigating the tactile channels and placement scale required for robot exploration. In the object recognition and slip detection tasks, the results show that our proposed method performs close to or even better than traditional sensing way with large-scale placement.

Abstract:
Autonomous robot exploration requires a robot to efficiently explore and map unknown environments. Compared to conventional methods that can only optimize paths based on the current robot belief, learning-based methods show the potential to achieve improved performance by drawing on past experiences to reason about unknown areas. In this paper, we propose DARE, a novel generative approach that leverages diffusion models trained on expert demonstrations, which can explicitly generate an exploration path through one-time inference. We build DARE upon an attention-based encoder and a diffusion model, and introduce ground truth optimal demonstrations for training to learn better patterns for exploration. The trained planner can reason about the partial belief to recognize the potential structure in unknown areas and consider these areas during path planning. Our experiments demonstrate that DARE achieves on-par performance with both conventional and learning-based state-of-the-art exploration planners, as well as good generalizability in both simulations and real-life scenarios. Our code is available at github.com/marmotlab/DARE.

Abstract:
Visual navigation, a fundamental challenge in mobile robotics, demands versatile policies to handle diverse environments. Classical methods leverage geometric solutions to minimize specific costs, offering adaptability to new scenarios but are prone to system errors due to their multi-modular design and reliance on hand-crafted rules. Learning-based methods, while achieving high planning success rates, face difficulties in generalizing to unseen environments beyond the training data and often require extensive training. To address these limitations, we propose a hybrid approach that combines the strengths of learning-based methods and classical approaches for RGB-only visual navigation. Our method first trains a conditional diffusion model on diverse path-RGB observation pairs. During inference, it integrates the gradients of differentiable scene-specific and task-level costs, guiding the diffusion model to generate valid paths that meet the constraints. This approach alleviates the need for retraining, offering a plug-and-play solution. Extensive experiments in both indoor and outdoor settings, across simulated and real-world scenarios, demonstrate zero-shot transfer capability of our approach, achieving higher success rates and fewer collisions compared to baseline methods. Code will be released at https://github.com/SYSU-RoboticsLab/NaviD.

Abstract:
Navigating unfamiliar environments presents significant challenges for household robots, requiring the ability to recognize and reason about novel decoration and layout. Existing reinforcement learning methods cannot be directly transferred to new environments, as they typically rely on extensive mapping and exploration, leading to time-consuming and inefficient. To address these challenges, we try to transfer the logical knowledge and the generalization ability of pretrained foundation models to zero-shot navigation. By integrating a large vision-language model with a diffusion network, our approach named NavigateDiff constructs a visual predictor that continuously predicts the agent's potential observations in the next step which can assist robots generate robust actions. Furthermore, to adapt the temporal property of navigation, we introduce temporal historical information to ensure that the predicted image is aligned with the navigation scene. We then carefully designed an information fusion framework that embeds the predicted future frames as guidance into goalreaching policy to solve downstream image navigation tasks. This approach enhances navigation control and generalization across both simulated and real-world environments. Through extensive experimentation, we demonstrate the robustness and versatility of our method, showcasing its potential to improve the efficiency and effectiveness of robotic navigation in diverse settings. Project Page: https://21styouth.github.io/NavigateDiff/.

Abstract:
Imitation learning from human demonstrations enables robots to perform complex manipulation tasks and has recently witnessed huge success. However, these techniques often struggle to adapt behavior to new preferences or changes in the environment. To address these limitations, we propose Fine-tuning Diffusion Policy with Human Preference (FDPP). FDPP learns a reward function through preference-based learning. This reward is then used to fine-tune the pre-trained policy with reinforcement learning (RL), resulting in alignment of pre-trained policy with new human preferences while still solving the original task. Our experiments across various robotic tasks and preferences demonstrate that FDPP effectively customizes policy behavior without compromising performance. Additionally, we show that incorporating Kullback-Leibler (KL) regularization during fine-tuning prevents over-fitting and helps maintain the competencies of the initial policy.

Abstract:
We present an algorithmic framework for a multi-robot modular assembly system. Motivated by the prospects of in-space assembly, we focus on the NASA Automated Reconfigurable Mission Adaptive Digital Assembly Systems (AR-MADAS) framework, in which multiple types of robots work together in a team to build large structures. Unlike with other multi-robot construction systems, the geometry of structures that ARMADAS robots can build is not limited to the class of histogram shapes. To address the intractability of path planning for a robot system with the exponentially growing number of dimensions, we present a decoupled planning approach, where the assembly and path planning is performed iteratively by one robot team at a time. We present a number of data structures which help us avoid collisions and deadlocks in the resulting robot schedule.

Abstract:
Unordered grasping in industrial robotic manipulation requires precise six-degree-of-freedom (6D) pose estimation. However, existing methods often struggle with unknown objects and require retraining, limiting their practicality. Traditional 3D point-pair feature methods, while training-free, perform poorly with textured symmetric objects. We propose a generalizable approach for zero-shot 6 D pose estimation without retraining. Our method consists of two steps: generating CAD-based templates through real-time rendering for coarse pose estimation, and refining poses using semantic point-pair features aligned with the camera viewpoint. We conducted experiments on seven core datasets from the Benchmark for 6D Object Pose Estimation (BOP) challenge, and the results are publicly available on the BOP website. Integration into a robotic grasping system further highlights its high precision and fast execution, making it ideal for applications such as bin-picking. (GZS6D-BP) https://bop.felk.cvut.cz/leaderboards/.

Abstract:
The scale and diversity of demonstration data required for imitation learning is a significant challenge. We present EgoMimic, a full-stack framework which scales manipulation via human embodiment data, specifically egocentric human videos paired with 3D hand tracking. EgoMimic achieves this through: (1) a system to capture human embodiment data using the ergonomic Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on human and robot data. Compared to prior works that only extract high-level intent from human videos, our approach treats human and robot data equally as embodied demonstration data and learns a unified policy from both data sources. EgoMimic achieves significant improvement on a diverse set of long-horizon, single-arm and bimanual manipulation tasks over state-of-the-art imitation learning methods and enables generalization to entirely new scenes. Finally, we show a favorable scaling trend for EgoMimic, where adding 1 hour of additional hand data is significantly more valuable than 1 hour of additional robot data. Videos and additional information can be found at https://egomimic.github.io/

Abstract:
Mobile manipulation typically entails the base for mobility, the arm for accurate manipulation, and the camera for perception. The principle of Distant Mobility, Close Grasping(DMCG) is essential for holistic control. We propose Embod-ied Holistic Control for Mobile Manipulation(EHC-MM) with the embodied function of sig( \omega ): By formulating the DMCG principle as a Quadratic Programming (QP) problem, sig( \omega ) dynamically balances the robot's emphasis between movement and manipulation with the consideration of the robot's state and environment. In addition, we propose the Monitor-Position-Based Servoing (MPBS) with sig( \omega ), enabling the tracking of the target during the operation. This approach enables coordinated control among the robot's base, arm, and camera, enhancing task efficiency. Through extensive simulations and real-world experiments, our approach significantly improves both the success rate and efficiency of mobile manipulation tasks, achieving a 95.6% success rate in real-world scenarios and a 52.8% increase in time efficiency.

Abstract:
Learning control policies using deep reinforcement learning has shown great success for a variety of applications, including robotics and automated driving. A key area limiting the adaptation of RL in the real world is the lack of trust in the decision-making process of such policies. Therefore, explainability is a requirement of any RL agent operating in the real world. In this work, we propose a family of control policies that are explainable-by-design regarding individual observation components on object-based scene representations. By estimating diagonal squashed Gaussian and categorical mixture distributions on sub-spaces of the decomposed observations, we develop stochastic policies with easy-to-read explanations of the decision-making process. Our design is generally applicable to any RL algorithm using stochastic policies. We showcase the explainability on an extensive suite of single-and multi-agent simulations, set-and sequence-based high-level scenes, and discrete and continuous action spaces, with performance at least on-par or better compared to standard policy architectures. In additional experiments, we analyze the robustness of our approach to its single additional hyper-parameter and examine its potential for very low computational requirements with tiny policies.

Abstract:
The multi-objective coverage control problem requires a robot swarm to collaboratively provide sensor coverage to multiple heterogeneous importance density fields (IDFs) simultaneously. We pose this as an optimization problem with constraints and study two different formulations: (1) Fair coverage, where we minimize the maximum coverage cost for any field, promoting equitable resource distribution among all fields; and (2) Constrained coverage, where each field must be covered below a certain cost threshold, ensuring that critical areas receive adequate coverage according to predefined importance levels. We study the decentralized setting where robots have limited communication and local sensing capabilities, making the system more realistic, scalable, and robust. Given the complexity, we propose a novel decentralized constrained learning approach that combines primal-dual optimization with a Learnable Perception-Action-Communication (LPAC) neural network architecture. We show that the Lagrangian of the dual problem can be reformulated as a linear combination of the IDFs, enabling the LPAC policy to serve as a primal solver. We empirically demonstrate that the proposed method (i) significantly outperforms state-of-the-art decentralized controllers by 30% on average in terms of coverage cost, (ii) transfers well to larger environments with more robots, and (iii) is scalable in the number of IDFs and robots in the swarm.

Abstract:
In this paper, we implement the GPU-accelerated subsystem-based Alternating Direction Method of Multipliers (SubADMM) for interactive simulation. The challenging objective for interactive simulations is to deliver realistic results under tight performance, even for large-scale scenarios. We aim to achieve this by exploiting the parallelizable nature of SubADMM to the fullest extent. We introduce a new subsystem division strategy to make SubADMM ‘GPU friendly' along with custom kernel designs and optimization regarding efficient memory access patterns. We successfully implement the GPUaccelerated SubADMM and show the accuracy and speed of the framework for large-scale scenarios, highlighted with an interactive ‘Hand demo’ scenario. We also show improved robustness and accuracy compared to other state-of-the-art interactive simulators with several challenging scenarios that introduce large-scale ill-conditioned dynamics problems.

Abstract:
In decentralized multiagent trajectory planners, agents need to communicate and exchange their positions to generate collision-free trajectories. However, due to localization errors/uncertainties, trajectory deconfliction can fail even if trajectories are perfectly shared between agents. To address this issue, we first present PARM and PARM, perception-aware, decentralized, asynchronous multiagent trajectory planners that enable a team of agents to navigate uncertain environments while deconflicting trajectories and avoiding obstacles using perception information. PARM differs from PARM as it is less conservative, using more computation to find closer-to-optimal solutions. While these methods achieve state-of-the-art performance, they suffer from high computational costs as they need to solve large optimization problems onboard, making it difficult for agents to replan at high rates. To overcome this challenge, we present our second key contribution, PRIMER, a learning-based planner trained with imitation learning (IL) using PARM as the expert demonstrator. PRIMER leverages the low computational requirements at deployment of neural networks and achieves a computation speed up to 5614 times faster than optimization-based approaches.

Abstract:
Planning contact-rich interactions for multi-finger manipulation is challenging due to the high-dimensionality and hybrid nature of dynamics. Recent advances in data-driven methods have shown promise, but are sensitive to the quality of training data. Combining learning with classical methods like trajectory optimization and search adds additional structure to the problem and domain knowledge in the form of constraints, which can lead to outperforming the data on which models are trained. We present Diffusion-Informed Probabilistic Contact Search (DIPS), which uses an A search to plan a sequence of contact modes informed by a diffusion model. We train the diffusion model on a dataset of demonstrations consisting of contact modes and trajectories generated by a trajectory optimizer given those modes. In addition, we use a particle filter-inspired method to reason about variability in diffusion sampling arising from model error, estimating likelihoods of trajectories using a learned discriminator. We show that our method outperforms ablations that do not reason about variability and can plan contact sequences that outperform those found in training data across multiple tasks. We evaluate on simulated tabletop card sliding and screwdriver turning tasks, as well as the screwdriver task in hardware to show that our combined learning and planning approach transfers to the real world.

Abstract:
Robots in the real world need to perceive and move to goals in complex environments without collisions. Avoiding collisions is especially difficult when relying on sensor perception and when goals are among clutter. Diffusion policies and other generative models have shown strong performance in solving local planning problems, but often struggle at avoiding all of the subtle constraint violations that characterize truly challenging global motion planning problems. In this work, we propose an approach for learning global motion planning using diffusion policies, allowing the robot to generate full trajectories through complex scenes and reasoning about multiple obstacles along the path. Our approach uses cascaded hierarchical models which unify global prediction and local refinement together with online plan repair to ensure the trajectories are collision free. Our method outperforms (\approx 5 %) a wide variety of baselines on challenging tasks in multiple domains including navigation and manipulation.

Abstract:
This paper introduces an noval autonomous mobile robot, IOMT, designed for indoor and outdoor multi-terrain environments, with a particular focus on stair-climbing capabilities. The robot features a four-wheel independent drive and steering system (4WID-4WIS), allowing it to maintain high maneuverability on smooth surfaces. Furthermore, based on reducing the complexity of the control, the IOMT addresses the challenges associated with stair climbing by controlling the stable pitch angle, effectively reducing the impact of stairs on the robot's posture, such as the pitch angle minimized to an angle smaller than the inclination of the stairs. The design also incorporates a special mechanism that reduces energy consumption through its worm gear system with self-locking characteristics, and combines steering with shock absorption to simplify both the mechanism complexity. This paper not only briefly proposes a stair climbing strategy for the IOMT, but also explores the impact of various design parameters on the robot pitch angle, ultimately validating the feasibility of the design for the stair climbing ability.

Abstract:
Recently, quadrupedal locomotion has achieved significant success, but their manipulation capabilities, particularly in handling large objects, remain limited, restricting their usefulness in demanding real-world applications such as search and rescue, construction, industrial automation, and room or-ganization. This paper tackles the task of obstacle-aware, long-horizon pushing by multiple quadrupedal robots. We propose a hierarchical multi-agent reinforcement learning framework with three levels of control. The high-level controller integrates an RRT planner and a centralized adaptive policy to generate subgoals, while the mid-level controller uses a decentralized goal-conditioned policy to guide the robots toward these sub-goals. A pre-trained low-level locomotion policy executes the movement commands. We evaluate our method against several baselines in simulation, demonstrating significant improvements over baseline approaches, with 36.0% higher success rates and 24.5% reduction in completion time than the best baseline. Our framework successfully enables long-horizon, obstacle-aware manipulation tasks like Push-Cuboid and Push-Ton Gol robots in the real world.

Abstract:
We propose visual-inertial simultaneous localization and mapping that tightly couples sparse reprojection errors, inertial measurement unit pre-integrals, and relative pose factors with dense volumetric occupancy mapping. Hereby depth predictions from a deep neural network are fused in a fully probabilistic manner. Specifically, our method is rigorously uncertainty-aware: first, we use depth and uncertainty predictions from a deep network not only from the robot's stereo rig, but we further probabilistically fuse motion stereo that provides depth information across a range of baselines, therefore drastically increasing mapping accuracy. Next, predicted and fused depth uncertainty propagates not only into occupancy probabilities but also into alignment factors between generated dense submaps that enter the probabilistic non-linear least squares estimator. This submap representation offers globally consistent geometry at scale. Our method is thoroughly evaluated in two benchmark datasets, resulting in localization and mapping accuracy that exceeds the state of the art, while simultaneously offering volumetric occupancy directly usable for downstream robotic planning and control in real-time.

Abstract:
Adversarial attacks have been recently investigated in LiDAR perception problems for autonomous driving, where a small perturbation of source inputs can result in incorrect predictions. However, most previous studies focus on attacks on single-frame perception modules, lacking explorations of attacks on consecutive-frame tasks, i.e. the LiDAR odometry. In this paper, we propose a gradient optimization-based adversarial attack towards deep LiDAR odometry networks. To generate point clouds consistent with real-world scenarios, we constrain adversarial points within the range of a small object, e.g. a traffic cone, and render new points to simulate real LiDAR measurements. By incorporating such adversarial points in consecutive frames, we demonstrate a significant decrease in pose estimation accuracy of current popular LiDAR odometry networks. In addition, we also evaluate traditional geometric odometry approaches and report their robustness against adversarial points. Extensive experiments on the KITTI and Waymo datasets illustrate the effectiveness of the proposed attack method and the vulnerability of deep LiDAR odometry networks against adversarial points.

Abstract:
There is significant interest in creating compliant modular robots that can change their volume. Inspired by how biological cells move, these systems can potentially combine the resilience of modular robotics with the increased environmental interactions of soft robotics. However, current versions have limited speed, expansion, and portability. In this paper, we address these concerns through AuxSwarm, a compliant system composed of auxetic-based robotic voxels. These voxels control their volume through a scissor-like bi-layer auxetic design, growing up to 1.57 times their original size in 0.2 seconds. This combination of speed and expansion is unique across modular soft robots, enabling dynamic locomotion capabilities. We characterize the voxels and demonstrate the versatility of this approach through case studies of 2D bending and 3D cube flipping. AuxSwarm provides a first step towards addressable voxel-based smart materials, while simultaneously addressing the robustness and actuation challenges faced by soft robots.

Abstract:
This paper presents the key principles of eViper-2D - a thin large-area soft robotics platform - as a new development of the previous extendable Vibrating Intelligent Piezo-Electric Robot (eViper) platform. We first introduce the mechanical, electrical, and control framework of eViper-2D, and then develop systematic and scalable methods to study the impact of diverse actuation patterns on robotic motion dynamics and energy efficiency. By integrating power electronics, communication circuits, piezoelectric actuators, and batteries onboard, the eViper-2D platform enables rapid design iteration and quick evaluation of different control strategies for the multi-actuator soft robot. The platform supports data-driven modeling via automated data acquisition. We show that eViper-2D can provide rich insights into optimizing actuation patterns to achieve agile motion and minimal cost of transport (COT).

Abstract:
Diffuse Reflectance Spectroscopy (DRS) is a wellestablished optical technique for tissue composition assessment which has been clinically evaluated for tumour detection to ensure the complete removal of cancerous tissue. While pointwise assessment has many potential applications, incorporating automated large-area scanning would enable holistic tissue sampling with higher consistency. We propose a robotic system to facilitate autonomous DRS scanning with hybrid visual servoing control. A specially designed height compensation module enables precise contact condition control. The evaluation results show that the system can accurately execute the scanning command and acquire consistent DRS spectra with comparable results to the manual collection, which is the current gold standard protocol. Integrating the proposed system into surgery lays the groundwork for autonomous intra-operative DRS tissue assessment with high reliability and repeatability. This could reduce the need for manual scanning by the surgeon while ensuring complete tumor removal in clinical practice.

Abstract:
Robotic assembly remains a significant challenge due to complexities in visual perception, functional grasping, contact-rich manipulation, and performing high-precision tasks. Simulation-based learning and sim-to-real transfer have led to recent success in solving assembly tasks in the presence of object pose variation, perception noise, and control error; however, the development of a generalist (i.e., multi-task) agent for a broad range of assembly tasks has been limited by the need to manually curate assembly assets, which greatly constrains the number and diversity of assembly problems that can be used for policy learning. Inspired by recent success of using generative AI to scale up robot learning, we propose Match-Maker, a pipeline to automatically generate diverse, simulation-compatible assembly asset pairs to facilitate learning assembly skills. Specifically, MatchMaker can 1) take a simulation-incompatible, interpenetrating asset pair as input, and automatically convert it into a simulation-compatible, interpenetration-free pair, 2) take an arbitrary single asset as input, and generate a geometrically-mating asset to create an asset pair, 3) automatically erode contact surfaces from (1) or (2) according to a user-specified clearance parameter to generate realistic parts. We demonstrate that data generated by MatchMaker outperforms previous work in terms of diversity and effectiveness for downstream assembly skill learning. Project page: https://wangyian-me.github.io/MatchMaker/.

Abstract:
Visual servo based on traditional image matching methods often requires accurate keypoint correspondence for high precision control. However, keypoint detection or matching tends to fail in challenging scenarios with inconsistent illuminations or textureless objects, resulting significant performance degradation. Previous approaches, including our proposed Correspondence encoded Neural image Servo policy (CNS), attempted to alleviate these issues by integrating neural control strategies. While CNS shows certain improvement against error correspondence over conventional image-based controllers, it could not fully resolve the limitations arising from poor keypoint detection and matching. In this paper, we continue to address this problem and propose a new solution: Probabilistic Correspondence Encoded Neural Image Servo (CNSv2). CNSv2 leverages probabilistic feature matching to improve robustness in challenging scenarios. By redesigning the architecture to condition on multimodal feature matching, CNSv2 achieves high precision, improved robustness across diverse scenes and runs in real-time. We validate CNSv2 with simulations and real-world experiments, demonstrating its effectiveness in overcoming the limitations of detector-based methods in visual servo tasks.

Abstract:
Recent studies have successfully integrated large vision-language models (VLMs) into low-level robotic control by supervised fine-tuning (SFT) with expert robotic datasets, resulting in what we term vision-language-action (VLA) models. Although the VLA models are powerful, how to improve these large models during interaction with environments remains an open question. In this paper, we explore how to further improve these VLA models via Reinforcement Learning (RL), a commonly used fine-tuning technique for large models. However, we find that directly applying online RL to large VLA models presents significant challenges, including training instability that severely impacts the performance of large models, and computing burdens that exceed the capabilities of most local machines. To address these challenges, we propose iRe-VLA framework, which iterates between Reinforcement Learning and Supervised Learning to effectively improve VLA models, leveraging the exploratory benefits of RL while maintaining the stability of supervised learning. Experiments in two simulated benchmarks and a real-world manipulation suite validate the effectiveness of our method.

Abstract:
Correspondence identification (CoID) is an essential capability in multi-robot collaborative perception, which enables a group of robots to consistently refer to the same objects within their respective fields of view. In real-world applications, such as connected autonomous driving, vehicles face challenges in directly sharing raw observations due to limited communication bandwidth. In order to address this challenge, we propose a novel approach for bandwidth-adaptive spatiotemporal CoID in collaborative perception. This approach allows robots to progressively select partial spatiotemporal observations and share with others, while adapting to communication constraints that dynamically change over time. We evaluate our approach across various scenarios in connected autonomous driving simulations. Experimental results validate that our approach enables CoID and adapts to dynamic communication bandwidth changes. In addition, our approach achieves 8%-56% overall improvements in terms of covisible object retrieval for CoID and data sharing efficiency, which outperforms previous techniques and achieves the state-of-the-art performance. More information is available at: https://gaopeng5.github.io/acoid.

Abstract:
In the field of collaborative robotics, the ability to communicate spatial information like planned trajectories and shared environment information is crucial. When no global position information is available (e.g., indoor or GPS-denied environments), agents must align their coordinate frames before shared spatial information can be properly expressed and interpreted. Coordinate frame alignment is particularly difficult when robots have no initial alignment and are affected by odometry drift. To this end, we develop a novel multiple hypothesis algorithm, called TCAFF, for aligning the coordinate frames of neighboring robots. TCAFF considers potential alignments from associating sparse open-set object maps and leverages temporal consistency to determine an initial alignment and correct for drift, all without any initial knowledge of neighboring robot poses. We demonstrate TCAFF being used for frame alignment in a collaborative object tracking application on a team of four robots tracking six pedestrians and show that TCAFF enables robots to achieve a tracking accuracy similar to that of a system with ground truth localization. The code and hardware dataset are available at https://github.com/mit-acl/tcaff.

Abstract:
Nowadays, unmanned aerial vehicles (UAVs) are commonly used in search and rescue scenarios to gather information in the search area. The automatic identification of the person searched for in aerial footage could increase the autonomy of such systems, reduce the search time, and thus increase the missed person's chances of survival. In this paper, we present a novel approach to perform semantically conditioned open vocabulary object tracking that is specifically designed to cope with the limitations of UAV hardware. Our approach has several advantages: It can run with verbal descriptions of the missing person, e.g., the color of the shirt, it does not require dedicated training to execute the mission, and can efficiently track a potentially moving person. Our experimental results demonstrate the versatility and efficacy of our approach. We publish the methods source code at https://github.com/utn-blei/CloudTrack.

Abstract:
We present a framework for learning a single policy capable of producing all quadruped gaits and transitions. The framework consists of a policy trained with deep reinforcement learning (DRL) to modulate the parameters of a system of abstract oscillators (i.e. Central Pattern Generator), whose output is mapped to joint commands through a pattern formation layer that sets the gait style, i.e. body height, swing foot ground clearance height, and foot offset. Different gaits are formed by changing the coupling between different oscillators, which can be instantaneously selected at any velocity by a user. With this framework, we systematically investigate which gait should be used at which velocity, and when gait transitions should occur from a Cost of Transport (COT), i.e. energy-efficiency, point of view. Additionally, we note how gait style changes as a function of locomotion speed for each gait to keep the most energy-efficient locomotion. While the currently most popular gait (trot) does not result in the lowest COT, we find that considering different co-dependent metrics such as mean base angular velocity and joint acceleration result in different 'optimal' gaits than those that minimize COT. We deploy our controller in various hardware experiments, focusing on 9 quadruped animal gaits, and demonstrate generalizability to novel and unseen gaits during training, and robustness to leg failures.

Abstract:
Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, where tactile sensors induce attractive forces, and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to the visual models, our approach overcomes some drastic failure modes while tracking the in-hand object pose. In our experiments, our approach shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose.

Abstract:
Reactive collision avoidance is essential for agile robots navigating complex and dynamic environments, enabling real-time obstacle response. However, this task is inherently challenging because it requires a tight integration of perception, planning, and control, which traditional methods often handle separately, resulting in compounded errors and delays. This paper introduces a novel approach that unifies these tasks into a single reactive framework using solely onboard sensing and computing. Our method combines nonlinear model predictive control with adaptive control barrier functions, directly linking perception-driven constraints to real-time planning and control. Constraints are determined by using a neural network to refine noisy RGB-D data, enhancing depth accuracy, and selecting points with the minimum time-to-collision to prioritize the most immediate threats. To maintain a balance between safety and agility, a heuristic dynamically adjusts the optimization process, preventing overconstraints in real time. Extensive experiments with an agile quadrotor demonstrate effective collision avoidance across diverse indoor and outdoor environments, without requiring environment-specific tuning or explicit mapping.

Abstract:
In this article, an optimal controller for achieving constrained admittance control is proposed. This controller strictly adheres to the constraint boundaries while ensuring minimal variations in kinematic energy. The proposed method integrates admittance control for human-robot interaction with the Udwadia-Kalaba equations for constrained motion into a unified framework. The proposed architecture has been tested and validated both with simulations and real tests on a 6-DoF UR5e robot. The results demonstrate that the proposed architecture outperforms virtual fixtures, one of the most commonly used techniques to implement effective path-following control.

Abstract:
While tactile sensing is widely accepted as an important and useful sensing modality, its use pales in comparison to other sensory modalities like vision and proprioception. AnySkin addresses the critical challenges that impede the use of tactile sensing - versatility, replaceability, and data reusability. Building on the simplistic design of ReSkin, and decoupling the sensing electronics from the sensing interface, AnySkin simplifies integration making it as straightforward as putting on a phone case and connecting a charger. Furthermore, AnySkin is the first uncalibrated tactile-sensor to report crossinstance generalizability of learned manipulation policies. To summarize, this work makes three key contributions: first, we introduce a streamlined fabrication process and a design tool for creating an adhesive-free, durable and easily replaceable magnetic tactile sensor; second, we characterize slip detection and policy learning with the AnySkin sensor; third, we demonstrate zero-shot generalization of models trained on one instance of AnySkin to new instances, and compare it with popular existing tactile solutions like DIGIT and ReSkin. Code, design files, and videos of policy experiments can be found on https://any-skin.github.io

Abstract:
Acoustic transmission, or sound, can effectively communicate information over distances through various media. We focus on generating acoustic transmission using pneumatically driven resonators for wireless tactile sensing without the need for any electronics at the end-effector or contact point. We explore the relationship between emitted frequency and the geometry of the resonance chamber. When a normal compressive force is applied to the end cap, the compliant resonant cavity deforms, leading to an increase in frequency measurable by an external microphone. Prior work uses tube resonators with fipple attachments. In the present work, we study whether a different smaller audible cylindrical resonator with air blown across the entryway can be utilized instead. We test the utility of the Helmholtz resonator model in predicting the experimental frequency response. Resonance is often modeled for rigid cavities, presenting unique challenges in predicting resonance for the design of soft resonating taxels.

Abstract:
Autonomous robot navigation in off-road environments presents a number of challenges due to its lack of structure, making it difficult to handcraft robust heuristics for diverse scenarios. While learned methods using hand labels or self-supervised data improve generalizability, they often require a tremendous amount of data and can be vulnerable to domain shifts. To improve generalization in novel environments, recent works have incorporated adaptation and self-supervision to develop autonomous systems that can learn from their own experiences online. However, current works often rely on significant prior data, for example minutes of human teleoperation data for each terrain type, which is difficult to scale with more environments and robots. To address these limitations, we propose SALON, a perception-action framework for fast adaptation of traversability estimates with minimal human input. SALON rapidly learns online from experience while avoiding out of distribution terrains to produce adaptive and risk-aware cost and speed maps. Within seconds of collected experience, our results demonstrate comparable navigation performance over kilometer-scale courses in diverse off-road terrain as methods trained on 100-1000x more data. We additionally show promising results on significantly different robots in different environments. Our code is available at https://theairlab.org/SALON

Abstract:
In recent years, Embodied Artificial Intelligence (Embodied AI) has advanced rapidly, yet the increasing size of models conflicts with the limited computational capabilities of Embodied AI platforms. To address this challenge, we aim to achieve both high model performance and practical deployability. Specifically, we focus on Vision-and-Language Navigation (VLN), a core task in Embodied AI. This paper introduces a two-stage knowledge distillation framework, producing a student model, MiniVLN, and showcasing the significant potential of distillation techniques in developing lightweight models. The proposed method aims to capture fine-grained knowledge during the pretraining phase and navigation-specific knowledge during the fine-tuning phase. Our findings indicate that the two-stage distillation approach is more effective in narrowing the performance gap between the teacher model and the student model compared to single-stage distillation. On the public R2R and REVERIE benchmarks, MiniVLN achieves performance on par with the teacher model while having only about 12 % of the teacher model's parameter count.

Abstract:
[1] Visual Simultaneous Localization and Mapping (SLAM) helps robots estimate their poses and perceive the environment in unknown settings. Recent work has demonstrated that implicit neural radiance fields and 3D Gaussian Splatting (3DGS) offer higher fidelity scene representation than traditional map representations. We propose VSS-SLAM, which utilizes voxelized surfels as the map representation for incremental mapping in unknown environments. This representation effectively addresses the issue of redundant and disordered primitives encountered in previous methods, thereby enhancing geometric accuracy during reconstruction. Specifically, our approach divides the scene using voxels and stores geometric and appearance information in feature vectors at the voxel vertices. Before rendering, these feature vectors are decoded to generate the corresponding surfels. Additionally, we align camera poses through image and depth rendering. Extensive experiments on the Replica and TUM-RGBD datasets demonstrate that VSS-SLAM delivers high-fidelity reconstruction and accurate pose estimation in both simulated and real-world environments. Source code will soon be available.

Abstract:
This paper presents a three-dimensional discrete search path planner for fixed-wing aircraft emergency landing planning that manages state-space complexity by incorporating cost gradients to assure descent flight path angle and runway heading alignment constraints are met. Our approach incorporates steady wind and maximizes margin from flight envelope boundaries to accommodate wind variation in a manner commensurate with a loss of thrust condition. A novel multi-objective cost function that combines gradient-based path guidance and population risk metrics is implemented to efficiently enable discrete search to find a robust solution. The proposed method is demonstrated through use cases with population data for a region of Long Island, New York that highlight our algorithm's effectiveness.

Abstract:
This paper develops a Bayesian optimal experimental design for robot kinematic calibration on \mathbbS^3 × \mathbbR^3. Our method builds upon a Gaussian process approach that incorporates a geometry-aware kernel based on Riemannian Matérn kernels over \mathbbS^3. To learn the forward kinematics errors via Bayesian optimization with a Gaussian process, we define a geodesic distance-based objective function. Pointwise values of this function are sampled via noisy measurements taken using fiducial markers on the end-effector using a camera and computed pose with the nominal kinematics. The corrected Denavit-Hartenberg parameters are obtained using an efficient quadratic program that operates on the collected data sets. The effectiveness of the proposed method is demonstrated via simulations and calibration experiments on NASA's ocean world lander autonomy testbed (OWLAT).

Abstract:
Control Lyapunov functions are traditionally used to design a controller which ensures convergence to a desired state, yet deriving these functions for nonlinear systems remains a complex challenge. This paper presents a novel, sample-efficient method for neural approximation of nonlinear Lyapunov functions, leveraging self-supervised Reinforcement Learning (RL) to enhance training data generation, particularly for inaccurately represented regions of the state space. The proposed approach employs a data-driven World Model to train Lyapunov functions from off-policy trajectories. The method is validated on both standard and goal-conditioned robotic tasks, demonstrating faster convergence and higher approximation accuracy compared to the state-of-the-art neural Lyapunov approximation baseline. The code is available at: https://github.com/CAV-Research-Lab/SACLA.git

Abstract:
This paper proposes a novel control method for an autonomous wheel loader, enabling time-efficient navigation to an arbitrary goal pose. Unlike prior works which combine high-level trajectory planners with Model Predictive Control (MPC), we directly enhance the planning capabilities of MPC by incorporating a cost function derived from Actor-Critic Reinforcement Learning (RL). Specifically, we first train an RL agent to solve the pose reaching task in simulation, then transfer the learned planning knowledge to an MPC by incorporating the trained neural network critic as both the stage and terminal cost. We show through comprehensive simulations that the resulting MPC inherits the time-efficient behavior of the RL agent, generating trajectories that compare favorably against those found using trajectory optimization. We also deploy our method on a real-world wheel loader, where we demonstrate successful navigation in various scenarios.

Abstract:
This work introduces a framework to diagnose the strengths and shortcomings of Autonomous Vehicle (AV) collision avoidance technology with synthetic yet realistic potential collision scenarios adapted from real-world, collision-free data. Our framework generates counterfactual collisions with diverse crash properties, e.g., crash angle and velocity, between an adversary and a target vehicle by adding perturbations to the adversary's predicted trajectory from a learned AV behavior model. Our main contribution is to ground these adversarial perturbations in realistic behavior as defined through the lens of data-alignment in the behavior model's parameter space. Then, we cluster these synthetic counterfactuals to identify plausible and representative collision scenarios to form the basis of a test suite for downstream AV system evaluation. We demonstrate our framework using two state-of-the-art behavior prediction models as sources of realistic adversarial perturbations, and show that our scenario clustering evokes interpretable failure modes from a baseline AV policy under evaluation.

Abstract:
There is now a large body of techniques, many based on formal methods, for describing and realizing complex robotics tasks, including those involving a variety of rich goals and time-extended behavior. This paper explores the limits of what sorts of tasks are specifiable, examining how the precise grounding of specifications -that is, whether the specification is given in terms of the robot's states, its actions and observations, its knowledge, or some other information-is crucial to whether a given task can be specified. While prior work included some description of particular choices for this grounding, our contribution treats this aspect as a first-class citizen: we introduce notation to deal with a large class of problems, and examine how the grounding affects what tasks can be posed. The results demonstrate that certain classes of tasks are specifiable under different combinations of groundings.

Abstract:
Robots in the real world experience wear and tear, leading to changing system dynamics. This challenge is particularly exacerbated for non-rigid systems such as soft robots or robotic systems made of metamaterials with hysteresis. This setting results in a challenging problem for most learning-based controllers that typically rely on the assumption that the system dynamics remain fixed over time. In the absence of explicit mechanisms to account for this change in dynamics, learning-based control algorithms show considerable degradation in performance over time. In this work, we consider a particular class of dynamics shift in under-actuated systems, that is localized to the dynamics of the fully actuated robot itself, while independently leaving the dynamics of the environment unchanged. This captures real-world phenomena such as fatigue or hysteresis in robotic systems. In this setting, we propose an efficient algorithm that can account for dynamics shift. Using a simple calibration procedure, we propose a technique for learning a non-linear “action-translation” model that can capture the localized shift in dynamics. This enables continual learning and transfer despite considerable dynamics shift during the learning process. We demonstrate the efficacy of this procedure on several tasks in simulation, as well as a real-world robotic system - a 4 DoF electrically driven handed shearing auxetic (HSA) platform.

Abstract:
Forests are vital to our ecosystems, acting as carbon sinks, climate stabilizers, biodiversity centers, and wood sources. Due to their scale, monitoring and managing forests takes a lot of work. Forestry robotics offers the potential for enabling efficient and sustainable foresting practices through automation. Despite increasing interest in this field, the scarcity of robotics datasets and benchmarks in forest environments is hampering progress in this domain. In this paper, we present a real-world, longitudinal dataset for forestry robotics that enables the development and comparison of approaches for various relevant applications, ranging from semantic interpretation to estimating traits relevant to forestry management. The dataset consists of multiple recordings of the same plots in a forest in Switzerland during three different growth periods. We recorded the data with a mobile 3D LiDAR scanning setup. Additionally, we provide semantic annotations of trees, shrubs, and ground, instance-level annotations of trees, as well as more fine-grained annotations of tree stems and crowns. Furthermore, we provide reference field measurements of traits relevant to forestry management for a subset of the trees. Together with the data, we also provide open-source baseline panoptic segmentation and tree trait estimation approaches to enable the community to bootstrap further research and simplify comparisons in this domain.

Abstract:
Efficient, collision-free motion planning is essential for automating large-scale manipulators like timber cranes. They come with unique challenges such as hydraulic actuation constraints and passive joints-factors that are seldom addressed by current motion planning methods. This paper introduces a novel approach for time-optimal, collision-free hybrid motion planning for a hydraulically actuated timber crane with passive joints. We enhance the via-point-based stochastic trajectory optimization (VP-STO) algorithm to in-clude pump flow rate constraints and develop a novel collision cost formulation to improve robustness. The effectiveness of the enhanced VP-STO as an optimal single-query global planner is validated by comparison with an informed RRT algorithm using a time-optimal path parameterization (TOPP). The over-all hybrid motion planning is formed by combination with a gradient-based local planner that is designed to follow the global planner's reference and to systematically consider the passive joint dynamics for both collision avoidance and sway damping.

Abstract:
We demonstrate the capabilities of an attentionbased end-to-end approach for high-speed vision-based quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art learning architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional model-based approaches to navigation via independent perception, mapping, planning, and control modules breaks down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end vision-to-control networks have shown to have great potential for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer (ViT) models for depth image-to-control in high-fidelity simulation, observing that ViT models are more effective than others as quadrotor speeds increase and in generalization to unseen environments, while the addition of recurrence further improves performance while reducing quadrotor energy cost across all tested flight speeds. We assess performance at speeds of up to 7m/s in simulation and hardware. To the best of our knowledge, this is the first work to utilize vision transformers for end-to-end vision-based quadrotor control.

Abstract:
An inherent fragility of quadrotor systems stems from model inaccuracies and external disturbances. These factors hinder performance and compromise the stability of the system, making precise control challenging. Existing model-based approaches either make deterministic assumptions, utilize Gaussian-based representations of uncertainty, or rely on nominal models, all of which often fall short in capturing the complex, multimodal nature of real-world dynamics. This work introduces DroneDiffusion, a novel framework that leverages conditional diffusion models to learn quadrotor dynamics, formulated as a sequence generation task. DroneDiffusion achieves superior generalization to unseen, complex scenarios by capturing the temporal nature of uncertainties and mitigating error propagation. We integrate the learned dynamics with an adaptive controller for trajectory tracking with stability guarantees. Extensive experiments in both simulation and real-world flights demonstrate the robustness of the framework across a range of scenarios, including unfamiliar flight paths and varying payloads, velocities, and wind disturbances. Project page: https://sites.google.com/view/dronediffusion.

Abstract:
We propose Graph2Nav, a real-time 3D object-relation graph generation framework, for autonomous navigation in the real world. Our framework fully generates and exploits both 3D objects and a rich set of semantic relationships among objects in a 3D layered scene graph, which is applicable to both indoor and outdoor scenes. It learns to generate 3D semantic relations among objects, by leveraging and advancing state-of-the-art 2D panoptic scene graph works into the 3D world via 3D semantic mapping techniques. This approach avoids previous training data constraints in learning 3D scene graphs directly from 3D data. We conduct experiments to validate the accuracy in locating 3D objects and labeling object-relations in our 3D scene graphs. We also evaluate the impact of Graph2Nav via integration with SayNav, a state-of-the-art planner based on large language models, on an unmanned ground robot to object search tasks in real environments. Our results demonstrate that modeling object relations in our scene graphs improves search efficiency in these navigation tasks.

Abstract:
Shortest-path roadmaps, also known as reduced visibility graphs, provide a highly efficient multi-query method for computing optimal paths in two-dimensional environments. Combined with Minkowski sum computations, shortest-path roadmaps can compute optimal paths for a translating robot in 2D. In this study, we explore the intuitive idea of stacking up a set of reduced visibility graphs at different orientations for a polygonal holonomic robot to support the fast computation of near-optimal paths, allowing simultaneous 2D translation and rotation. The resulting algorithm, rotation-stacked visibility graph (RVG), is shown to be resolution-complete and asymptotically optimal. Extensive computational experiments show RVG significantly outperforms state-of-the-art single- and multiquery sampling-based methods on both computation time and solution optimality fronts. Source code and supplementary materials are available at https://github.com/arc-1/rvg.

Abstract:
Teams of people coordinate to perform complex tasks by forming abstract mental models of world and agent dynamics. The use of abstract models contrasts with much recent work in robot learning that uses a high-fidelity simulator and reinforcement learning (RL) to obtain policies for physical robots. Motivated by this difference, we investigate the extent to which so-called abstract simulators can be used for multiagent reinforcement learning (MARL) and the resulting policies successfully deployed on teams of physical robots. An abstract simulator models the robot's target task at a high-level of abstraction and discards many details of the world that could impact optimal decision-making. Policies are trained in an abstract simulator then transferred to the physical robot by making use of separately-obtained low-level perception and motion control modules. We identify three key categories of modifications to the abstract simulator that enable policy transfer to physical robots: simulation fidelity enhancements, training optimizations and simulation stochasticity. We then run an empirical study with extensive ablations to determine the value of each modification category for enabling policy transfer in cooperative robot soccer tasks. We also compare the performance of policies produced by our method with a well-tuned non-learning-based behavior architecture from the annual RoboCup competition and find that our approach leads to a similar level of performance. Broadly we show that MARL can be use to train cooperative physical robot behaviors using highly abstract models of the world.

Abstract:
High-resolution visuotactile sensors provide detailed contact information that is promising to infer the physical properties of objects in contact. This paper introduces a novel technique for high-resolution stiffness estimation of heterogeneous deformable objects using the Punyo bubble sensor. We developed an observation model for dense contact forces to estimate object stiffness using a visuotactile sensor and a dense force estimator. Additionally, we propose a neural Volumetric Stiffness Field (VSF) formulation that represents stiffness as a continuous function, which allows dynamic point sampling at visuotactile sensor observation resolution. The neural VSF significantly reduces artifacts commonly associated with traditional point-based methods, particularly in stiff inclusion estimation and heterogeneous stiffness estimation. We further apply our method in a blind localization task, where objects within opaque bags are accurately modeled and localized, demonstrating the superior performance of neural VSF compared to existing techniques. Project page: https://hjh371.github.io/Neural-VSF/.

Abstract:
Deep Reinforcement Learning (RL) has shown remarkable success in robotics with complex and heterogeneous dynamics. However, its vulnerability to unknown disturbances and adversarial attacks remains a significant challenge. In this paper, we propose a robust policy training framework that integrates model-based control principles with adversarial RL training to improve robustness without the need for external black-box adversaries. Our approach introduces a novel Hamilton-Jacobi reachability-guided disturbance for adversarial RL training, where we use interpretable worst-case or near-worst-case disturbances as adversaries against the robust policy. We evaluated its effectiveness across three distinct tasks: a reach-avoid game in both simulation and real-world settings, and a highly dynamic quadrotor stabilization task in simulation. We validate that our learned critic network is consistent with the ground-truth HJ value function, while the policy network shows comparable performance with other learning-based methods.

Abstract:
We address the problem of stable and robust control of vehicles with lateral error dynamics for the application of lane keeping. Lane departure is the primary reason for half of the fatalities in road accidents, making the development of stable, adaptive and robust controllers a necessity. Any disturbance or uncertainty introduced to the steering-angle input can be catastrophic for the vehicle. Therefore, controllers must be developed to actively handle such uncertainties. In this work, we introduce a Neural \mathcalL_1 Adaptive controller (Neural-L1) which learns the uncertainties in the lateral error dynamics of a front-steered Ackermann vehicle and guarantees stability and robustness. Our contributions are threefold: i) We extend the theoretical results for guaranteed stability and robustness of conventional \mathcalL_1 Adaptive controllers to Neural-L1; ii) We implement a Neural-L1 for the lane keeping application which learns uncertainties in the dynamics accurately; iii) We evaluate the performance of Neural-L1 on a physics-based simulator, PyBullet, and conduct extensive real-world experiments with the FlTENTH platform to demonstrate superior reference trajectory tracking performance of Neural-L1 compared to other state-of-the-art controllers, in the presence of uncertainties. Our project page, including supplementary material and videos, can be found at https://mukhe027.github.io/Neural-Adaptive-Control/

Abstract:
We present the BETTY dataset, a large-scale, multi-modal dataset collected on several autonomous racing vehicles, targeting supervised and self-supervised state estimation, dynamics modeling, motion forecasting, perception, and more. Existing large-scale datasets, especially autonomous vehicle datasets, focus primarily on supervised perception, planning, and motion forecasting tasks. Our work enables multi-modal, data-driven methods by including all sensor inputs and the outputs from the software stack, along with semantic metadata and ground truth information. The dataset encompasses 4 years of data, currently comprising over 13 hours and 32 TB, collected on autonomous racing vehicle platforms. This data spans 6 diverse racing environments, including high-speed oval courses, for single and multi-agent algorithm evaluation in feature-sparse scenarios, as well as high-speed road courses with high longitudinal and lateral accelerations and tight, GPSdenied environments. It captures highly dynamic states, such as 63 \mathrmm / \mathrms crashes, loss of tire traction, and operation at the limit of stability. By offering a large breadth of cross-modal and dynamic data, the BETTY dataset enables the training and testing of full autonomy stack pipelines, pushing the performance of all algorithms to the limits. The current dataset is available at https://pitt-mit-iac.github.io/betty-dataset/.

Abstract:
Hip exoskeletons are increasing in popularity due to their effectiveness across various scenarios and their ability to adapt to different users. However, personalizing the assistance often requires lengthy tuning procedures and computationally intensive algorithms, and most existing methods do not incorporate user feedback. In this work, we propose a novel approach for rapidly learning users' preferences for hip exoskeleton assistance. We perform pairwise comparisons of distinct randomly generated assistive profiles, and collect participants preferences through active querying. Users' feed-back is integrated into a preference-learning algorithm that updates its belief, learns a user-dependent reward function, and changes the assistive torque profiles accordingly. Results from eight healthy subjects display distinct preferred torque profiles, and users' choices remain consistent when compared to a perturbed profile. A comprehensive evaluation of users' preferences reveals a close relationship with individual walking strategies. The tested torque profiles do not disrupt kinematic joint synergies, and participants favor assistive torques that are synchronized with their movements, resulting in lower negative power from the device. This straightforward approach enables the rapid learning of users preferences and rewards, grounding future studies on reward-based human-exoskeleton interaction.

Abstract:
Redesigning and remanufacturing robots are infeasible for resource-constrained environments like space or undersea. This work thus studies how to evaluate and repurpose existing, complementary, quadruped legs for new tasks. We implement this approach on 15 robot designs generated from combining six pre-selected leg designs. The performance maps for force-based locomotion tasks like pulling, pushing, and carrying objects are constructed via a learned policy that works across all designs and adapts to the limits of each. Performance predictions agree well with real-world validation results. The robot can locomote at 0.5 body lengths per second while exerting a force that is almost 60% of its weight.

Abstract:
LiDAR-based 3D object detection presents significant challenges due to the inherent sparsity of LiDAR points. A common solution involves long-term temporal LiDAR data to densify the inputs. However, efficiently leveraging spatial-temporal information remains an open problem. In this paper, we propose a novel Semantic-Supervised Spatial-Temporal Fusion (ST-Fusion) method, which introduces a novel fusion module to relieve the spatial misalignment caused by the object motion over time and a feature-level semantic supervision to sufficiently unlock the capacity of the proposed fusion module. Specifically, the ST- Fusion consists of a Spatial Aggregation (SA) module and a Temporal Merging (TM) module. The SA module employs a convolutional layer with progressively expanding receptive fields to aggregate the object features from the local regions to alleviate the spatial misalignment, the TM module dynamically extracts object features from the preceding frames based on the attention mechanism for a comprehensive sequential presentation. Besides, in the semantic supervision, we propose a Semantic Injection method to enrich the sparse LiDAR data via injecting the point-wise semantic labels, using it for training a teacher model and providing a reconstruction target at the feature level supervised by the proposed object-aware loss. Extensive experiments on various LiDAR-based detectors demonstrate the effectiveness and universality of our proposal, yielding an improvement of approximately +2.8% in NDS based on the nuScenes benchmark.

Abstract:
This work establishes the concept of commonsense scene composition, with a focus on extending Belief Scene Graphs by estimating the spatial distribution of unseen objects. Specifically, the commonsense scene composition capability refers to the understanding of the spatial relationships among related objects in the scene, which in this article is modeled as a joint probability distribution for all possible locations of the semantic object class. The proposed framework includes two variants of a Correlation Information (CECI) model for learning probability distributions: (i) a baseline approach based on a Graph Convolutional Network, and (ii) a neuro-symbolic extension that integrates a spatial ontology based on Large Language Models (LLMs). Furthermore, this article provides a detailed description of the dataset generation process for such tasks. Finally, the framework has been validated through multiple runs on simulated data, as well as in a real-world indoor environment, demonstrating its ability to spatially interpret scenes across different room types. For a video of the article, showcasing the experimental demonstration, please refer to the following link: https://youtu.be/f0tqtPVFZ2A

Abstract:
We introduce Berkeley Humanoid, a reliable and low-cost mid-scale humanoid research platform for learningbased control. Our lightweight, in-house-built robot is designed specifically for learning algorithms with accurate simulation, low simulation complexity, anthropomorphic motion, and high reliability against falls. The narrow sim-to-real gap enables agile and robust locomotion across various terrains in outdoor environments, achieved with a simple reinforcement learning controller using light domain randomization. Furthermore, we demonstrate the robot traversing for hundreds of meters, walking on a steep unpaved trail, and hopping with single and double legs as a testimony to its high performance in dynamic walking. Capable of omnidirectional locomotion and withstanding large perturbations with a compact setup, our system aims for rapid sim-to-real deployment of learningbased humanoid systems. Please check our website https:// berkeley-humanoid.com/ and code https://github. com/HybridRobotics/isaac_berkeley_humanoid/.

Abstract:
Assume that a target is known to be present at an unknown point among a finite set of locations in the plane. We search for it using a mobile robot that has imperfect sensing capabilities. It takes time for the robot to move between locations and search a location; we have a total time budget within which to conduct the search. We study the problem of computing a search path/strategy for the robot that maximizes the probability of detection of the target. Considering non-uniform travel times between points (e.g., based on the distance between them) is crucial for search and rescue applications; such problems have been investigated to a limited extent due to their inherent complexity. In this paper, we describe fast algorithms with performance guarantees for this search problem and some variants, complement them with complexity results, and perform experiments to characterize their performance.

Abstract:
Modern incarnations of tactile sensors produce high-dimensional raw sensory feedback such as images, making it challenging to efficiently store, process, and generalize across sensors. To address these concerns, we introduce a novel implicit function representation for tactile sensor feedback. Rather than directly using raw tactile images, we propose neural implicit functions trained to reconstruct the tactile dataset, producing compact representations that capture the underlying structure of the sensory inputs. These representations offer several advantages over their raw counterparts: they are compact, enable probabilistically interpretable inference, and facilitate generalization across different sensors. We demonstrate the efficacy of this representation on the downstream task of in-hand object pose estimation, achieving improved performance over image-based methods while simplifying downstream models. We release code, demos and datasets at https://www.mmintlab.com/tactile-functasets.

Abstract:
Online object detection and tracking are crucial for embodied intelligence systems, including autonomous vehicles and robotics. Traditional approaches employ a pipeline structure to perform detection and tracking separately, which can not fully leverage information from the detector. Moreover, most prior tracking methods rely on motion models such as constant velocity for state updates, which can lead to incorrect associations when the velocity estimates are inaccurate. To address these limitations, we propose ConTrack3D, an online tracking approach that jointly performs detection and tracking in an end-to-end manner. Specifically, ConTrack3D incorporates a Joint Encoder module to capture detection embeddings and a Temporal Extender module for data-driven state updates. By employing contrastive learning, ConTrack3D learns discriminative tracking representation for more accurate association. ConTrack3D is evaluated on the nuScenes benchmark, and the experimental results demonstrate its significant improvements in tracking performance.

Abstract:
Accurate and realistic 3D scene reconstruction enables the lifelike creation of autonomous driving simulation environments. With advancements in 3D Gaussian Splatting (3DGS), previous studies have applied it to reconstruct complex dynamic driving scenes. These methods typically require expensive LiDAR sensors and pre-annotated datasets of dynamic objects. To address these challenges, we propose OG-Gaussian, a novel approach that replaces LiDAR point clouds with Occupancy Grids (OGs) generated from surround-view camera images using Occupancy Prediction Network (ONet). Our method leverages the semantic information in OGs to separate dynamic vehicles from static street background, converting these grids into two distinct sets of initial point clouds for reconstructing both static and dynamic objects. Additionally, we estimate the trajectories and poses of dynamic objects through a learning-based approach, eliminating the need for complex manual annotations. Experiments on Waymo Open dataset demonstrate that OG-Gaussian is on par with the current state-of-the-art in terms of reconstruction quality and rendering speed, achieving an average PSNR of 35.13 and a rendering speed of 143 FPS, while significantly reducing computational costs and economic overhead.

Abstract:
Imitation learning (IL) has shown great success in learning complex robot manipulation tasks. However, there remains a need for practical safety methods to justify widespread deployment. In particular, it is important to certify that a system obeys hard constraints on unsafe behavior in settings when it is unacceptable to design a tradeoff between performance and safety via tuning the policy (i.e. soft constraints). This leads to the question, how does enforcing hard constraints impact the performance (meaning safely completing tasks) of an IL policy? To answer this question, this paper builds a reach ability - based safety filter to enforce hard constraints on IL, which we call Reachability-Aided Imitation Learning (RAIL). Through evaluations with state-of-the-art IL policies in mobile robots and manipulation tasks, we make two key findings. First, the highest-performing policies are sometimes only so because they frequently violate constraints, and significantly lose performance under hard constraints. Second, surprisingly, hard constraints on the lower-performing policies can occasionally increase their ability to perform tasks safely. Finally, hardware evaluation confirms the method can operate in real time. More results can be found at our website: https://safe-robotics-lab-gt.github.io/rail/.

Abstract:
We investigate the problem of teaching a robot manipulator to perform dynamic non-prehensile object transport, also known as the ‘robot waiter’ task, from a limited set of real-world demonstrations. We propose an approach that combines batch reinforcement learning (RL) with modelpredictive control (MPC) by pretraining an ensemble of value functions from demonstration data, and utilizing them online within an uncertainty-aware MPC scheme to ensure robustness to limited data coverage. Our approach is straightforward to integrate with off-the-shelf MPC frameworks and enables learning solely from task space demonstrations with sparsely labeled transitions, while leveraging MPC to ensure smooth joint space motions and constraint satisfaction. We validate the proposed approach through extensive simulated and real-world experiments on a Franka Panda robot performing the robot waiter task and demonstrate robust deployment of value functions learned from 50 – 100 demonstrations. Furthermore, our approach enables generalization to novel objects not seen during training and can improve upon suboptimal demonstrations. We believe that such a framework can reduce the burden of providing extensive demonstrations and facilitate rapid training of robot manipulators to perform non-prehensile manipulation tasks. Project videos and supplementary material can be found at: https://sites.google.com/view/cvmpc

Abstract:
Although existing 3D perception algorithms have demonstrated significant improvements in performance, their deployment on edge devices continues to encounter critical challenges due to substantial runtime latency. We propose a new benchmark tailored for online evaluation by considering runtime latency. Based on the benchmark, we build a Latency-Aware 3D Streaming Perception (LASP) framework that addresses the latency issue through two primary components: 1) latency-aware history integration, which extends query propagation into a continuous process, ensuring the integration of historical feature regardless of varying latency; 2) latency-aware predictive detection, a module that compensates the detection results with the predicted trajectory and the posterior accessed latency. By incorporating the latency-aware mechanism, our method shows generalization across various latency levels, achieving an online performance that closely aligns with 80% of its offline evaluation on the Jetson AGX Orin without any acceleration techniques.

Abstract:
Useful robot control algorithms should not only achieve performance objectives but also adhere to hard safety constraints. Control Barrier Functions (CBFs) have been developed to provably ensure system safety through forward invariance. However, they often unnecessarily sacrifice performance for safety since they are purely reactive. Receding horizon control (RHC), on the other hand, consider planned trajectories to account for the future evolution of a system. This work provides a new perspective on safety-critical control by introducing Forward Invariance in Trajectory Spaces (FITS). We lift the problem of safe RHC into the trajectory space and describe the evolution of planned trajectories as a controlled dynamical system. Safety constraints defined over states can be converted into sets in the trajectory space which we render forward invariant via a CBF framework. We derive an efficient quadratic program (QP) to synthesize trajectories that provably satisfy safety constraints. Our experiments support that FITS improves the adherence to safety specifications without sacrificing performance over alternative CBF and NMPC methods.

Abstract:
Accurate object detection depends on the precise refinement of bounding box regression. Recent advancements in bounding box regression have introduced a variety of methodologies aimed at reducing the disparity between predicted and ground truth bounding boxes. The prevailing objective functions for bounding box regression typically encompass three key perspectives: i) Intersection over Union (IoU), ii) distance between central points, and iii) aspect ratio alignment. Nonetheless, these existing loss functions encounter two primary challenges including slow convergence of the distance term and aspect ratio variation irrelevant to bounding box localization. This paper presents two novel loss terms to address these challenges. Firstly, we introduce the concept of the Integral of Central-Gaussian, a novel approach that leverages the cumulative distribution function (CDF) derived from a closed-form Gaussian distribution based on the central points of bounding boxes. Secondly, we introduce an alternative aspect ratio representation by minimizing the angle between two bounding boxes in direct proportion to their IoU. We term this comprehensive loss function “Central-Gaussian Angle-IoU” (CA-IoU), seamlessly incorporating the Integral of Central-Gaussian with angle-based IoU. Extensive experiments on various models and benchmarks for object detection highlight the superior performance of CA-IoU loss compared to existing bounding box regression methods. The source code and the corresponding trained models will be made available.

Abstract:
Multimodal task specification is essential for enhanced robotic performance, where Cross-modality Alignment enables the robot to holistically understand complex task instructions. Directly annotating multimodal instructions for model training proves impractical, due to the sparsity of paired multimodal data. In this study, we demonstrate that by leveraging unimodal instructions abundant in real data, we can effectively teach robots to learn multimodal task specifications. First, we endow the robot with strong Crossmodality Alignment capabilities, by pretraining a robotic multimodal encoder using extensive out-of-domain data. Then, we employ two Collapse and Corrupt operations to further bridge the remaining modality gap in the learned multimodal representation. This approach projects different modalities of identical task goal as interchangeable representations, thus enabling accurate robotic operations within a well-aligned multimodal latent space. Evaluation across more than 130 tasks and 4000 evaluations on both simulated LIBERO benchmark and real robot platforms showcases the superior capabilities of our proposed framework, demonstrating significant potential in overcoming data constraints in robotic learning. Website: zh1hao.wang/Robo_MUTUAL

Abstract:
Narrow passage path planning is a prevalent problem from industrial to household sites, often facing difficulties in finding feasible paths or requiring excessive computational resources. Given that deep penetration into the environment can cause optimization failure, we propose a framework to ensure feasibility throughout the process using a series of subproblems tailored for narrow passage problem. We begin by decomposing the environment into convex objects and initializing collision constraints with a subset of these objects. By continuously interpolating the collision constraints through the process of sequentially introducing remaining objects, our proposed framework generates subproblems that guide the optimization toward solving the narrow passage problem. Several examples are presented to demonstrate how the proposed framework addresses narrow passage path planning problems.

Abstract:
Multi-stage deformable linear object manipulation, such as cable routing, is the common and necessary part of human life and industry. However, autonomous robots still lack the dexterity and generalization required for these complex tasks. Direct teleoperation is an alternative approach, but the absence of reliable force and haptic feedback methods undermines its robustness and efficiency. This paper proposes a shared control method based on tactile sensing to address a multi-stage, contact-rich cable routing task. The proposed method allows human and robotic autonomy to share control of the robot platform. An action primitive vocabulary is constructed, incorporating adaptive authority allocation between human and autonomy, to generate motions for specific task stages. These allocations modulate the control weights of human and autonomy in accordance with the requirements of task stages. The method selects primitives from this vocabulary based on the tactile data and human intention. The effectiveness of our approach is demonstrated through a task involving straightening a cable and slotting it into a clip. We compare its performance with alternative methods and present that our method has a higher success rate and takes less time than direct teleoperation.

Abstract:
In this paper, we propose a Contact Diffusion Model (CDM), a novel learning-based approach for multi-contact point localization. We consider a robot equipped with joint torque sensors and a force/torque sensor at the base. By leveraging a diffusion model, CDM addresses the singularity where multiple pairs of contact points and forces produce identical sensor measurements. We formulate CDM to be conditioned on past model outputs to account for the time-dependent characteristics of the multi-contact scenarios. Moreover, to effectively address the complex shape of the robot surfaces, we incorporate the signed distance field in the denoising process. Consequently, CDM can localize contacts at arbitrary locations with high accuracy. Simulation and real-world experiments demonstrate the effectiveness of the proposed method. In particular, CDM operates at 15.97ms and, in the real world, achieves an error of 0.44cm in single-contact scenarios and 1.24cm in dual-contact scenarios.

Abstract:
Diffusion models have seen rapid adoption in robotic imitation learning, enabling autonomous execution of complex dexterous tasks. However, action synthesis is often slow, requiring many steps of iterative denoising, limiting the extent to which models can be used in tasks that require fast reactive policies. To sidestep this, recent works have explored how the distillation of the diffusion process can be used to accelerate policy synthesis. However, distillation is computationally expensive and can hurt both the accuracy and diversity of synthesized actions. We propose SDP (Streaming Diffusion Policy), an alternative method to accelerate policy synthesis, leveraging the insight that generating a partially denoised action trajectory is substantially faster than a full output action trajectory. At each observation, our approach outputs a partially denoised action trajectory with variable levels of noise corruption, where the immediate action to execute is noise-free, with subsequent actions having increasing levels of noise and uncertainty. The partially denoised action trajectory for a new observation can then be quickly generated by applying a few steps of denoising to the previously predicted noisy action trajectory (rolled over by one timestep). We illustrate the efficacy of this approach, dramatically speeding up policy synthesis while preserving performance across both simulated and real-world settings. Project website: https://streaming-diffusion-policy.github.io.

Abstract:
We introduce SPOT, an object-centric imitation learning framework. The key idea is to capture each task by an object-centric representation, specifically the SE(3) object pose trajectory relative to the target. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations, as well as crossembodiment generalization. Additionally, object pose trajectories inherently capture planning constraints from demonstrations without the need for manually-crafted rules. To guide the robot in executing the task, the object trajectory is used to condition a diffusion policy. We systematically evaluate our method on simulation and real-world tasks. In real-world evaluation, using only eight demonstrations shot on an iPhone, our approach completed all tasks while fully complying with task constraints. Project page: https://nvlabs.github.io/object_centric_diffusion

Abstract:
3D occupancy prediction has recently emerged as a new paradigm for holistic 3D scene understanding and provides valuable information for downstream planning in autonomous driving. Most existing methods, however, are computationally expensive, requiring costly attention-based 2D- 3D transformation and 3D feature processing. In this paper, we present a novel 3D occupancy prediction approach, H30, which features highly efficient architecture designs that incur a significantly lower computational cost as compared to the current state-of-the-art methods. In addition, to compensate for the ambiguity in ground-truth 3D occupancy labels, we advocate leveraging auxiliary tasks to complement the direct 3D supervision. In particular, we integrate multi-camera depth estimation, semantic segmentation, and surface normal estimation via differentiable volume rendering, supervised by corresponding 2D labels that introduces rich and heterogeneous supervision signals. We conduct extensive experiments on the Occ3D-nuScenes and SemanticKITTI benchmarks that demonstrate the superiority of our proposed H30.

Abstract:
In recent years, with the rapid development of Advanced Driver Assistance Systems (ADAS), the demand for the precise and efficient surround view stitching system has significantly increased. Traditional stitching methods perform well in small single-unit vehicles with stable camera poses. However, the stitching quality sharply degrades when applied to large tractor-trailers due to the continuous pose changes caused by the non-rigid connection between the tractor and trailer. In detail, first, the extended length of tractor-trailers results in low overlap between cameras, making feature extraction and matching challenging. Additionally, the stitched images often appear irregular, detracting from visual quality. Besides, even if static stitching looks natural, it causes jitter in dynamic scenarios due to random feature extraction. In this paper, we propose an unsupervised deep stitching method for tractor-trailer surround view system. We introduce a feature extraction module for tractor-trailer scenarios (FMT) to enhance feature extraction in low-overlap situations. Besides, we design a spatio-temporally consistent control point constraint strategy (STCC) to achieve spatial shape preservation and temporal smoothing effects, resulting in visually consistent and stable stitched sequences. Experimental results from both public and real dataset show that our method efficiently completes tractor-trailer surround view stitching, producing well-aligned and natural panoramic images compared to previous methods.

Abstract:
This paper investigates payload grasping from a moving platform using a hook-equipped aerial manipulator. First, a computationally efficient trajectory optimization based on complementarity constraints is proposed to determine the optimal grasping time. To enable application in complex, dynamically changing environments, the future motion of the payload is predicted using a physics simulator-based model. The success of payload grasping under model uncertainties and external disturbances is formally verified through a robustness analysis method based on integral quadratic constraints. The proposed algorithms are evaluated in a high-fidelity physical simulator, and in real flight experiments using a customdesigned aerial manipulator platform.

Abstract:
This paper presents Lite VLoc, a hierarchical vi-sual localization framework that uses a lightweight topo-metric map to represent the environment. The method consists of three sequential modules that estimate camera poses in a coarse-to-fine manner. Unlike dense 3D mapping methods, LiteVLoc reduces storage by avoiding geometric reconstruction. It uses a learning-based feature matcher to establish dense correspondences between sparse keyframes and observations, and then refines poses with a geometric solver, enabling robustness to viewpoint changes. The system assumes depth sensors or stereo camera for deployment. A novel dataset for the map-free relocalization task is also introduced. Extensive experiments including localization and navigation in both simulated and real-world scenarios have validate the system's performance and demonstrated its precision and efficiency for large-scale deployment. Code and data will be made publicly available at the webpage:https://rpl-cs-ucl.github.io/LiteVLoc.

Abstract:
Image-goal navigation enables a robot to reach the location where a target image was captured, using visual cues for guidance. However, current methods either rely heavily on data and computationally expensive learning-based approaches or lack efficiency in complex environments due to insufficient exploration strategies. To address these limitations, we propose Bayesian Embodied Image-goal Navigation Using Gaussian Splatting, a novel method that formulates ImageNav as an optimal control problem within a model predictive control framework. BEINGS leverages 3D Gaussian Splatting as a scene prior to predict future observations, enabling efficient, real-time navigation decisions grounded in the robot's sensory experiences. By integrating Bayesian updates, our method dynamically refines the robot's strategy without requiring extensive prior experience or data. Our algorithm is validated through extensive simulations and physical experiments, showcasing its potential for embodied robot systems in visually complex scenarios. Project Page: www.mwg.ink/BEINGS-web.

Abstract:
This paper presents a novel cascade nonlinear observer framework for inertial state estimation. It tackles the problem of intermediate state estimation when external localization is unavailable or in the event of a sensor outage. The proposed observer comprises two nonlinear observers based on a recently developed iteratively preconditioned gradient descent (IPG) algorithm. It takes the inputs via an IMU preintegration model where the first observer is a quaternion-based IPG. The output for the first observer is the input for the second observer, estimating the velocity and, consequently, the position. The proposed observer is validated on a public underwater dataset and a real-world experiment using our robot platform. The estimation is compared with an extended Kalman filter (EKF) and an invariant extended Kalman filter (InEKF). Results demonstrate that our method outperforms these methods regarding better positional accuracy and lower variance.

Abstract:
We address the challenge of effectively controlling the locomotion of legged robots by incorporating precise frequency and phase characteristics, which is often ignored in locomotion policies that do not account for the periodic nature of walking. We propose a hierarchical architecture that integrates a low-level phase tracker, oscillators, and a high-level phase modulator. This controller allows quadruped robots to walk in a natural manner that is synchronized with external musical rhythms. Our method generates diverse gaits across different frequencies and achieves real-time synchronization with music in the physical world. This research establishes a foundational framework for enabling real-time execution of accurate rhythmic motions in legged robots. The video and code are available at https://music-walker.github.io/.

Abstract:
Humanoid robots require both robust lower-body locomotion and precise upper-body manipulation. While recent Reinforcement Learning (RL) approaches provide whole-body loco-manipulation policies, they lack precise manipulation with high DoF arms. In this paper, we propose decoupling upper-body control from locomotion, using inverse kinematics (IK) and motion retargeting for precise manipulation, while RL focuses on robust lower-body locomotion. We introduce PMP (Predictive Motion Priors), trained with Conditional Variational Autoencoder (CVAE) to effectively represent upper-body motions. The locomotion policy is trained conditioned on this upper-body motion representation, ensuring that the system re-mains robust with both manipulation and locomotion. We show that CVAE features are crucial for stability and robustness, and significantly outperforms RL-based whole-body control in precise manipulation. With precise upper-body motion and robust lower-body locomotion control, operators can remotely control the humanoid to walk around and explore different environments, while performing diverse manipulation tasks.

Abstract:
Grasping has been a long-standing challenge in facilitating the final interface between a robot and the environment. As environments and tasks become complicated, the need to embed higher intelligence to infer from the surroundings and act on them has become necessary. Although most methods utilize techniques to estimate grasp pose by treating the problem via pure sampling-based approaches in the six-degree-of-freedom space or as a learning problem, they usually fail in real-life settings owing to poor generalization across domains. In addition, the time taken to generate the grasp plan and the lack of repeatability, owing to sampling inefficiency and the probabilistic nature of existing grasp planning approaches, severely limits their application in real-world tasks. This paper presents a lightweight analytical approach towards robotic grasp planning, particularly antipodal grasps, with little to no sampling in the six-degree-of-freedom space. The proposed grasp planning algorithm is formulated as an optimization problem towards estimating grasp points on the object surface instead of directly estimating the endeffector pose. To this extent, a soft-region-growing algorithm is presented for effective plane segmentation, even in the case of curved surfaces. An optimization-based quality metric is then used for the evaluation of grasp points to ensure indirect force closure. The proposed grasp framework is compared with the existing state-of-the-art grasp planning approach, Grasp pose detection (GPD), as a baseline over multiple simulated objects. The effectiveness of the proposed approach in comparison to GPD is also evaluated in a real-world setting using image and point-cloud data, with the planned grasps being executed using a ROBOTIQ gripper and UR5 manipulator. The proposed approach shows better performance in terms of higher probability for force closure with complete repeatability.

Abstract:
Transparent object grasping remains a persistent challenge in robotics, largely due to the difficulty of acquiring precise 3D information. Conventional optical 3D sensors struggle to capture transparent objects, and machine learning methods are often hindered by their reliance on high-quality datasets. Leveraging NeRF's capability for continuous spatial opacity modeling, our proposed architecture integrates a NeRF-based approach for reconstructing the 3D information of transparent objects. Despite this, certain portions of the reconstructed 3D information may remain incomplete. To address these deficiencies, we introduce a shape-prior-driven completion mechanism, further refined by a geometric pose estimation method we have developed. This allows us to obtain a complete and reliable 3D information of transparent objects. Utilizing this refined data, we perform scene-level grasp prediction and deploy the results in real-world robotic systems. Experimental validation demonstrates the efficacy of our architecture, showcasing its capability to reliably capture 3D information of various transparent objects in cluttered scenes, and correspondingly, achieve high-quality, stable, and executable grasp predictions.

Abstract:
Adapting reinforcement learning (RL) policies to various robot body configurations is a significant challenge for creating flexible autonomous systems. This study presents a novel framework that integrates Reservoir Computing (RC) with the First-Order Reduced and Controlled Error (FORCE) learning rule to enhance policy adaptability in RL. The RC serves as a dynamic feature extractor, capturing temporal dependencies by pre-training on state transitions generated through random actions. This pre-training acts as regularization, reducing variance and preventing overfitting to specific configurations Subsequently, the control policy network is trained on a limited set of body variations using the enriched features from the RC. Experimental results across three distinct environments demonstrate that the proposed RC+FORCE framework significantly improves policy performance and adaptability to unseen robot configurations compared to traditional reinforcement learning through domain randomization. These findings highlight the effectiveness of combining RC-based feature extraction with FORCE-based training in developing robust RL agents.

Abstract:
Multimodal end-to-end autonomous driving has shown promising advancements in recent work. By embedding more modalities into end-to-end networks, the system's understanding of both static and dynamic aspects of the driving environment is enhanced, thereby improving the safety of autonomous driving. In this paper, we introduce METDrive, an end-to-end system that leverages temporal guidance from the embedded time series features of ego states, including rotation angles, steering, throttle signals, and waypoint vectors. The geometric features derived from the perception sensor data and the time series features of ego state data jointly guide the waypoint prediction with the proposed temporal guidance loss function. We evaluated METDrive on the CARLA leaderboard benchmarks, achieving a driving score of 70%, a route completion score of 94%, and an infraction score of 0.78.

Abstract:
We present LidarDM, a novel LiDAR generative model capable of producing realistic, layout-aware, physically plausible, and temporally coherent LiDAR videos. LidarDM stands out with two unprecedented capabilities in LiDAR generative modeling: (i) LiDAR generation guided by driving scenarios, offering significant potential for autonomous driving simulations, and (ii) 4D LiDAR point cloud generation, enabling the creation of realistic and temporally coherent sequences. At the heart of our model is a novel integrated 4D world generation framework. Specifically, we employ latent diffusion models to generate the 3D scene, combine it with dynamic actors to form the underlying 4D world, and subsequently produce realistic sensory observations within this virtual environment. Our experiments indicate that our approach outperforms competing algorithms in realism, temporal coherency, and layout consistency. We additionally show that LidarDM can be used as a generative world model simulator for training and testing perception models. We release our source code and checkpoints at https://github.com/vzyrianov/LidarDM

Abstract:
4D millimeter-wave (MMW) radar, which provides both height information and dense point cloud data over 3D MMW radar, has become increasingly popular in 3D object detection. In recent years, radar-vision fusion models have demonstrated performance close to that of LiDAR-based models, offering advantages in terms of lower hardware costs and better resilience in extreme conditions. However, many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information. Additionally, these multi-modal networks are often sensitive to the failure of a single modality, particularly vision. To address these challenges, we propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process, enhancing the quality of visual Bird's-Eye View (BEV) features. We further introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities using shared module. To assess the robustness of multimodal models, we develop a novel Failure Test (FT) ablation experiment, which simulates vision modality failure by injecting Gaussian noise. We conduct extensive experiments on the View-of-Delft (VoD) and TJ4D datasets. The results demonstrated that our proposed Unified BEVFusion (UniBEVFusion) network significantly outperforms state-of-the-art models on the TJ4D dataset, with improvements of 3.96% in 3D and 4.17% in BEV object detection accuracy.

Abstract:
This paper introduces a safety filter to ensure collision avoidance for multirotor aerial robots. The proposed formalism leverages a single Composite Control Barrier Function from all position constraints acting on a third-order nonlinear representation of the robot's dynamics. We analyze the recursive feasibility of the safety filter under the composite constraint and demonstrate that the infeasible set is negligible. The proposed method allows computational scalability against thousands of constraints and, thus, complex scenes with numerous obstacles. We experimentally demonstrate its ability to guarantee the safety of a quadrotor with an onboard LiDAR, operating in both indoor and outdoor cluttered environments against both naive and adversarial nominal policies.

Abstract:
Scene flow enables an understanding of the motion characteristics of the environment in the 3D world. It gains particular significance in the long-range, where object-based perception methods might fail due to sparse observations far away. Although significant advancements have been made in scene flow pipelines to handle large-scale point clouds, a gap remains in scalability with respect to long-range. We attribute this limitation to the common design choice of using dense feature grids, which scale quadratically with range. In this paper, we propose Sparse Scene Flow (SSF), a general pipeline for long-range scene flow, adopting a sparse convolution based backbone for feature extraction. This approach introduces a new challenge: a mismatch in size and ordering of sparse feature maps between time-sequential point scans. To address this, we propose a sparse feature fusion scheme, that augments the feature maps with virtual voxels at missing locations. Additionally, we propose a range-wise metric that implicitly gives greater importance to faraway points. Our method, SSF, achieves state-of-the-art results on the Argoverse2 dataset, demonstrating strong performance in long-range scene flow estimation. Our code is open-sourced at https://github.com/KTH-RPL/SSF.git.

Abstract:
We present a hybrid centralized-decentralized planning algorithm for a multi-robot system consisting of a Mothership robot and multiple Passenger robots. In this system, the Passenger robots execute tasks while the Mothership provides support. This paper addresses the challenge of planning Passenger robot movements, framing it as a Stochastic Multi-Agent Orienteering Problem (SMOP) complicated by factors like stochastic operational efforts and disruptive events. We optimize the task completion efficiency of the system by combining centralized solutions from the Mothership with local plans from Passengers to enhance system resilience. Our contributions include defining the SMOP, developing a solution using Decentralized Monte Carlo Tree Search, presenting a hybrid algorithm that integrates centralized plans into the distributed framework, and evaluating the algorithm's performance in a simulated environment. Our results show that our hybrid approaches outperform fully centralized and fully distributed algorithms in highly-dynamic scenarios with up to a 26.6 % increase in task completion efficiency over baseline methods.

Abstract:
Collaborative perception (CP) is emerging as a promising solution to the inherent limitations of stand-alone intelligence. However, current wireless communication systems are unable to support feature-level and raw-level collaborative algorithms due to their enormous bandwidth demands. In this paper, we propose DiffCP, a novel CP paradigm that utilizes a diffusion model to efficiently compress the sensing information of collaborators. By incorporating both geometric and semantic conditions into the generative model, DiffCP enables feature-level collaboration with an ultra-low communication cost, advancing the practical implementation of CP systems. This paradigm can be seamlessly integrated into existing CP algorithms to enhance a wide range of downstream tasks. Through extensive experimentation, we investigate the tradeoffs between communication, computation, and performance. Numerical results demonstrate that DiffCP can significantly reduce communication costs by 14.5-fold while maintaining the same performance as the state-of-the-art algorithm.

Abstract:
In human-robot collaboration (HRC), it is crucial for robot agents to consider humans' knowledge of their surroundings. In reality, humans possess a narrow field of view (FOV), limiting their perception. However, research on HRC often overlooks this aspect and presumes an omniscient human collaborator. Our study addresses the challenge of adapting to the evolving subtask intent of humans while accounting for their limited FOV. We integrate FOV within the humanaware probabilistic planning framework. To account for large state spaces due to considering FOV, we propose a hierarchical online planner that efficiently finds approximate solutions while enabling the robot to explore low-level action trajectories that enter the human FOV, influencing their intended subtask. Through user study with our adapted cooking domain, we demonstrate our FOV-aware planner reduces human's interruptions and redundant actions during collaboration by adapting to human perception limitations. We extend these findings to a virtual reality kitchen environment, where we observe similar collaborative behaviors.

Abstract:
A major limitation of minimally invasive surgery is the difficulty in accurately locating the internal anatomical structures of the target organ due to the lack of tactile feedback and transparency. Augmented reality (AR) offers a promising solution to overcome this challenge. Numerous studies have shown that combining learning-based and geometric methods can achieve accurate preoperative and intraoperative data registration. This work proposes a real-time monocular 3D tracking algorithm for post-registration tasks. The ORBSLAM2 framework is adopted and modified for prior-based 3D tracking. The primitive 3D shape is used for fast initialization of the ORB-SLAM2 monocular mode. A pseudo-segmentation strategy is employed to separate the target organ from the background for tracking, and the 3D shape is incorporated as a geometric prior in its pose graph optimization. Experiments from in-vivo and ex-vivo tests demonstrate that the proposed 3D tracking system provides robust 3D tracking and effectively handles typical challenges such as fast motion, out-of-field-of-view scenarios, partial visibility, and “organ-background” relative motion.

Abstract:
Learning motor skills for sports or performance driving is often done with professional instruction from expert human teachers, whose availability is limited. Our goal is to enable automated teaching via a learned model that interacts with the student similar to a human teacher. However, training such automated teaching systems is limited by the availability of highquality annotated datasets of expert teacher and student interactions as they are difficult to collect at scale. To address this data scarcity problem, we propose an approach for training a coaching system for complex motor tasks such as high performance driving via a Multi-Task Imitation Learning (MTIL) paradigm. MTIL allows our model to learn robust representations by utilizing self-supervised training signals from more readily available non-interactive datasets of humans performing the task of interest. We validate our approach with (1) a semi-synthetic dataset created from real human driving trajectories, (2) a professional track driving instruction dataset, (3) a track-racing driving simulator human-subject study, and (4) a system demonstration on an instrumented car at a race track. Our experiments show that the right set of auxiliary machine learning tasks improves prediction of teaching instructions. Moreover, in the human subjects study, students exposed to the instructions from our teaching system improve their ability to stay within track limits, and show favorable perception of the model's interaction with them, in terms of usefulness and satisfaction.

Abstract:
Distribution shifts between operational domains can severely affect the performance of learned models in self-driving vehicles (SDVs). While this is a well-established problem, prior work has mostly explored naive solutions such as fine-tuning, focusing on the motion prediction task. In this work, we explore novel adaptation strategies for differentiable autonomy stacks (structured policy) consisting of prediction, planning, and control, perform evaluation in closed-loop, and investigate the often-overlooked issue of catastrophic forgetting. Specifically, we introduce two simple yet effective techniques: a low-rank residual decoder (LoRD) and multi-task fine-tuning. Through experiments across three models conducted on two real-world autonomous driving datasets (nuPlan, exiD), we demonstrate the effectiveness of our methods and highlight a significant performance gap between open-loop and closed-loop evaluation in prior approaches. Our approach improves forgetting by up to 23.33% and the closed-loop out-of-distribution driving score by 9.93% in comparison to standard fine-tuning. https://github.com/rst-tu-dortmund/LoRD

Abstract:
Contact estimation is a key ability for limbed robots, where making and breaking contacts has a direct impact on state estimation and balance control. Existing approaches typically rely on gait-cycle priors or designated contact sensors. We design a contact estimator that is suitable for the emerging wheeled-biped robot types that do not have these features. To this end, we propose a Bayes filter in which update steps are learned from real-robot torque measurements while prediction steps rely on inertial measurements. We evaluate this approach in extensive real-robot and simulation experiments. Our method achieves better performance while being considerably more sample efficient than a comparable deep-learning baseline.

Abstract:
Robotics and artificial intelligence hold significant potential for advancing precision agriculture. While robotic systems have been successfully deployed for various tasks, adapting them to perform diverse missions remains challenging, particularly because end users often lack technical expertise. In this paper, we present an end-to-end system that leverages large language models (LLMs), specifically ChatGPT, to enable users to assign complex data collection tasks to autonomous robots using natural language instructions. To enhance reusability, mission plans are encoded using an existing IEEE task specification standard, and are executed on robots via ROS2 nodes that bridge high-level mission descriptions with existing ROS libraries. Through extensive experiments, we highlight the strengths and limitations of LLMs in this context, particularly regarding spatial reasoning and solving complex routing challenges, and show how our proposed implementation-overcomes them.

Abstract:
We present a novel differentiable approach to quantifying and optimizing stability in robotic systems addressing an open challenge in the field of robot analysis, control, design, and optimization. Our method leverages differentiable simulation over extended time horizons to estimate a robustness metric based on the Lyapunov exponents. The proposed metric offers several properties, including a natural extension to limit cycles (commonly encountered in robotics tasks and locomotion) and independence from the trajectory path for states converging to the attractor. We showcase, with an ad-hoc JAX gradient-based optimization framework, remarkable flexibility in tackling the robustness challenge. Our approach is tested through diverse scenarios of varying complexity, encompassing high-degree-of-freedom systems and contact-rich environments. The positive outcomes across these cases highlight the potential of our method in quantifying and possibly enhancing system robustness.

Abstract:
Generalist robot manipulation policies (GMPs) have the potential to generalize across a wide range of tasks, devices, and environments. However, existing policies continue to struggle with out-of-distribution scenarios due to the inherent difficulty of collecting sufficient action data to cover extensively diverse domains. While fine-tuning offers a practical way to quickly adapt a GMPs to novel domains and tasks with limited samples, we observe that the performance of the resulting GMPs differs significantly with respect to the design choices of fine-tuning strategies. In this work, we first conduct an indepth empirical study to investigate the effect of key factors in GMPs fine-tuning strategies, covering the action space, policy head, supervision signal and the choice of tunable parameters, where 2,500 rollouts are evaluated for a single configuration. We systematically discuss and summarize our findings and identify the key design choices, which we believe give a practical guideline for GMPs fine-tuning. We observe that in a lowdata regime, with carefully chosen fine-tuning strategies, a GMPs significantly outperforms the state-of-the-art imitation learning algorithms. The results presented in this work establish a new baseline for future studies on fine-tuned GMPs.

Abstract:
The autonomous transportation of materials over challenging terrain is a challenge with major economic implications and remains unsolved. This paper introduces LEVA, a high-payload, high-mobility robot designed for autonomous logistics across varied terrains, including those typical in agriculture, construction, and search and rescue operations. LEVA uniquely integrates an advanced legged suspension system using parallel kinematics. It is capable of traversing stairs using a reinforcement learning (RL) controller, has steerable wheels, and includes a specialized box pickup mechanism that enables autonomous payload loading as well as precise and reliable cargo transportation of up to 85 kg across uneven surfaces, steps and inclines while maintaining a Cost of Transportation (CoT) of as low as 0.15. Through extensive experimental validation, LEVA demonstrates its off-road capabilities and reliability regarding payload loading and transport.

Abstract:
Recent advancements in learning-based multi-view stereo (MVS) have demonstrated significant improvements over traditional counterpart, primarily due to the extensive availability of multi-view training images with ground-truth metric depths in the terrestrial in-air domain. However, underwater multi-view stereo (UwMVS) faces substantial challenges arising from the domain gap between in-air and underwater environments, leading to degraded performance when applying in-air MVS models to underwater scenarios. Furthermore, the progress of learning-based UwMVS methods has been hindered by the scarcity of underwater multi-view images with ground-truth depth maps and point clouds. In this paper, we address these challenges by introducing a physically-guided approach for synthesizing underwater multi-view images and present the first large-scale UwMVS dataset for end-to-end training and evaluation of learning-based UwMVS methods. Furthermore, we propose a novel UwMVS network that enhances geometric cue encoding to achieve more accurate and complete point cloud reconstruction. Extensive experiments on our dataset and real-world underwater scenes demonstrate that our dataset enables the trained models for underwater dense reconstruction and that our method achieves state-of-the-art performance in underwater reconstruction. Dataset, code and appendix are available at: https://cuhk-usr-group.github.io/UwMVS/

Abstract:
Visual odometry (VO) in underwater environments presents significant challenges due to poor visibility and dynamic scene changes, which render conventional (in-air) VO solutions unsuitable for underwater applications. We propose an underwater robust monocular visual odometry (UR-MVO) pipeline tailored for underwater scenarios with feature extraction and matching based on SuperPoint and SuperGlue models, respectively. We enhance the robustness of the feature extractor through field-specific fine-tuning of the SuperPoint model using few-shot unsupervised learning. This tuning was done on real images of underwater scenes in order to enhance its performance in the harsh underwater image conditions. Moreover, we integrate semantic segmentation trained on underwater images into our pipeline to eliminate unreliable features belonging to dynamic objects and background. We evaluated the proposed solution on the Aqualoc dataset, demonstrating higher localization accuracy compared to other SOTA direct and feature-based monocular VO methods like DSO and SVO and also obtained very competitive results compared to more resource-intensive monocular VSLAM approaches with loop closure process like LDSO, UVS, and ORB-SLAM. The results show a high potential for our approach for further applications in underwater exploration and mapping using affordable sensory setups. We publish the code for the benefit of the community https://github.com/be2rlab/UR-MVO

Abstract:
This paper presents a new trajectory replanner for grasping irregular objects. Unlike conventional grasping tasks where the object's geometry is assumed simple, we aim to achieve a “dynamic grasp” of the irregular objects, which requires continuous adjustment during the grasping process. To effectively handle irregular objects, we propose a trajectory optimization framework that comprises two phases. Firstly, in a specified time limit of 10 s, initial offline trajectories are computed for a seamless motion from an initial configuration of the robot to grasp the object and deliver it to a predefined target location. Secondly, fast online trajectory optimization is implemented to update robot trajectories in real-time within 100 ms. This helps to mitigate pose estimation errors from the vision system. To account for model inaccuracies, disturbances, and other non-modeled effects, trajectory tracking controllers for both the robot and the gripper are implemented to execute the optimal trajectories from the proposed framework. The intensive experimental results effectively demonstrate the performance of our trajectory planning framework in both simulation and real-world scenarios.

Abstract:
Soft robotic grippers offer advantages over rigid end effectors but are typically coupled to a rigid robot for locomotion. In contrast, this paper details a soft robot for both locomotion and grasping. The system is a type of boundaryconstrained granular swarm robot, which is composed of a closed-loop series of active (capable of locomotion) sub-robots. Prior work has shown how this type of robot is capable of locomotion and grasping. For this paper, we propose a new grasping strategy and demonstrate real-time grasp quality evaluation using pressure sensors and the Ferrari-Canny grasp metric. The grasping strategy leverages gradient-based control via distance functions and dynamic system planning to achieve desired robot geometries for effective grasping. Previous research primarily used pull tests to evaluate grasping efficacy, which lacked realtime feedback on grasp quality. Simulated and experimental results confirm the effectiveness of this method.

Abstract:
Grasping large flat objects, such as books or keyboards lying horizontally, presents significant challenges for single-arm robotic systems, often requiring extra actions like pushing objects against walls or moving them to the edge of a surface to facilitate grasping. In contrast, dualarm manipulation, inspired by human dexterity, offers a more refined solution by directly coordinating both arms to lift and grasp the object without the need for complex repositioning. In this paper, we propose a model-free deep reinforcement learning (DRL) framework to enable dual-arm coordination for grasping large flat objects. We utilize a large scale grasp pose detection model as a backbone to extract high-dimensional features from input images, which are then used as the state representation in a reinforcement learning (RL) model. A CNNbased Proximal Policy Optimization (PPO) algorithm with shared Actor-Critic layers is employed to learn coordinated dual-arm grasp actions. The system is trained and tested in Isaac Gym and deployed to real robots. Experimental results demonstrate that our policy can effectively grasp large flat objects without requiring additional maneuvers. Furthermore, the policy exhibits strong generalization capabilities, successfully handling unseen objects. Importantly, it can be directly transferred to real robots without fine-tuning, consistently outperforming baseline methods. Videos of our experiments are available online: https://sites.google.com/view/grasplargeflat

Abstract:
The rise of intelligent autonomous systems, especially in robotics and autonomous agents, has created a critical need for robust communication middleware that can ensure real-time processing of extensive sensor data. Current robotics middleware like Robot Operating System (ROS) 2 faces challenges with nondeterminism and high communication latency when dealing with large data across multiple subscribers on a multi-core compute platform. To address these issues, we present High-Performance Robotic Middleware (HPRM), built on top of the deterministic coordination language Lingua Franca (LF). HPRM employs optimizations including an in-memory object store for efficient zero-copy transfer of large payloads, adaptive serialization to minimize serialization overhead, and an eager protocol with real-time sockets to reduce handshake latency. Benchmarks show HPRM achieves up to 114x lower latency than ROS2 when broadcasting large messages to multiple nodes. We then demonstrate the benefits of HPRM by integrating it with the CARLA simulator and running reinforcement learning agents along with object detection workloads. In the CARLA autonomous driving application, HPRM attains 91.1% lower latency than ROS2. The deterministic coordination semantics of HPRM, combined with its optimized IPC mechanisms, enable efficient and predictable real-time communication for intelligent autonomous systems. Code and videos can be found on our project page: https://hprm-robotics.github.io/HPRM

Abstract:
Automating design minimizes errors, accelerates the design process, and reduces cost. However, automating robot design is challenging due to recursive constraints, multiple design objectives, and cross-domain design complexity possibly spanning multiple abstraction layers. Here we look at the problem of component selection, a combinatorial optimization problem in which a designer, given a robot model, must select compatible components from an extensive catalog. The goal is to satisfy high-level task specifications while optimally balancing trade-offs between competing design objectives. In this paper, we extend our previous constraint programming approach to multi-objective design problems and propose the novel technique of monotone subsystem decomposition to efficiently compute a Pareto front of solutions for large-scale problems. We prove that subsystems can be optimized for their Pareto fronts and, under certain conditions, these results can be used to determine a globally optimal Pareto front. Furthermore, subsystems serve as an intuitive design abstraction and can be reused across various design problems. Using an example quadcopter design problem, we compare our method to a linear programming approach and demonstrate our method scales better for large catalogs, solving a multi-objective problem of 1025 component combinations in seconds. We then expand the original problem and solve a task-oriented, multi-objective design problem to build a fleet of quadcopters to deliver packages. We compute a Pareto front of solutions in seconds where each solution contains an optimal component-level design and an optimal package delivery schedule for each quadcopter.

Abstract:
Shared autonomy allows for combining the global planning capabilities of a human operator with the strengths of a robot such as repeatability and accurate control. In a real-time teleoperation setting, one possibility for shared autonomy is to let the human operator decide for the rough movement and to let the robot do fine adjustments, e.g., when the view of the operator is occluded. We present a learning-based concept for shared autonomy that aims at supporting the human operator in a real-time teleoperation setting. At every step, our system tracks the target pose set by the human operator as accurately as possible while at the same time satisfying a set of constraints which influence the robot's behavior. An important characteristic is that the constraints can be dynamically activated and deactivated which allows the system to provide task-specific assistance. Since the system must generate robot commands in real-time, solving an optimization problem in every iteration is not feasible. Instead, we sample potential target configurations and use Neural Networks for predicting the constraint costs for each configuration. By evaluating each configuration in parallel, our system is able to select the target configuration which satisfies the constraints and has the minimum distance to the operator's target pose with minimal delay. We evaluate the framework with a pick and place task on a bi-manual setup with two Franka Emika Panda robot arms with Robotiq grippers.

Abstract:
Game-theoretic models are effective tools for modeling multi-agent interactions, especially when robots need to coordinate with humans. However, applying these models requires inferring their specifications from observed behaviors—a challenging task known as the inverse game problem. Existing inverse game approaches often struggle to account for behavioral uncertainty and measurement noise, and leverage both offline and online data. To address these limitations, we propose an inverse game method that integrates a generative trajectory model into a differentiable mixed-strategy game framework. By representing the mixed strategy with a conditional variational autoencoder (CVAE), our method can infer high-dimensional, multi-modal behavior distributions from noisy measurements while adapting in real-time to new observations. We extensively evaluate our method in a simulated navigation benchmark, where the observations are generated by an unknown game model. Despite the model mismatch, our method can infer Nash-optimal actions comparable to those of the ground-truth model and the oracle inverse game baseline, even in the presence of uncertain agent objectives and noisy measurements.

Abstract:
Specifying tasks for robotic systems traditionally requires coding expertise, deep domain knowledge, and significant time investment. While learning from demonstration offers a promising alternative, existing methods often struggle with tasks of longer horizons. To address this limitation, we introduce a computationally efficient approach for learning probabilistic deterministic finite automata (PDFA) that capture task structures and expert preferences directly from demonstrations. Our approach infers sub-goals and their temporal dependencies, producing an interpretable task specification that domain experts can easily understand and adjust. We validate our method through experiments involving object manipulation tasks, showcasing how our method enables a robot arm to effectively replicate diverse expert strategies while adapting to changing conditions.

Abstract:
We present \mathbfR+\mathbfX, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, \mathbfR+\mathbfX first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, \mathbfR+\mathbfX does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that \mathbfR+\mathbfX succeeds at translating unlabelled human videos into robust robot skills, and that \mathbfR+\mathbfX outperforms several recent alternative methods. Appendix and videos are available at https://www.robot-learning.uk/r-plus-x.

Abstract:
Diffusion models have proven effective at generating high-quality images from learned distributions, but their application to the temporal domain, especially for driving scenarios, remains underexplored. Our work addresses key challenges in existing simulations, such as limited data quality, diversity, and high costs, by extending diffusion models to generate realistic long driving videos. We introduce the Dualconditioned Temporal Diffusion Model (DcTDM), an opensource method that incorporates dual conditioning to enforce temporal consistency by guiding frame transitions. Alongside DcTDM, we present DriveSceneDDM, a comprehensive driving video dataset featuring textual scene descriptions, dense depth maps, and canny edge data. We evaluate DcTDM using common video quality metrics, demonstrating its superior performance over other video diffusion models by producing long, temporally consistent driving videos up to 40s, achieving over 25% improvement in consistency and frame quality.

Abstract:
Imitation learning is a promising approach to acquiring autonomous driving policies by mimicking human driver behaviors. However, a major drawback of existing driving policies derived from imitation learning is their proneness to capturing spurious correlations, owing to the lack of an explicit causal model. Deploying such policies in unpredictable real-world environments poses severe risks, as spurious correlations may result in flawed decisions that compromise safety. To tackle this challenge, we introduce a novel approach called Multi-Task Invariant Representation Imitation Learning (MIRIL). MIRIL combines invariant learning with imitation learning to identify cross-environment invariant causal representations from driving demonstrations in various scenarios. These representations are then fed into multiple downstream branches for multi-task learning, including policy learning, perception prediction, invariant representation learning, and transition dynamics learning. Through the multi-task learning approach, the model not only makes consistent driving decisions across different environments but also perceives the vehicle's surroundings, thereby improving adaptability and robustness in diverse driving conditions. This enables MIRIL to effectively handle a wide range of driving scenarios, ensuring safety and efficiency. Supported by clear metrics, this paper details our comprehensive experimental setup, including datasets, benchmarks, and comparative analyses, underscoring the capability of MIRIL to significantly boost system generalization and excel in decision-making significantly.

Abstract:
The integration of 3D Gaussians has introduced a novel scene representation in Simultaneous Localization and Mapping (SLAM), characterized by explicit representation and differentiable rendering capabilities that enhance scene reconstruction and understanding. However, most current SLAM systems only exploit the basic representational capacity of 3D Gaussians, neglecting their potential to offer richer information and facilitate higher-dimensional scene comprehension. Furthermore, these systems often struggle with reconstruction when encountering rapid camera movements or depth missing. Drawing inspiration from 3D language field, which explores the intrinsic relationships among scene objects, we propose SAPSLAM, a dense SLAM system that combines high-fidelity reconstruction and advanced semantic understanding. Our approach leverages pre-trained visual models to extract semantic features, which are then fused, dimensionally reduced, and encoded into the 3D Gaussian model for optimization and rendering. The integration of these features improves the systems semantic comprehension and scene representation, ultimately enabling the creation of high-precision 3D semantic maps. Additionally, we introduce a semantic-guided Gaussian densification and pruning strategy, which uses semantic consistency to prioritize attention on poorly reconstructed areas, greatly improving performance in complex scenarios. SAP-SLAM achieves competitive results on both real-world and synthetic datasets, demonstrating superior capabilities in semantic understanding and reconstruction.

Abstract:
The flying speed of autonomous quadrotors has increased significantly in the field of autonomous drone racing. However, most research primarily focuses on the aggressive flight of a single quadrotor, simplifying the racing gate traversal problem to a waypoint passing problem that neglects the orientations of the racing gates or implicitly considers the waypoint direction during path planning. In this paper, we propose a systematic method called Pairwise Model Predictive Control (PMPC) that can guide two quadrotors online to navigate racing gates with minimal time and without collisions. The flight task is initially simplified as a point-mass model waypoint passing problem to provide time optimal reference through an efficient two-step velocity search method. Subsequently, we utilize the spatial configuration of the racing track to compute the optimal heading at each gate, maximizing the visibility of subsequent gates for the quadrotors. To address varying gate orientations, we introduce a novel Magnetic Induction Line-based spatial curve to guide the quadrotors through racing gates of different orientations. Furthermore, we formulate a nonlinear optimization problem that uses the point-mass trajectory as initial values and references to enhance solving efficiency. The feasibility of the proposed method is validated through both simulation and real-world experiments. In real-world tests, the two quadrotors achieved a top speed of 6.1m/s on a 7-waypoint racing track within a compact flying arena of 5m× 4m× 2m.

Abstract:
We propose WildFusion, a novel approach for 3D scene reconstruction in unstructured, in-the-wild environments using multimodal implicit neural representations. WildFusion integrates signals from LiDAR, RGB camera, contact microphones, tactile sensors, and IMU. This multimodal fusion generates comprehensive, continuous environmental representations, including pixel-level geometry, color, semantics, and traversability. Through real-world experiments on legged robot navigation in challenging forest environments, WildFusion demonstrates improved route selection by accurately predicting traversability. Our results highlight its potential to advance robotic navigation and 3D mapping in complex outdoor terrains.

Abstract:
Despite recent advances in both model architectures and data augmentation, multimodal object detectors still barely outperform their LiDAR-only counterparts. This shortcoming has been attributed to a lack of sufficiently powerful multimodal data augmentation. To address this, we present SurfaceAug, a novel ground truth sampling algorithm. SurfaceAug pastes objects by resampling both images and point clouds, enabling object-level transformations in both modalities. We evaluate our algorithm by training a multimodal detector on KITTI and compare its performance to previous works. We show experimentally that SurfaceAug demonstrates promising improvements on car detection tasks.

Abstract:
One of the most appealing applications of drone swarms is drone light shows, in which a group of drones displays an animation by showing a sequence of light patterns in the sky. In this paper, we consider using drone swarms as video game platforms and utilize planning techniques to display pixels in animations correctly while providing a fast response to user inputs. We devise a new sampling algorithm to solve a contingency formation planning problem, which aims to find a contingency formation plan such that drones can always move to the correct positions to display every possible future frame regardless of the user inputs in the future. The algorithm provides interactivity by preemptively relocating hidden drones, which move in stealth mode to the locations of all possible future frames. Our experiments show that the size of the frame buffer and the ratio between the number of drones and the number of pixels can greatly affect the performance of our system.

Abstract:
Recently, Large Language Models (LLMs) have achieved remarkable success using in-context learning (ICL) in the language domain. However, leveraging the ICL capabilities within LLMs to directly predict robot actions remains largely unexplored. In this paper, we introduce RoboPrompt, a frame-work that enables off-the-shelf text-only LLMs to directly predict robot actions through ICL without training. Our approach first heuristically identifies keyframes that capture important moments from an episode. Next, we extract end-effector actions from these keyframes as well as the estimated initial object poses, and both are converted into textual descriptions. Finally, we construct a structured template to form ICL demonstrations from these textual descriptions and a task instruction. This enables an LLM to directly predict robot actions at test time. Through extensive experiments and analysis, RoboPrompt shows stronger performance over zero-shot and ICL baselines in simulated and real-world settings. Our project page is available at https://davidyyd.github.io/roboprompt.

Abstract:
Microswimmers are sub-millimeter swimming robots that show potential as a platform for controllable locomotion in applications, including targeted cargo delivery and minimally invasive surgery. To be viable for these target applications, microswimmers will eventually need to be able to navigate environments with dynamic fluid flows and forces. Experimental studies with microswimmers towards this goal are currently rare because of the difficulty of isolating intentional microswimmer locomotion from environment-induced motion. In this work, we present a method for measuring microswimmer locomotion within a complex flow environment using fiducial microspheres. By tracking the particle motion of ferromagnetic and non-magnetic polystyrene fiducial microspheres, we capture the effect of fluid flow and magnetic field gradients on microswimmer trajectories. We then determine the field-driven translation of these microswimmers relative to fluid flow and demonstrate the effectiveness of this method by illustrating the motion of multiple microswimmers through different flows.

Abstract:
We propose a novel method to enhance the accuracy of the Iterative Closest Point (ICP) algorithm by integrating altitude constraints from a barometric pressure sensor. While ICP is widely used in mobile robotics for Simultaneous Localization and Mapping (SLAM), it is susceptible to drift, especially in underconstrained environments such as vertical shafts. To address this issue, we propose to augment ICP with altimeter measurements, reliably constraining drifts along the gravity vector. To demonstrate the potential of altimetry in SLAM, we offer an analysis of calibration procedures and noise sensitivity of various pressure sensors, improving measurements to centimeter-level accuracy. Leveraging this accuracy, we propose a novel ICP formulation that integrates altitude measurements along the gravity vector, thus simplifying the optimization problem to 3-Degree Of Freedom (DOF). Experimental results from real-world deployments demonstrate that our method reduces vertical drift by 84% and improves overall localization accuracy compared to state-of-the-art methods in non-planar environments.

Abstract:
Precisely detecting which object a person is paying attention to is critical for human-robot interaction since it provides important cues for the next action from the human user. We propose an end-to-end approach for gaze target detection: predicting a head-target connection between individuals and the target image regions they are looking at. Most of the existing methods use independent components such as off-the-shelf head detectors or have problems in establishing associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as the explicit visual associations between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.

Abstract:
Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding; however, they still struggle with spatial understanding, which is fundamental to embodied AI. In this paper, we propose SpatialBot, a model designed to enhance spatial understanding by utilizing both RGB and depth images. To train VLMs for depth perception, we introduce the SpatialQA and SpatialQA \boldsymbolE datasets, which include multi-level depth-related questions spanning various scenarios and embodiment tasks. SpatialBench is also developed to comprehensively evaluate VLMs' spatial understanding capabilities across different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks, and embodied AI tasks demonstrate the remarkable improvements offered by SpatialBot. The model, code, and datasets are available at https://github.com/BAAI-DCAI/SpatialBot.

Abstract:
Visual-LiDAR odometry is a critical component for autonomous system localization, yet achieving high accuracy and strong robustness remains a challenge. Traditional approaches commonly struggle with sensor misalignment, fail to fully leverage temporal information, and require extensive manual tuning to handle diverse sensor configurations. To address these problems, we introduce DVLO4D, a novel visual-LiDAR odometry framework that leverages sparse spatial-temporal fusion to enhance accuracy and robustness. Our approach proposes three key innovations: (1) Sparse Query Fusion, which utilizes sparse LiDAR queries for effective multi-modal data fusion; (2) a Temporal Interaction and Update module that integrates temporally-predicted positions with current frame data, providing better initialization values for pose estimation and enhancing model's robustness against accumulative errors; and (3) a Temporal Clip Training strategy combined with a Collective Average Loss mechanism that aggregates losses across multiple frames, enabling global optimization and reducing the scale drift over long sequences. Extensive experiments on the KITTI and Argoverse Odometry dataset demonstrate the superiority of our proposed DVLO4D, which achieves state-of-the-art performance in terms of both pose accuracy and robustness. Additionally, our method has high efficiency, with an inference time of 82 ms, possessing the potential for the real-time deployment.

Abstract:
Off-road freespace detection is more challenging than on-road scenarios because of the blurred boundaries of traversable areas. Previous state-of-the-art (SOTA) methods employ multi-modal fusion of RGB images and LiDAR data. However, due to the significant increase in inference time when calculating surface normal maps from LiDAR data, multimodal methods are not suitable for real-time applications, particularly in real-world scenarios where higher FPS is required compared to slow navigation. This paper presents a novel RGB-only approach for off-road freespace detection, named ROD, eliminating the reliance on LiDAR data and its computational demands. Specifically, we utilize a pre-trained Vision Transformer (ViT) to extract rich features from RGB images. Additionally, we design a lightweight yet efficient decoder, which together improve both precision and inference speed. ROD establishes a new SOTA on ORFD and RELLIS-3D datasets, as well as an inference speed of 50 FPS, significantly outperforming prior models. Our code will be available at https://github.com/STLIFE97/offroad_roadseg.

Abstract:
LiDAR-camera fusion has gradually become the mainstream for the freespace detection in unstructured off-road environments. However, existing methods mainly use the traditional method to densify the sparse LiDAR data in the perspective view, which introduces noise and limits the representation ability. In this paper, we propose a lightweight end-to-end freespace detection network with cascaded LiDAR-camera fusion and multi-scale self-distillation. It first performs sparse freespace detection in the range view, and then projects the range-view features onto the perspective view and densifies them. The dense features obtained are fused with camera images to get the final freespace detection results. In our method, the cascaded fusion strategy reduces the impact of resolution differences between LiDAR point clouds and camera images, and the introduction of noise during the data densification process. The multi-scale self-distillation strategy distills knowledge from the LiDAR-camera fusion module to the perspective-view module to further improve the freespace detection performance using LiDAR data only. Experiments on the off-road ORFD datasets demonstrate the effectiveness of the proposed cascaded fusion and multi-scale self-distillation strategies, our method obtains 93.4% IoU at speeds of more than 50 Hz. It also achieves state-of-the-art performance among all LiDAR-based freespace detection methods.

Abstract:
While LiDAR and cameras are becoming ubiquitous for unmanned aerial vehicles (UAVs) but can be ineffective in challenging environments, 4D millimeter-wave (MMW) radars that can provide robust 3D ranging and Doppler velocity measurements are less exploited for aerial navigation. In this paper, we develop an efficient and robust error-state Kalman filter (ESKF)-based radar-inertial navigation for UAVs. The key idea of the proposed approach is the point-to-distribution radar scan matching to provide motion constraints with proper uncertainty qualification, which are used to update the navigation states in a tightly coupled manner, along with the Doppler velocity measurements. Moreover, we propose a robust keyframe-based matching scheme against the prior map (if available) to bound the accumulated navigation errors and thus provide a radar-based global localization solution with high accuracy. Extensive real-world experimental validations have demonstrated that the proposed radar-aided inertial navigation outperforms state-of-the-art methods in both accuracy and robustness.

Abstract:
Autonomous drone racing requires powerful perception, planning, and control and has become a benchmark and test field for autonomous, agile flight. Existing work usually assumes static race tracks with known maps, which enables offline planning of time-optimal trajectories, performing localization to the gates to reduce the drift in visual-inertial odometry (VIO) for state estimation or training learning-based methods for the particular race track and operating environment. In contrast, many real-world tasks like disaster response or delivery need to be performed in unknown and dynamic environments. To make drone racing more robust against unseen environments and moving gates, we propose a control algorithm that operates without a race track map or VIO, relying solely on monocular measurements of the line of sight to the gates. For this purpose, we adopt the law of proportional navigation (PN) to accurately fly through the gates despite gate motions or wind. We formulate the PN-informed vision-based control problem for drone racing as a constrained optimization problem and derive a closed-form optimal solution. Through simulations and real-world experiments, we demonstrate that our algorithm can navigate through moving gates at high speeds while being robust to different gate movements, model errors, wind, and delays.

Abstract:
Autonomous aerial robots are becoming commonplace in our lives. Hands-on aerial robotics courses are pivotal in training the next-generation workforce to meet the growing market demands. Such an efficient and compelling course depends on a reliable testbed. In this paper, we present VizFlyt, an open-source perception-centric Hardware-In- The-Loop (HITL) photorealistic testing framework for aerial robotics courses. We utilize pose from an external localization system to hallucinate real-time and photorealistic visual sensors using 3D Gaussian Splatting. This enables stress-free testing of autonomy algorithms on aerial robots without the risk of crashing into obstacles. We achieve over 100Hz of system update rate. Lastly, we build upon our past experiences of offering hands-on aerial robotics courses and propose a new open-source and open-hardware curriculum based on VizFlyt for the future. We test our framework on various course projects in real-world HITL experiments and present the results showing the efficacy of such a system and its large potential use cases. Code, datasets, hardware guides and demo videos are available at https://pear.wpi.edu/research/vizfiyt.html

Abstract:
We present a diffusion-based approach to quadrupedal locomotion that simultaneously addresses the limitations of learning and interpolating between multiple skills (modes) and of offline adapting to new locomotion behaviours after training. This is the first framework to apply classifier-free guided diffusion to quadruped locomotion and demonstrate its efficacy by extracting goal-conditioned behaviour from an originally unlabelled dataset. We show that these capabilities are compatible with a multi-skill policy and can be applied with little modification and minimal compute overhead, i.e., running entirely on the robot's onboard CPU. We verify the validity of our approach with hardware experiments on the ANYmal quadruped platform.

Abstract:
Humanoid whole-body control requires adapting to diverse tasks such as navigation, loco-manipulation, and tabletop manipulation, each demanding a different mode of control. For example, navigation relies on root velocity or position tracking, while tabletop manipulation prioritizes upper-body joint angle tracking. Existing approaches typically train individual policies tailored to a specific command space, limiting their transferability across modes. We present the key insight that full-body kinematic motion imitation can serve as a common abstraction for all these tasks and provide general-purpose motor skills for learning multiple modes of whole-body control. Building on this, we propose HOVER (Humanoid Versatile Controller), a multi-mode policy distillation framework that consolidates diverse control modes into a unified policy. HOVER enables seamless transitions between control modes while preserving the distinct advantages of each, offering a robust and scalable solution for humanoid control across a wide range of modes. By eliminating the need for policy retraining for each control mode, our approach improves efficiency and flexibility for future humanoid applications.

Abstract:
In endovascular surgery, the precise identification of catheters and guidewires in X-ray images is essential for reducing intervention risks. However, accurately segmenting catheter and guidewire structures is challenging due to the limited availability of labeled data. Foundation models offer a promising solution by enabling the collection of similar-domain data to train models whose weights can be fine-tuned for downstream tasks. Nonetheless, large-scale data collection for training is constrained by the necessity of maintaining patient privacy. This paper proposes a new method to train a foundation model in a decentralized federated learning setting for endovascular intervention. To ensure the feasibility of the training, we tackle the unseen data issue using differentiable Earth Mover's Distance within a knowledge distillation frame-work. Once trained, our foundation model's weights provide valuable initialization for downstream tasks, thereby enhancing task-specific performance. Intensive experiments show that our approach achieves new state-of-the-art results, contributing to advancements in endovascular intervention and robotic-assisted endovascular surgery, while addressing the critical issue of data sharing in the medical domain.

Abstract:
Maps of dynamics are effective representations of motion patterns learned from prior observations, with recent research demonstrating their ability to enhance various downstream tasks such as human-aware robot navigation, long-term human motion prediction, and robot localization. Current advancements have primarily concentrated on methods for learning maps of human flow in environments where the flow is static, i.e., not assumed to change over time. In this paper we propose an online update method of the CLiFF-map (an advanced map of dynamics type that models motion patterns as velocity and orientation mixtures) to actively detect and adapt to human flow changes. As new observations are collected, our goal is to update a CLiFF-map to effectively and accurately integrate them, while retaining relevant historic motion patterns. The proposed online update method maintains a probabilistic representation in each observed location, updating parameters by continuously tracking sufficient statistics. In experiments using both synthetic and real-world datasets, we show that our method is able to maintain accurate representations of human motion dynamics, contributing to high performance flow-compliant planning downstream tasks, while being orders of magnitude faster than the comparable baselines.

Abstract:
Learning to manipulate dynamic and deformable objects from a single demonstration video holds great promise in terms of scalability. Previous approaches have predominantly focused on either replaying object relationships or actor trajectories. The former often struggles to generalize across diverse tasks, while the latter suffers from data inefficiency. Moreover, both methodologies encounter challenges in capturing invisible physical attributes, such as forces. In this paper, we propose to interpret video demonstrations through a series of Parameterized Symbolic Abstraction Graphs (PSAGs), where nodes represent objects and edges denote relationships between objects. We further ground geometric constraints through simulation to estimate non-geometric, visually imperceptible attributes. The augmented PSAGs are then applied in real robot experiments. Our approach has been validated across a range of tasks, such as Cutting Avocado, Cutting Vegetable, Pouring Liquid, Rolling Dough, and Slicing Pizza. We demonstrate successful generalization to novel objects with distinct visual and physical properties. For visualizations of the learned policies please check: https://www.jianrenw.com/PSAG/

Abstract:
Learning-based street scene semantic understanding in autonomous driving (AD) has advanced significantly recently, but the performance of the AD model is heavily dependent on the quantity and quality of the annotated training data. However, traditional manual labeling involves high cost to annotate the vast amount of required data for training robust model. To mitigate this cost of manual labeling, we propose a Label Anything Model (denoted as LAM), serving as an interpretable, high-fidelity, and prompt-free data annotator. Specifically, we firstly incorporate a pretrained Vision Transformer (ViT) to extract the latent features. On top of ViT, we propose a semantic class adapter (SCA) and an optimization-oriented unrolling algorithm (OptOU), both with a quite small number of trainable parameters. SCA is proposed to fuse ViT-extracted features to consolidate the basis of the subsequent automatic annotation. OptOU consists of multiple cascading layers and each layer contains an optimization formulation to align its output with the ground truth as closely as possible, though which OptOU acts as being interpretable rather than learning-based blackbox nature. In addition, training SCA and OptOU requires only a single pre-annotated RGB seed image, owing to their small volume of learnable parameters. Extensive experiments clearly demonstrate that the proposed LAM can generate high-fidelity annotations (almost 100% in mIoU) for multiple real-world datasets (i.e., Camvid, Cityscapes, and Apolloscapes) and CARLA simulation dataset.

Abstract:
This paper introduces a 3D shape reconstruction technique for interventional devices in endovascular surgery, utilizing a flexible magnetic tip guidewire that preserves the fundamental attributes of standard guidewires. We developed a model that correlates the magnetic tip's shape with the surrounding magnetic field distribution to estimate the shape through the magnetic field. The inherently nonlinear relationship between the magnetic field distribution and the shape of the magnetic guidewire presents challenges for direct shape estimation. To address this, we incorporated image and physical constraints to streamline the estimation process. This method shows high accuracy and stability in shape estimation, with root mean square error (RMSE) and Hausdorff distance (HD) both below 1 mm, which is better than other existing estimation methods. Notably, the interventional guidewire requires no embedded sensors or wiring, and the fluoroscopic images used are standard in clinical practice. The reconstruction process is non-disruptive to clinical procedures, suggesting broad applicability in vascular interventional navigation.

Abstract:
Untethered miniature surgical tools could significantly reduce invasiveness and enhance patient outcomes in robot-assisted laparoscopic surgical procedures. This paper demonstrates the feasibility of performing semi-autonomous suturing tasks using an untethered magnetic needle controlled by an external electromagnetic manipulator. The electromagnetic manipulator can generate magnetic torques and gradient-based pulling forces to actuate the magnetic needle. Here, we develop and implement a semi-autonomous 2.5D control method for controlling the in-plane position and both in-plane and out-of-plane orientations of a magnetic needle for suturing on tissue-mimicking agar gel phantoms. The method includes recognizing needles and incisions, planning trajectory, and performing suturing with visual feedback control. We conduct two mock suturing tasks using both continuous and interrupted techniques on 1% agar gel phantoms with 2 cm and 3 cm incision sizes. The results demonstrate precise needle control, with an average root-mean-square position error of 1.01 mm and 1.12 mm across tasks. The system also achieved submillimeter-level suture spacing accuracy, comparable to surgeons using state-of-the-art surgical robots. These findings highlight the feasibility of using untethered magnetic suture needles for minimally invasive suturing procedures.

Abstract:
LLMs have shown promising results in task planning due to their strong natural language understanding and reasoning capabilities. However, issues such as hallucinations, ambiguities in human instructions, environmental constraints, and limitations in the executing agent's capabilities often lead to flawed or incomplete plans. This paper proposes MultiTalk, an LLM-based task planning methodology that addresses these issues through a framework of introspective and extrospective dialogue loops. This approach helps ground generated plans in the context of the environment and the agent's capabilities, while also resolving uncertainties and ambiguities in the given task. These loops are enabled by specialized systems designed to extract and predict task-specific states, and flag mismatches or misalignments among the human user, the LLM agent, and the environment. Effective feedback pathways between these systems and the LLM planner foster meaningful dialogue. The efficacy of this methodology is demonstrated through its application to robotic manipulation tasks. Experiments and ablations highlight the robustness and reliability of our method, and comparisons with baselines further illustrate the superiority of MultiTalk in task planning for embodied agents. Project Website: https://llm-multitalk.github.io/

Abstract:
Navigating in human-filled public spaces is a critical challenge for deploying autonomous robots in real-world environments. This paper introduces NaviDIFF, a novel Hamiltonian-constrained socially-aware navigation framework designed to address the complexities of human-robot interaction and socially-aware path planning. NaviDIFF integrates a port-Hamiltonian framework to model dynamic physical interactions and a diffusion model to manage uncertainty in human-robot cooperation. The framework leverages a spatial-temporal transformer to capture social and temporal dependencies, enabling more accurate spatial-temporal environmental dynamics understanding and port-Hamiltonian physical interactive process construction. Additionally, reinforcement learning from human feedback is employed to fine-tune robot policies, ensuring adaptation to human preferences and social norms. Extensive experiments demonstrate that NaviDIFF outperforms state-of-the-art methods in social navigation tasks, offering improved stability, efficiency, and adaptability11The experimental videos and additional information about this work can be found at: https://sites.google.com/view/NaviDIFF.

Abstract:
In this paper, we present ZSORN, a novel language-driven object-centric image representation for object retrieval and navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38 - 13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and real world, showing 5% and 16.67% improvement in terms of navigation success rate, respectively.

Abstract:
This paper presents an Impedance Primitive-augmented hierarchical reinforcement learning framework for efficient robotic manipulation in sequential contact tasks. We leverage this hierarchical structure to sequentially execute behavior primitives with variable stiffness control capabilities for contact tasks. Our proposed approach relies on three key components: an action space enabling variable stiffness control, an adaptive stiffness controller for dynamic stiffness adjustments during primitive execution, and affordance coupling for efficient exploration while encouraging compliance. Through comprehensive training and evaluation, our framework learns efficient stiffness control capabilities and demonstrates improvements in learning efficiency, compositionality in primitive selection, and success rates compared to the state-of-the-art. The training environments include block lifting, door opening, object pushing, and surface cleaning. Real world evaluations further confirm the framework's sim2real capability. This work lays the foundation for more adaptive and versatile robotic manipulation systems, with potential applications in more complex contact-based tasks.

Abstract:
The ability to predict trajectories of surrounding agents and obstacles is a crucial component in many robotic applications. Data-driven approaches are commonly adopted for state prediction in scenarios where the underlying dynamics are unknown. However, the performance, reliability, and uncertainty of data-driven predictors become compromised when encountering out-of-distribution observations relative to the training data. In this paper, we introduce a Plug-and-Play Physics-Informed Machine Learning (PnP-PIML) framework to address this challenge. Our method employs conformal prediction to identify outlier dynamics and, in that case, switches from a nominal predictor to a physics-consistent model, namely distributed Port-Hamiltonian systems (dPHS). We leverage Gaussian processes to model the energy function of the dPHS, enabling not only the learning of system dynamics but also the quantification of predictive uncertainty through its Bayesian nature. In this way, the proposed framework produces reliable physics-informed predictions even for the out-of-distribution scenarios.

Abstract:
Point cloud registration aims to provide estimated transformations to align point clouds, which plays a crucial role in pose estimation of various navigation systems, such as surgical guidance systems and autonomous vehicles. Despite the impressive performance of recent models on benchmark datasets, many rely on complex modules like KPConv and Transformers, which impose significant computational and memory demands. These requirements hinder their practical application, particularly in resource-constrained environments such as mobile robotics. In this paper, we propose a novel point cloud registration network that leverages a pure MLP architecture, constructing geometric information offline. This approach eliminates the computational and memory burdens associated with traditional complex feature extractors and significantly reduces inference time and resource consumption. Our method is the first to replace 3D coordinate inputs with offline-constructed geometric encoding, improving generalization and stability, as demonstrated by Maximum Mean Discrepancy (MMD) comparisons. This efficient and accurate geometric representation marks a significant advancement in point cloud analysis, particularly for applications requiring fast and reliability.

Abstract:
While global point cloud registration systems have advanced significantly in all aspects, many studies have focused on specific components, such as feature extraction, graph-theoretic pruning, or pose solvers. In this paper, we take a holistic view on the registration problem and develop an open-source and versatile C++ library for point cloud registration, called KISS-Matcher. KISS-Matcher combines a novel feature detector, Faster-PFH, that improves over the classical fast point feature histogram (FPFH). Moreover, it adopts a k-core-based graph-theoretic pruning to reduce the time complexity of rejecting outlier correspondences. Finally, it combines these modules in a complete, user-friendly, and ready-to-use pipeline. As verified by extensive experiments, KISS-Matcher has superior scalability and broad applicability, achieving a substantial speed-up compared to state-of-the-art outlier-robust registration pipelines while preserving accuracy. Our code will be available at https://github.com/MIT-SPARK/KISS-Matcher.

Abstract:
Haptic feedback enhances user interaction with systems by adding the sense of touch, thereby improving immersion and realism in applications like virtual reality (VR), augmented reality (AR), video games, education, and robotic surgery. To address the challenges in mechanically actuated haptic feedback devices such as limited mobility, mechanical wear, and complex mechanical structures, several research sought to develop electromagnetic haptic feedback systems. However, they also suffer from the rapid decay of magnetic force with distance, thus restricting their workspace size and application potential. In this paper, we propose a novel electromagnetic haptic feedback device that is actuated by magnetic torque instead of magnetic force. By controlling the magnetic torque, which decays with distance only at a thirdorder rate, our device achieves a large workspace—a 200-mm-diameter hemisphere—while still delivering perceptible realtime haptic feedback within the hemisphere. While using the device, the user wears a lightweight haptic thimble housing a permanent magnet on their finger, which enables 2 degree-offreedom (DoF) haptic feedback. A 13-coil electromagnet array serves as the source of the magnetic field. A mathematical model is proposed to determine the currents in the electromagnet array to generate the desired amount of haptic feedback torque. We conducted two experiments to prove the viability of the device. A haptic feedback accuracy experiment was conducted and validated the device's ability to generate sufficient torque within a large workspace. A user evaluation experiment showed that the device achieved an overall accuracy of 77.86% in a virtual enclosure exploration task, indicating its effectiveness and usability in haptic feedback applications.

Abstract:
Keypoint detection and local feature description are fundamental tasks in robotic perception, critical for applications such as SLAM, robot localization, feature matching, pose estimation, and 3D mapping. While existing methods predominantly operate on RGB images, we propose a novel network that directly processes raw images, bypassing the need for the Image Signal Processor (ISP). This approach significantly reduces hardware requirements and memory consumption, which is crucial for robotic vision systems. Our method introduces two custom-designed convolutional kernels capable of performing convolutions directly on raw images, preserving inter-channel information without converting to RGB. Experimental results show that our network outperforms existing algorithms on raw images, achieving higher accuracy and stability under large rotations and scale variations. This work represents the first attempt to develop a keypoint detection and feature description network specifically for raw images, offering a more efficient solution for resource-constrained environments.

Abstract:
Assembly is a crucial skill for robots in both modern manufacturing and service robotics. However, mastering transferable insertion skills that can handle a variety of high-precision assembly tasks remains a significant challenge. This paper presents a novel framework that utilizes diffusion models to generate 6D wrench for high-precision tactile robotic insertion tasks. It learns from demonstrations performed on a single task and achieves a zero-shot transfer success rate of 95.7% across various novel high-precision tasks. Our method effectively inherits the self-adaptability demonstrated by our previous work. In this framework, we address the frequency misalignment between the diffusion policy and the real-time control loop with a dynamic system-based filter, significantly improving the task success rate by 9.15%. Furthermore, we provide a practical guideline regarding the trade-off between diffusion models' inference ability and speed.

Abstract:
Robotic pushing is one of the intuitive nonprehensile manipulation skills that can handle ungraspable objects without any complex task-specific tools. In this paper, we proposed a goal-driven accurate robotic pushing framework to achieve the robotic pushing tasks in practice that can operate under uncertain object properties. We employed a model predictive path integral (MPPI) as a goal-driven pushing controller building upon our prior work to operate pushing tasks under uncertain object properties. Unlike our prior work, the proposed framework can push the object toward the goal pose without predefined trajectories. The results of the numerical experiments demonstrated that the proposed framework can accomplish the pushing task with a significantly shorter total length, smaller total step, and a higher success rate even though the model parameters are unknown. Moreover, we demonstrated the proposed framework also works well in the real world through real-robot demonstrations.

Abstract:
In complex manipulation tasks, e.g., manipulation by pivoting, the motion of the object being manipulated has to satisfy path constraints that can change during the motion. Therefore, a single grasp may not be sufficient for the entire path, and the object may need to be regrasped. Additionally, geometric data for objects from a sensor are usually available in the form of point clouds. The problem of computing grasps and regrasps from point-cloud representation of objects for complex manipulation tasks is a key problem in endowing robots with manipulation capabilities beyond pick-and-place. In this paper, we formalize the problem of grasping/regrasping for complex manipulation tasks with objects represented by (partial) point clouds and present an algorithm to solve it. We represent a complex manipulation task as a sequence of constant screw motions. Using a manipulation plan skeleton as a sequence of constant screw motions, we use a grasp metric to find graspable regions on the object for every constant screw segment. The overlap of the graspable regions for contiguous screws are then used to determine when and how many times the object needs to be regrasped. We present experimental results on point cloud data collected from RGB-D sensors to illustrate our approach.

Abstract:
Visual uncertainties such as occlusions, lack of texture, and noise present significant challenges in obtaining accurate kinematic models for safe robotic manipulation. We introduce a probabilistic real-time approach that leverages the human hand as a prior to mitigate these uncertainties. By tracking the constrained motion of the human hand during manipulation and explicitly modeling uncertainties in visual observations, our method reliably estimates an object's kinematic model online. We validate our approach on a novel dataset featuring challenging objects that are occluded during manipulation and offer limited articulations for perception. The results demonstrate that by incorporating an appropriate prior and explicitly accounting for uncertainties, our method produces accurate estimates, outperforming two recent baselines by 195 % and 140 %, respectively. Furthermore, we demonstrate that our approach's estimates are precise enough to allow a robot to manipulate even small objects safely.

Abstract:
Many robotic applications, such as sanding, polishing, wiping and sensor scanning, require a manipulator to dexterously cover a surface using its end-effector. In this paper, we provide an efficient and effective coverage path planning approach that leverages a manipulator's redundancy and task tolerances to minimize costs in joint space. We formulate the problem as a Generalized Traveling Salesman Problem and hierarchically streamline the graph size. Our strategy is to identify guide paths that roughly cover the surface and accelerate the computation by solving a sequence of smaller problems. We demonstrate the effectiveness of our method through a simulation experiment and an illustrative demonstration using a physical robot.

Abstract:
Quantifying sperm flagellar beating behavior (e.g., beating amplitude, frequency, and wavelength) plays a crucial role in biological research, clinical diagnostics, and the design of sperm-inspired microrobots. However, existing computational methods struggle to accurately and efficiently analyze the highly dynamic, complex, and fine structures of sperm flagella, especially when portions of the flagellum become invisible due to three-dimensional out-of-focus beating. This paper proposes an automated high-throughput tool for quantitative analysis of sperm flagellar beating. The core innovation is continuous convolution (CConv), which adaptively captures the irregular, time-varying patterns of sperm flagella while ensuring continuity in segmentation outputs, even in the presence of locally invisible regions caused by out-of-focus motion. CConv can be integrated into various neural network architectures as a plug-and-play module. Extensive experiments demonstrate that integrating CConv consistently improves the accuracy and continuity of flagella segmentation across different networks. Furthermore, utilizing a curvature-based approach, we quantified key flagellar beating parameters, including length, amplitude, frequency, and wavelength. Applying the high-throughput tool on 1200 sperm revealed that sperm from fertile donors had significantly higher flagellar beating frequency than sperm from infertile patients. The proposed automated tool unlocks high-throughput, quantitative analysis of sperm flagellar beating, showing the potential for applications in reproductive biology and engineering research. The codes and datasets will be released at https://github.com/Goldfish-Yu/CConv.

Abstract:
Interactable objects are ubiquitous in our daily lives. Recent advances in 3D generative models make it possible to automate the modeling of these objects, benefiting a range of applications from 3D printing to the creation of robot simulation environments. However, while significant progress has been made in modeling 3D shapes and appearances, modeling object physics, particularly for interactable objects, remains challenging due to the physical constraints imposed by interpart motions. In this paper, we tackle the problem of physically plausible part completion for interactable objects, aiming to generate 3D parts that not only fit precisely into the object but also allow smooth part motions. To this end, we propose a diffusion-based part generation model that utilizes geometric conditioning through classifier-free guidance and formulates physical constraints as a set of stability and mobility losses to guide the sampling process. Additionally, we demonstrate the generation of dependent parts, paving the way toward sequential part generation for objects with complex part-whole hierarchies. Experimentally, we introduce a new metric for measuring physical plausibility based on motion success rates. Our model outperforms existing baselines over shape and physical metrics, especially those that do not adequately model physical constraints. We also demonstrate our applications in 3D printing, robot manipulation, and sequential part generation, showing our strength in realistic tasks with the demand for high physical plausibility.

Abstract:
This paper introduces the design, fabrication, and autonomous control of MicroASV, a low-cost, centimeter-scale autonomous surface Vehicle (ASV). MicroASV has a square footprint with a side length of 85 mm. Its propulsion system consists of four custom water jets arranged in a “Diamond” shaped actuator configuration, powered by magnetically coupled brushless motors. This setup allows for complete 2D mobility, enabling forward and backward motion, lateral translation, and in-place rotation. The MicroASV is built using commercially available motors and 3D-printed components, creating a modular, appendage-free structure that is simple to assemble. An onboard camera and inertial measurement unit (IMU) are integrated to enable real-time localization, with position and heading controllers developed to provide autonomous feedback control. Preliminary experiments validate the platform's effectiveness in motion, sensing, and control, establishing MicroASV as a valuable tool for studying centimeter-scale ASV control, both individually and in collective swarm operations.

Abstract:
Recognizing objects in the 3D world is a significant challenge for robotics. Due to the lack of high-quality 3D data, directly training a general-purpose segmentation model in 3D is almost infeasible. Meanwhile, vision foundation models (VFM) have revolutionized the 2D computer vision field with outstanding performance, making the use of VFM to assist 3D perception a promising direction. However, most existing VFM-assisted methods do not effectively address the 2D-3D inconsistency problem or adequately provide corresponding semantic information for 3D instance objects. To address these two issues, this paper introduces a novel framework for 3D zero-shot instance segmentation called RE0. For the given 3D point clouds and multi-view RGB-D images with poses, we leverage the 3D geometric information, projection relationships, and CLIP semantic features. Specifically, we utilize CropFormer to extract mask information from multi-view posed images, combined with projection relationships to assign point-level labels to each point in the point cloud, and achieve instance-level consistency through inter-frame information interaction. Then, we employ projection relationships again to assign CLIP semantic features to the point cloud and achieve aggregation of small-scale point clouds. Notably, RE0 does not require any additional training and can be implemented by supporting only one inference of CropFormer and one inference of CLIP. Experiments on ScanNet200 and ScanNet++ show that our method achieves higher quality segmentation than the previous zero-shot methods. Our codes and demos are available at https://recognizeeverything.github.io/, with only one RTX 3090 GPU required.

Abstract:
Availability of datasets is a strong driver for research on 3D semantic understanding, and whilst obtaining unlabeled 3D point cloud data is straightforward, manually annotating this data with semantic labels is time-consuming and costly. Recently, Vision Foundation Models (VFMs) enable open-set semantic segmentation on camera images, potentially aiding automatic labeling. However, VFMs for 3D data have been limited to adaptations of 2D models, which can introduce inconsistencies to 3D labels. This work introduces Label Any Pointcloud (LeAP), leveraging 2D VFMs to automatically label multi-frame 3D data with any set of classes in any kind of application whilst ensuring label consistency. Using a Bayesian update, point labels are combined into voxels to improve spatio-temporal consistency. A novel 3D Consistency Network (3D-CN) exploits 3D information to further improve label quality. Through various experiments, we show that our method can generate high-quality 3D semantic labels across diverse fields without any manual labeling. Further, models adapted to new domains using our labels show a significant mIoU increase in semantic segmentation tasks.

Abstract:
Rip currents are a major hazard on beaches worldwide, and their strong, offshore-directed currents can place even experienced beachgoers at risk of drowning. While it is intuitive to consider developing an automated rip current detection system to assist lifeguards in protecting beachgoers, rip current detection is in its infancy due to the lack of high-quality large-scale annotated rip current datasets. Also, the collection and annotation of rip current images require expert knowledge, which makes it more difficult to build datasets. So, this paper proposes a GAN-based rip current data augmentation method, RipGAN, to improve the performance of rip current detectors by increasing representative training data. To create new training images, RipGAN, has two branches. One is a texture generator that enriches the pattern and texture details of waves, making the image more realistic. The other is a rip generator based on FFFM-Unet. FFFM (Fast Fourier Fusion Module) uses Fast Fourier convolution to fuse the features from the low and the high layers, so as to further optimise the generated image. Furthermore, we trained Yolov8, YOLOv10, DINO and RT-DETR as rip current detectors to prove the effectiveness of RipGAN. The detectors' rnAP 50:95 improved by 2.67% on the test set and AP50 by 4.93% on real-scene videos, outperforming other data augmentation methods. Besides, abundant ablation studies have been conducted to further evaluate each component of RipGAN.

Abstract:
Owing to their compact structure, high stability, and low cost, Indirect Time-of-Fligh (IToF) cameras have gained increasing attention in the fields of robotics and automation. However, in real-world scenarios, IToF cameras are affected by multipath interference, which severely degrades imaging quality. Existing learning-based methods for multipath interference correction are all based on CNN architectures and rely on synthetic datasets, leading to poor generalization in real-world scenarios. We proposed an efficient and accurate real data collection scheme and explored the application of Transformer and Mamba in multipath interference correction tasks. Additionally, we introduced a cross-propagation network that integrates Mamba and CNN modules, reducing system complexity to linear levels while achieving superior multipath interference correction compared to state-of-the-art methods.

Abstract:
Clear, interpretable instructions are invaluable for complex tasks, helping to clarify goals and anticipate necessary steps. In this work, we propose a robot learning framework for communicating, planning, and executing a wide range of tasks, dubbed This&That. This&That solves general tasks by leveraging video generative models, which, through training on internet-scale data, contain rich physical and semantic context. Through this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video gen-eration that respects user intent, and 3) translating visual plans into robot actions. This& That adds gesture conditioning alongside language to generate video predictions as a suc-cinct and unambiguous alternative to existing language-only methods, especially in complex and uncertain environments. These video predictions are then fed into a behavior cloning architecture dubbed Diffusion Video to Action (DiVA), which outperforms prior state-of-the-art behavior cloning and video-based planning methods by substantial margins. Project web-site: https://this-and-that-vid.github.io/this-and-thatl.

Abstract:
This paper extends the family of gap-based local planners to unknown dynamic environments through generating provably collision-free properties for hierarchical navigation systems. Existing perception-informed local planners that operate in dynamic environments rely on emergent or empirical robustness for collision avoidance as opposed to performing formal analysis of dynamic obstacles. In addition to this, the obstacle tracking that is performed in these existent planners is often achieved with respect to a global inertial frame, subjecting such tracking estimates to transformation errors from odometry drift. The proposed local planner, dynamic gap, shifts the tracking paradigm to modeling how the free space, represented as gaps, evolves over time. Gap crossing and closing conditions are developed to aid in determining the feasibility of passage through gaps, and a breadth of simulation benchmarking is performed against other navigation planners in the literature where the proposed dynamic gap planner achieves the highest success rate out of all planners tested in all environments.

Abstract:
Thanks to the high robustness of 4D millimeterwave radar in various environments, it has been widely applied in the field of autonomous driving. Recent research has increasingly focused on utilizing raw data, as a substitute for the sparse and noisy point cloud data. However, these approaches have not fully exploited the Doppler features present in the raw data. In this paper, we introduce the Doppler Former (DPF) module to efficiently extract velocity information from the target environment. DPF can be seamlessly integrated into most radar perception backbone and enhance their performance in downstream tasks. Additionally, we propose a new backbone, Fully Complex Convolutional Network (FCCN), which is more suitable for raw data. By incorporating the DPF module into FCCN, we achieved state-of-the-art (SOTA) performance on the RADIal dataset, with code available at https://github.com/coconut-zs/Fvidar-DopplerFormer.

Abstract:
Robots are often employed in hazardous or inaccessible environments, such as disaster sites, extraterrestrial terrains, agricultural fields, and ocean floors. Autonomous operation is crucial in these scenarios to reduce reliance on human operators and enable real-time decision-making. However, robots must balance multiple, often conflicting, objectives. These objectives are subject to change based on new data or evolving conditions. This paper presents a novel approach to dynamic multi-objective trajectory planning. The proposed method leverages the boundary intersection decomposition technique to adaptively plan trajectories that balance multiple evolving objectives. Our approach ensures efficient and effective exploration by continuously optimizing the trade-offs between changing objectives. We show that our method performs on average 34 % better in terms of solution quality on the dynamic multi-objective trajectory planning problem as compared to prior work.

Abstract:
Generalization in robotic manipulation remains a critical challenge, particularly when scaling to new environments with limited demonstrations. This paper introduces CAGE, a novel robotic manipulation policy designed to overcome these generalization barriers by integrating the pretrained visual representation with causal attention mechanism. CAGE utilizes the powerful feature extraction capabilities of the vision foundation model DINOv2, combined with LoRA fine-tuning for robust environment understanding. The policy further employs a causal perceiver for effective token compression and a diffusion-based action head with attention to enhance task-specific fine-grained conditioning. With as few as 50 demonstrations from a single training environment, CAGE achieves robust generalization across diverse visual changes in objects, backgrounds, and viewpoints. Extensive experiments validate that CAGE significantly outperforms existing state-of-the-art RGB/RGB-D-based approaches in various manipulation tasks, especially under large distribution shifts. In similar environments, CAGE offers an average of 42 % increase in task completion rate. While all baselines fail in unseen environments, CAGE manages to obtain a 43 % completion rate and a 51 % success rate in average, marking a substantial advancement toward the practical deployment of robots in real-world settings. Project website: cage-policy.github.io.

Abstract:
Traffic accidents present complex challenges for autonomous driving, often featuring unpredictable scenarios that hinder accurate system interpretation and responses. Nonetheless, prevailing methodologies fall short in elucidating the causes of accidents and proposing preventive measures due to the paucity of training data specific to accident scenarios. In this work, we introduce AVD2 (Accident Video Diffusion for Accident Video Description), a novel framework that enhances accident scene understanding by generating accident videos that aligned with detailed natural language descriptions and reasoning, resulting in the contributed EMM-AU (Enhanced Multi-Modal Accident Video Understanding) dataset. Empirical results reveal that the integration of the EMM-AU dataset establishes state-of-the-art performance across both automated metrics and human evaluations, markedly advancing the domains of accident analysis and prevention. Project resources are available at https://an-answer-tree.github.io

Abstract:
In recent years, 3D scene graphs have become a critical tool in robotics and computer vision for enabling systems to understand both the geometric and semantic aspects of their surroundings. These data structures represent spatial and semantic relationships between objects in a three-dimensional environment, supporting tasks like navigation, object manipulation, and scene understanding. This paper presents a real-time pipeline for 3D scene graph generation that offers flexibility in image segmentation techniques while incorporating room classification that is based on a Random Forest model. Our work enables robots to dynamically update their understanding of complex and large-scale environments in real-time. We evaluate our approach systematically on a dataset and in a real-life experiment. The results demonstrate the capability of running our solution at over 10 Hz on an Nvidia Jetson AGX Orin SoC while also scaling favorably in larger environments. Our proposed room classification approach predicts classes with an average accuracy of 80%.

Abstract:
Multimodal Large Language Models (MLLMs) have made significant progress in tasks such as image captioning and question answering. However, while these models can generate realistic captions, they often struggle with providing precise instructions, particularly when it comes to localizing and disambiguating objects in complex 3D environments. This capability is critical as MLLMs become more integrated with collaborative robotic systems. In scenarios where a target object is surrounded by similar objects (distractors), robots must deliver clear, spatially-aware instructions to guide humans effectively. We refer to this challenge as contextual object localization and disambiguation, which imposes stricter constraints than conventional 3D dense captioning, especially regarding ensuring target exclusivity. In response, we propose simple yet effective techniques to enhance the model's ability to localize and disambiguate target objects. Our approach not only achieves state-of-the-art performance on conventional metrics that evaluate sentence similarity, but also demonstrates improved 3D spatial understanding through 3D visual grounding model. https://birdy666.github.io/projects/3d_spatial_understanding_in_mllms/

Abstract:
Simulations are highly valuable in marine robotics, offering a cost-effective and controlled environment for testing in the challenging conditions of underwater and surface operations. Given the high costs and logistical difficulties of real-world trials, simulators capable of capturing the operational conditions of subsea environments have become key in developing and refining algorithms for remotely-operated and autonomous underwater vehicles. This paper highlights recent enhancements to the Stonefish simulator, an advanced open-source platform supporting development and testing of marine robotics solutions. Key updates include a suite of additional sensors, such as an event-based camera, a thermal camera, and an optical flow camera, as well as, visual light communication, support for tethered operations, improved thruster modelling, more flexible hydrodynamics, and enhanced sonar accuracy. These developments and an automated annotation tool significantly bolster Stonefish's role in marine robotics research, especially in the field of machine learning, where training data with a known ground truth is hard or impossible to collect. https://github.com/patrykcieslak/stonefish

Abstract:
As marine exploration becomes increasingly important, marine robots have been extensively studied in recent years. Despite some well-designed robots have already achieved to various successful missions, most existing robots struggle to adapt to diverse demands or tasks due to their fixed structure and complexity of the marine environment. To address these challenges, we present a novel reconfigurable marine robot named Sea-U-Whale. This system can dynamically adjust its actuator configuration in the marine environment, providing superior environmental adaptability, maneuverability, and ver-satile mobility. Considering the demands of unmanned ocean exploration, an active reconfiguration mechanism and three distinct vehicle modes are designed for optimal actuation in various marine scenarios. The multi-modal mobility of our system and its robust performance have been validated through extensive field tests and water tank experiments, demonstrating its potential in handling a wide range of mission profiles.

Abstract:
Grasp detection requires flexibility to handle objects of various shapes without relying on prior object knowledge, while also offering intuitive, user-guided control. In this paper, we introduce GraspSAM, an innovative extension of the Segment Anything Model (SAM) designed for prompt-driven and category-agnostic grasp detection. Unlike previous methods, which are often limited by small-scale training data, Grasp-SAM leverages SAM's large-scale training and prompt-based segmentation capabilities to efficiently support both target-object and category-agnostic grasping. By utilizing adapters, learnable token embeddings, and a lightweight modified decoder, GraspSAM requires minimal fine-tuning to integrate object segmentation and grasp prediction into a unified frame-work. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including Jacquard, Grasp-Anything, and Grasp-Anything++. Extensive experiments demonstrate GraspSAM's flexibility in handling different types of prompts (such as points, boxes, and language), highlighting its robustness and effectiveness in real-world robotic applications. Robot demonstrations, additional results, and code can be found at https://gistailab.github.io/GraspSAM/.

Abstract:
Active perception approaches select future viewpoints by using some estimate of the information gain. An inaccurate estimate can be detrimental in critical situations, e.g., locating a person in distress. However the true information gained can only be calculated post hoc, i.e., after the observation is realized. We present an approach to estimate the discrepancy between the estimated information gain (which is the expectation over putative future observations while neglecting correlations among them) and the true information gain. The key idea is to analyze the mathematical relationship between active perception and the estimation error of the information gain in a gametheoretic setting. Using this, we develop an online estimation approach that achieves sub-linear regret (in the number of timesteps) for the estimation of the true information gain and reduces the sub-optimality of active perception systems. We demonstrate our approach 11Code is available at https://github.com/grasp-lyd/active-perception-game. Proofs are available at https://arxiv.org/abs/2404.00769.for active perception using a comprehensive set of experiments on: (a) different types of environments, including a quadrotor in a photorealistic simulation, real-world robotic data, and real-world experiments with ground robots exploring indoor and outdoor scenes; (b) different types of robotic perception data; and (c) different map representations. On average, our approach reduces information gain estimation errors by 42%, increases the information gain by 7%, PSNR by 5%, and semantic accuracy (measured as the number of objects that are localized correctly) by 6%. In real-world experiments with a Jackal ground robot, our approach demonstrated complex trajectories to explore occluded regions.

Abstract:
This paper introduces GET-Zero, a model architecture and training procedure for learning an embodimentaware control policy that can immediately adapt to new hardware changes without retraining. To do so, we present Graph Embodiment Transformer (GET), a transformer model that leverages the embodiment graph connectivity as a learned structural bias in the attention mechanism. We use behavior cloning to distill demonstration data from embodiment-specific expert policies into an embodiment-aware GET model that conditions on the hardware configuration of the robot to make control decisions. We conduct a case study on a dexterous inhand object rotation task using different configurations of a four-fingered robot hand with joints removed and with link length extensions. Using the GET model along with a selfmodeling loss enables GET-Zero to zero-shot generalize to unseen variation in graph structure and link length, yielding a 20 % improvement over baseline methods. All code and qualitative video results are on our project website.

Abstract:
Handling orientations of robots and objects is a crucial aspect of many applications. Yet, ever so often, there is a lack of mathematical correctness when dealing with orientations, especially in learning pipelines involving, for example, artificial neural networks. In this paper, we investigate reinforcement learning with orientations and propose a simple modification of the network's input and output that adheres to the Lie group structure of orientations. As a result, we obtain an easy and efficient implementation that is directly usable with existing learning libraries and achieves significantly better performance than other common orientation representations. We briefly introduce Lie theory specifically for orientations in robotics to motivate and outline our approach. Subsequently, a thorough empirical evaluation of different combinations of orientation representations for states and actions demonstrates the superior performance of our proposed approach in different scenarios, including: direct orientation control, end effector orientation control, and pick-and-place tasks.

Abstract:
For effective operation in challenging outdoor environments, mobile unmanned robots face stiff and competing demands including payload capacity, driving speed, range, as well as the ability to traverse rough terrain. To address these issues we introduce the hybrid wheel-leg quadrupedal robot WaLTER. WaLTER utilizes a unique combination of continuously rotating distal leg joints, actuated wheels, and a roll body DOF to efficiently drive on flat ground and effectively tumble over stairs and difficult, broken terrain. We developed an intuitive teleoperation scheme and employed deep reinforcement learning as proof of concept control techniques for the novel morphology. To test its capabilities, we constructed a multi-body simulation in MuJoCo and a 2.1 kg physical prototype for experimentation on traversability and energy economy. Our testing demonstrated the ability to traverse rougher terrain relative to larger-wheeled counterparts and reliable stair-climbing while maintaining a 4 km range on a 24.4 Wh battery (COT: 1.21).

Abstract:
We present a parallelized differentiable traffic simulator based on the Intelligent Driver Model (IDM), a car-following framework that incorporates driver behavior as key variables. Our vehicle simulator efficiently models vehicle motion, generating trajectories that can be supervised to fit real-world data. By leveraging its differentiable nature, IDM parameters are optimized using gradient-based methods. With the capability to simulate up to 2 million vehicles in real time, the system is scalable for large-scale trajectory optimization. We show that we can use the simulator to filter noise in the input trajectories (trajectory filtering), reconstruct dense trajectories from sparse ones (trajectory reconstruction), and predict future trajectories (trajectory prediction), with all generated trajectories adhering to physical laws. We validate our simulator and algorithm on several datasets including NGSIM and Waymo Open Dataset. The code is publicly available at: https://github.com/SonSang/diffidm.

Abstract:
We present AstroLoc2, a monocular and time-offlight (ToF) visual-inertial graph-based localizer used by the Astrobee free-flying robots on the International Space Station (ISS). AstroLoc2 sequentially performs odometry and absolute localization in a single process to decouple map noise from velocity and IMU bias estimation and run efficiently on resource constrained platforms. It improves monocular visual-inertial odometry robustness by adding ToF correspondence factors and uses adaptive map-matching to increase image registration reliability in dynamic environments while preserving fast matching in static ones. We evaluate the performance of AstroLoc2 on a public dataset of 10 ISS activities and show that it improves localization accuracy by 16 % and success rates by 5.5 % while maintaining a faster runtime than leading methods. AstroLoc2 has enabled the Astrobee robots to perform higher precision maneuvers in changing environments on the ISS. It can be configured for other limited computation platforms and we release the source code to the public.

Abstract:
In this work, we introduce a framework that enables highly maneuverable locomotion using non-periodic contacts. This task is challenging for traditional optimization and planning methods to handle due to difficulties in specifying contact mode sequences in real-time. To address this, we use a bi-level contact-implicit planner and hybrid model predictive controller to draft and execute a motion plan. We investigate how this method allows us to plan arm contact events on the shmoobot, a smaller ballbot, which uses an inverse mouseball drive to achieve dynamic balancing with a low number of actuators. Through multiple experiments we show how the arms allow for acceleration, deceleration and dynamic obstacle avoidance that are not achievable with the mouseball drive alone. This demonstrates how a holistic approach to locomotion can increase the control authority of unique robot morpohologies without additional hardware by leveraging robot arms that are typically used only for manipulation. Project website: https://cmushmoobot.github.io/Wallbounce

Abstract:
When instructing robots, users want to flexibly express constraints, refer to arbitrary landmarks, and verify robot behavior, while robots must disambiguate instructions into specifications and ground instruction referents in the real world. To address this problem, we propose Language Instruction grounding for Motion Planning (LIMP), an approach that enables robots to verifiably follow complex, open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot's alignment with an instructor's intended motives and affords the synthesis of correct-by-construction robot behaviors. We conduct a large-scale evaluation of LIMP on 150 instructions across five real-world environments, demonstrating its versatility and ease of deployment in diverse, unstructured domains. LIMP performs comparably to state-of-the-art baselines on standard open-vocabulary tasks and additionally achieves a 79% success rate on complex spatiotemporal instructions, significantly outperforming baselines that only reach 38%. 11See supplementary materials and demo videos at robotlimp.github.io

Abstract:
CAFEs (Collaborative Agricultural Floating Endeffectors) is a new robot design and control approach to automating large-scale agricultural tasks. Based upon a cable driven robot architecture, by sharing the same roller-driven cable set with modular robotic arms, a fast-switching clamping mechanism allows each CAFE to clamp onto or release from the moving cables, enabling both independent and synchronized movement across the workspace. The methods developed to enable this system include the mechanical design, precise position control and a dynamic model for the spring-mass liked system, ensuring accurate and stable movement of the robotic arms. The system's scalability is further explored by studying the tension and sag in the cables to maintain performance as more robotic arms are deployed. Experimental and simulation results demonstrate the system's effectiveness in tasks including pick-and-place showing its potential to contribute to agricultural automation.

Abstract:
As home robotics gains traction, robots are increasingly integrated into households, offering companionship and assistance. Quadruped robots, particularly those resembling dogs, have emerged as popular alternatives for traditional pets. However, user feedback highlights concerns about the noise these robots generate during walking at home, particularly the loud footstep sound. To address this issue, we propose a sim-to-real based reinforcement learning (RL) approach to minimize the foot contact velocity highly related to the footstep sound. Our framework incorporates three key elements: learning varying PD gains to actively dampen and stiffen each joint, utilizing foot contact sensors, and employing curriculum learning to gradually enforce penalties on foot contact velocity. Experiments demonstrate that our learned policy achieves superior quietness compared to a RL baseline and the carefully handcrafted Sony commercial controllers. Furthermore, the trade-off between robustness and quietness is shown. This research contributes to developing quieter and more user-friendly robotic companions in home environments.

Abstract:
Bio-inspired soft robots have gained significant attention for their flexible design and adaptability to various environments, making them suitable for exploration and task execution in confined or hazardous areas. However, the deformation and motion of soft magnetic robots rely on both their structural design and magnetization, which complicates the guided movement and balance maintenance for aquatic environments. In this work, inspired by the flat and symmetrical body of rays, we design a soft magnetic fish-shaped robot capable of flexible motions and trajectory swimming on the water surface. This robot features the muscle made of magnetic elastomer, which connects with the acrylic skeleton and silicone film fins with a soft body. In the external magnetic field, the robot achieves hovering by flapping its fins, driven by the magnetically actuated deformation of its magnetic muscle. Besides, the robot's axial magnetization enables the rapid steering guided by a horizontal field. In experiments, the soft magnetic robot was tasked with performing a looping figure-eight trajectory movement on the water surface, guided by the field gradient generated by a dense planar electromagnetic coils' array. When moving, the onboard circuit board of the robot collected its inertial and temperature information, and sent these data to the host computer via Bluetooth in real-time for motion monitoring. Received data demonstrated that our robot performed the specified afloat swimming trajectory, exhibiting a good stability on its yaw angle during the continuous motion. The soft magnetic swimming robot shows its integrated functionalities in untethered actuation, on-robot sensing, and wireless communication, indicating a significant prospect on applications in inspection and cleaning within narrow pipelines and enclosed mechanical interior spaces.

Abstract:
We present a novel approach, MAGIC (manipulation analogies for generalizable intelligent contacts), for one-shot learning of manipulation strategies with fast and extensive generalization to novel objects. By leveraging a reference action trajectory, MAGIC effectively identifies similar contact points and sequences of actions on novel objects to replicate a demonstrated strategy, such as using different hooks to retrieve distant objects of different shapes and sizes. Our method is based on a twostage contact-point matching process that combines global shape matching using pretrained neural features with local curvature analysis to ensure precise and physically plausible contact points. We experiment with three tasks including scooping, hanging, and hooking objects. MAGIC demonstrates superior performance over existing methods, achieving significant improvements in runtime speed and generalization to different object categories. Website: https://magic-2024.github.io/.

Abstract:
Dormant tree pruning is laborintensive but essential to maintaining modern highly-productive fruit orchards. In this work we present a closed-loop visuomotor controller for robotic pruning. The controller guides the cutter through a cluttered tree environment to reach a specified cut point and ensures the cutters are perpendicular to the branch. We train the controller using a novel orchard simulation that captures the geometric distribution of branches in a target apple orchard configuration. Unlike traditional methods requiring full 3D reconstruction, our controller uses just optical flow images from a wrist-mounted camera. We deploy our learned policy in simulation and the real-world for an example V-Trellis envy tree with zero-shot transfer, achieving a ~30% success rate - approximately half the performance of an oracle planner.

Abstract:
Acquiring prior knowledge of trajectory distributions in specific environments can significantly expedite the optimisation process in robot motion planning. Leveraging successful past plans and utilising trajectory generative models as priors offers a clear advantage. Previous studies have proposed various methods to harness these priors, such as using prior samples for initialisation or incorporating the prior distribution into trajectory optimisation through inference. Recently, diffusion models have demonstrated effectiveness in encoding multi-modal data in high-dimensional settings. In this study, we introduce a methodology that integrates Stein Variational Gradient Descent (SVGD) with Gaussian Process Motion Planning (GPMP), leveraging diffusion models as multi-modal priors. This approach combines the advantages of deep generative model and Bayesian inference to reduce the computation time required to approximate the posterior distribution of trajectories, particularly when adapting to new, unseen environments. In addition, we incorporate path signatures into our method to enhance the diversity of the posterior distribution, thereby improving the optimality of trajectories in multi-modal settings. To validate our approach, we conduct comparative assessments against multiple baseline methods across various scenarios, including 2D planar robots and robotic manipulators.

Abstract:
Laboratory robotics offer the capability to conduct experiments with a high degree of precision and reproducibility, with the potential to transform scientific research. Trivial and repeatable tasks; e.g., sample transportation for analysis and vial capping are well-suited for robots; if done successfully and reliably, chemists could contribute their efforts towards more critical research activities. Currently, robots can perform these tasks faster than chemists, but how reliable are they? Improper capping could result in human exposure to toxic chemicals which could be fatal. To ensure that robots perform these tasks as accurately as humans, sensory feedback is required to assess the progress of task execution. To address this, we propose a novel methodology based on behaviour trees with multi-modal perception. Along with automating robotic tasks, this methodology also verifies the successful execution of the task, a fundamental requirement in safety-critical environments. The experimental evaluation was conducted on two lab tasks: sample vial capping and laboratory rack insertion. The results show high success rate, i.e., 88% for capping and 92% for insertion, along with strong error detection capabilities. This ultimately proves the robustness and reliability of our approach and that using multi-modal behaviour trees should pave the way towards the next generation of robotic chemists.

Abstract:
The online setting brings greater flexibility and practicality to the three-dimensional bin packing problem (3DBPP) but at the cost of algorithm performance. Existing methods mitigate the performance impact by introducing semionline settings with look-ahead or buffer zones. However, these methods either fail to fundamentally alter the packing order or reduce packing efficiency. This paper proposes a novel semionline setting that allows for the observation of multiple items and the selection of one for packing, thereby adjusting the packing order without reducing packing efficiency. We do work for solving the semi-online packing problem via reinforcement learning which faces two real-world challenges: (1) a variable and difficult-to-predict number of observed items, and (2) the obstruction of robotic arm movement by already packed items. On the one hand, we design a policy network capable of adapting to variable item quantities. On the other hand, we introduce a guided bottom-up packing reward function to free up space for robotic arm motion. We show that our method outperforms the baselines in terms of space utilization with the condition of observing at least two items. Further experiments demonstrate the functionality of our reward function, which can guide a virtual robot to complete packing tasks.

Abstract:
Robot Operating System (ROS) has been widely used to develop robotic applications. The first generation of ROS generally lacks security features, and ROS 2 is introduced with security support. However, security concerns still exist for running ROS in practical multi-tenant environments. In this paper, we conduct an in-depth investigation into the security of ROS 2. We focus on vulnerabilities in ROS nodes and topics and intend to explore methods to break the isolation and security mechanisms systematically. We devise a set of strategies that can be exploited by attackers to escalate privilege or cause information leakage in a multi-tenant environment. These attacks can bypass existing isolation and security mechanisms, including ROS 2's native security module. To validate our findings, we employ simulations across various real-world scenarios to demonstrate how attackers could exploit these vulnerabilities to bypass existing security mechanisms. Finally, we present several defense practices to mitigate these identified threats.

Abstract:
Autonomous systems leveraging visual perception face a rising threat from adversarial patches, jeopardizing their robustness. Existing defense methods adaptable to various pre-trained models typically rely on observed patch characteristics or prior attack data, having difficulty adapting to new threats. This study innovatively focuses on modeling patch attack behavior instead of existing patches, proposing a unified robustness enhancement framework against various adversarial patches. Through self-supervised learning, we accurately locate diverse adversarial patches without prior attack knowledge. Furthermore, we introduce an efficient adaptive patch inpainting method to mitigate patch impact while maintaining visual coherence. Experiments show that our methods effectively boost the robustness of visual perception models against various adversarial patches across different tasks.

Abstract:
Many recent advances in robotic manipulation have come through imitation learning, yet these rely largely on mimicking a particularly hard-to-acquire form of demonstrations: those collected on the same robot in the same room with the same objects as the trained policy must handle at test time. In contrast, large pre-recorded human video datasets demonstrating manipulation skills in-the-wild already exist, which contain valuable information for robots. Is it possible to distill a repository of useful robotic skill policies out of such data without any additional requirements on robot-specific demonstrations or exploration? We present the first such system ZeroMimic, that generates immediately deployable image goal-conditioned skill policies for several common categories of manipulation tasks (opening, closing, pouring, pick&place, cutting, and stirring) each capable of acting upon diverse objects and across diverse unseen task setups. ZeroMimic is carefully designed to exploit recent advances in semantic and geometric visual understanding of human videos, together with modern grasp affordance detectors and imitation policy classes. After training ZeroMimic on the popular EpicKitchens dataset of ego-centric human videos, we evaluate its out-of-the-box performance in varied real-world and simulated kitchen settings with two different robot embodiments, demonstrating its impressive abilities to handle these varied tasks. To enable plug-and-play reuse of ZeroMimic policies on other task setups and robots, we release software and policy checkpoints of our skill policies.

Abstract:
SLAM is an important capability for many autonomous systems, and modern LiDAR-based methods offer promising performance. However, for long duration missions, existing works that either take directly the full pointclouds or extracted features face key tradeoffs in accuracy and computational efficiency (e.g., memory consumption). To address these issues, this paper presents DFLIOM with several key innovations. Unlike previous methods that rely on handcrafted heuristics and hand-tuned parameters for feature extraction, we propose a learning-based approach that select points relevant to LiDAR SLAM pointcloud registration. Furthermore, we extend our prior work DLIOM with the learned feature extractor and observe our method enables similar or even better localization performance using only about 20% of the points in the dense point clouds. We demonstrate that DFLIOM performs well on multiple public benchmarks, achieving a 2.4% decrease in localization error and 57.5% decrease in localization error and 57.5 % decrease in memory usage compared to state-of-the-art methods (DLIOM). Although extracting features with the proposed network requires extra time, it is offset by the faster processing time downstream, thus maintaining real-time performance using 20 Hz LiDAR on our hardware setup. The effectiveness of our learning-based feature extraction module is further demonstrated through comparison with several handcrafted feature extractors.

Abstract:
This paper introduces a tangent line decomposition (TLD) algorithm that efficiently finds collision-free paths close to optimal in both 2D and 3D environments. Compared with the existing visibility line-based algorithms, the proposed algorithm innovatively proposed the concept of tangent line decomposition, which decomposes complicated planning into many simple steps. For each step, only one key obstacle is taken into consideration. Besides, instead of constructing a complete graph, a best-first search algorithm is used to avoid searching redundant edges. The path planned by the algorithm is not the optimal path. However, following the idea of the informed RRT algorithm, the path length planned by TLD can be used as a precondition for other optimal algorithms. In this way, the overall efficiency can be significantly improved. The simulations show that the proposed methods outperform existing methods regarding planning efficiency and solution quality.

Abstract:
This paper investigates Path planning Among Movable Obstacles (PAMO), which seeks a minimum cost collision-free path among static obstacles from start to goal while allowing the robot to push away movable obstacles (i.e., objects) along its path when needed. To develop planners that are complete and optimal for PAMO, the planner has to search a giant state space involving both the location of the robot as well as the locations of the objects, which grows exponentially with respect to the number of objects. This paper leverages a simple yet under-explored idea that, only a small fraction of this giant state space needs to be searched during planning as guided by a heuristic, and most of the objects far away from the robot are intact, which thus leads to runtime efficient algorithms. Based on this idea, this paper introduces two PAMO formulations, i.e., bi-objective and resource constrained problems in an occupancy grid, and develops PAMO, a planning method with completeness and solution optimality guarantees, to solve the two problems. We then further extend PAMO to hybrid-state PAMO to plan in continuous spaces with high-fidelity interaction between the robot and the objects. Our results show that, PAMO can often find optimal solutions within a second in cluttered maps with up to 400 objects.

Abstract:
Efficient energy management and scalability are critical for aerial robots in tasks such as pickup-and-delivery and surveillance. This paper introduces MochiSwarm, an open-source testbed of light-weight micro robotic blimps designed for multi-robot operation without external localization. We propose a modular system architecture that integrates adaptable hardware, a flexible software framework, and a detachable perception module. The hardware is designed to allow for rapid modifications and sensor integration, while the software supports multiple actuation models and robust communication between a base station and multiple blimps. We showcase a differential-drive module as an example, in which autonomy is enabled by visual servoing using the perception module. A case study of pickup-and-delivery tasks with up to 12 blimps highlights the autonomy of the MochiSwarm without relying on external infrastructures.

Abstract:
In autonomous driving and robotic navigation, multi-sensor fusion technology has become increasingly mainstream, with precise sensor calibration as its foundation. Traditional calibration methods rely on manual effort or specific targets, limiting adaptability to complex environments. Learning-based calibration methods still face challenges, such as insufficient overlap between the fields of view (FoV) of multiple sensors and suboptimal cross-modal feature association, which hinder accurate parameter regression. Unlike traditional CNN-based networks, we propose a KansFormer-based self-Calibration Network for camera and LiDAR (KFCalibNet) that replaces fixed activation functions and linear transformations with learnable nonlinear activation functions. This enables the extraction of more fine-grained features from both image and point cloud, significantly enhancing the network's robustness in scenarios with limited FoV overlap. We also employ a multihead attention (MHA) module to compute correlations between image and point cloud features, significantly enhancing cross-modal feature association. To reduce learning complexity, we designed KansFormer with FastKAN as the feedforward network, enabling deep fusion and regression of fine-grained cross-modal features for accurate extrinsic calibration. KFCalibNet achieves an absolute average calibration error of 0.0965 cm in translation and 0.0234° in rotation on the KITTI Odometry dataset, outperforming existing state-of-the-art calibration methods. Moreover, its accuracy and generalization capability have been validated across multiple real-world railway lines.

Affiliations: The Wilfred and Joyce Posluns Centre for Image Guided Innovation & Therapeutic Intervention (PCIGITI) at the Hospital for Sick Children (SickKids), Toronto, Canada; Department of Mechanical Engineering, Concordia University, Montreal, Canada; Discipline of Medical Engineering at the University of Newcastle, Australia; Department of Mechanical and Industrial Engineering, the Department of Biomedical Engineering, Robotics Institute at the University of Toronto, Toronto, Canada

Abstract:
Knowledge of the tip contact force in continuum robots, which are often used as medical instruments, is critical for clinical applications. It enhances the interventionalist's decision-making, navigation efficiency, and procedural safety. However, accurately determining the tip contact force in conventionally sized instruments remains challenging. This study introduces a learning-based method for estimating the external contact force at the tip of a continuum robot. By leveraging curvature and bending angle data from a multi-core fiber equipped with fiber Bragg gratings (FBGs) embedded inside the Nitinol tube, the method maps these inputs to the corresponding tip force in 3D. Experiments conducted on an FBG-embedded Nitinol rod validate the feasibility of the proposed method, yielding Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) values of 20.9 \left(m N^2\right), 2.7(m N), and 4.6(m N), respectively, which represent a 26 % improvement compared to the learning-based vision methodology.

Abstract:
For leg exoskeletons to operate effectively in real-world environments, they must be able to perceive and understand the terrain around them. However, unlike other legged robots, exoskeletons face specific constraints on where depth sensors can be mounted due to the presence of a human user. These constraints lead to a limited Field Of View (FOV) and greater sensor motion, making odometry particularly challenging. To address this, we propose a novel odometry algorithm that integrates proprioceptive data from the exoskeleton with point clouds from a depth camera to produce accurate elevation maps despite these limitations. Our method builds on an extended Kalman filter (EKF) to fuse kinematic and inertial measurements, while incorporating a tailored iterative closest point (ICP) algorithm to register new point clouds with the elevation map. Experimental validation with a leg exoskeleton demonstrates that our approach reduces drift and enhances the quality of elevation maps compared to a purely proprioceptive baseline, while also outperforming a more traditional point cloud map-based variant.

Abstract:
Monocular visual inertial odometry (VIO) has facilitated a wide range of real-time motion tracking applications, thanks to the small size of the sensor suite and low power consumption. To successfully bootstrap VIO algorithms, the initialization module is extremely important. Most initialization methods rely on the reconstruction of 3D visual point clouds. These methods suffer from high computational cost as state vector contains both motion states and 3D feature points. To address this issue, some researchers recently proposed a structureless initialization method, which can solve the initial state without recovering 3D structure. However, this method potentially compromises performance due to the decoupled estimation of rotation and translation, as well as linear constraints. To improve its accuracy, we propose novel structureless visual-inertial bundle adjustment to further refine previous structureless solution. Extensive experiments on real-world datasets show our method significantly improves the VIO initialization accuracy, while maintaining real-time performance.

Abstract:
Terrestrial-aerial bimodal vehicles (TABVs) can fly to avoid obstacles and move safely on the ground to save energy, offering enhanced adaptability and flexibility in various challenging environments. However, a robust localization approach becomes a bottleneck to stably applying the TABVs in real-world tasks. Besides the general limitations of visual SLAM methods, large FoV differences between the two modes, abrupt motion strikes in mode transitions, and unstable attitude in ground mode pose great challenges. In this paper, we present an environment-aware robust localization system specifically designed for passive-wheel-based TABVs, which feature two passive wheels alongside a standard quadrotor. The localization system tightly integrates data from multiple sensors, including a stereo camera, Inertial Measurement Units (IMUs), encoders, and single-point laser distance sensors. First, we introduce a terrain-aware odometer model that accurately estimates terrain slope and vehicle's velocity. Then, we propose an anomaly-aware method that senses anomalous sensors and dynamically adjusts the optimization weights accordingly. By explicitly estimating the environmental conditions, such as ground terrain slopes and visual information qualities, the robot can achieve accurate and robust localization results on the ground. To validate our localization approach, we conducted extensive experiments across various challenging scenarios, demonstrating the effectiveness and reliability of our system for real-world applications.

Abstract:
While learned robotic policies hold promise for advancing generalizable manipulation, their practical deployment is often hindered by suboptimal execution speeds. Imitation learning policies are inherently limited by hardware constraints and the speed of the operator during data collection. In addition, there are no established methods for accelerating policies learned via imitation, and the empirical relationship between execution speed and task success remains underexplored. To address these issues, we introduce Speed Tuning, a reinforcement learning framework specifically designed to enhance the speed of manipulation policies. SPEEDTUNING learns to predict the optimal execution speed for actions, thereby complementing a base policy without necessitating additional data collection. We provide empirical evidence that SPEEDTUNING achieves substantial improvements in execution speed, exceeding 2.4x speed-up, while preserving an adequate success rate compared to both the original task policy and straightforward speed-up methods such as linear interpolation at a fixed speed. We evaluate our approach across a diverse set of dynamic and precise tasks, including pouring, throwing, and picking, demonstrating its effectiveness and robustness in enhancing real-world robotic manipulation. Videos and code are available at https://daivdyuan.github.io/speed-tuning/

Abstract:
Robotic systems for manipulation tasks are increasingly expected to be easy to configure for new tasks or unpredictable environments, while keeping a transparent policy that is readable and verifiable by humans. We propose the method BEhavior TRee eXPansion with Large Language Models (BETR-XP-LLM) to dynamically and automatically expand and configure Behavior Trees as policies for robot control. The method utilizes an LLM to resolve errors outside the task planner's capabilities, both during planning and execution. We show that the method is able to solve a variety of tasks and failures and permanently update the policy to handle similar problems in the future.

Abstract:
Motion planning is a difficult task, especially when generating feasible future trajectories in complex and interactive scenarios. While recent advancements in imitation-based planning have shown significant progress, this approach often encounters causal confusion in dynamic traffic environments. This confusion will cause the planner to incorrectly associate certain actions with outcomes, leading to suboptimal or unsafe plans. To address this, we introduce a novel framework called \overlineC^2L, which improves the planner's latent Causal understanding by incorporating Contrastive Learning and counterfactual data augmentation. Additionally, we propose a shortcut eliminator to extract copycat-free features from history states, reducing the impact of temporal spurious correlations. We validate our method on the nuPlan and interPlan benchmarks, with extensive experiments demonstrating that C^2L delivers highly competitive performance compared to state-of-the-art methods.

Abstract:
Employing a teleoperation system for gathering demonstrations offers the potential for more efficient learning of robot manipulation. However, teleoperating a robot arm equipped with a dexterous hand or gripper, via a teleoperation system presents inherent challenges due to the task's high dimensionality, complexity of motion, and differences between physiological structures. In this study, we introduce a novel system for joint learning between human operators and robots, that enables human operators to share control of a robot end-effector with a learned assistive agent, simplifies the data collection process, and facilitates simultaneous human demonstration collection and robot manipulation training. As data accumulates, the assistive agent gradually learns. Consequently, less human effort and attention are required, enhancing the efficiency of the data collection process. It also allows the human operator to adjust the control ratio to achieve a tradeoff between manual and automated control. We conducted experiments in both simulated environments and physical realworld settings. Through user studies and quantitative evaluations, it is evident that the proposed system could enhance data collection efficiency and reduce the need for human adaptation while ensuring the collected data is of sufficient quality for downstream tasks. For more details, please refer to our webpage https://norweig1an.github.io/HAJL.github.io/.

Abstract:
One challenge in human-robot interaction is selecting communication methods that fit a given robotic system and avoid overpromising. For example, verbal speech provides a clear and easy-to-understand communication method, but can inflate expectations of robot abilities. Is verbal speech the ultimate option? Might other tactics provide similar advantages with fewer downsides? The presented work focuses on addressing these important questions by 1) quantifying any inflated opinions of robots that use verbal speech and 2) gathering perspectives on alternative nonverbal sound-based communication tactics (as a means to potentially shrink gaps between expected and actual robot performance). We conducted a within-subjects online study that varied robot communication modes in videos of successful and unsuccessful mock tasks by a modern commercial robot. Assessments of robot competence and trust after an observed robot failure were higher for verbal robots, but we observed less decline in competence and trust ratings due to the failure for a nonverbal robot using character-like sound (compared to a robot using verbal communication). Human-robot interaction practitioners can use our results to design effective and robust communication strategies for robots.

Abstract:
Recently, learning-based point cloud analysis has played a crucial role in robotic perception. Masked Point Modeling (MPM), owing to its powerful representational capabilities, has become the mainstream point cloud self-supervised learning method. However, existing MPM-based methods often suffer from the problem of negative transfer, due to the disparity in semantic distribution between upstream data and downstream data. To address this issue, we propose an expert enhancement strategy for existing MPM-based methods. Specifically, we insert a Sparse Mixture of Experts (SMoE) layer after each block of the backbone network, which utilizes a multi-branch expert architecture with routers that allocate data of different semantics to the appropriate experts for analysis. During the pre-training phase, our expert-enhanced model not only learns universal 3D representations for the backbone network but also acquires powerful semantic routing capabilities for all expert layers. In the fine-tuning phase, we freeze all backbones and conduct end-to-end fine-tuning solely on our expert layers to adaptively select multiple experts most relevant to the semantics of each downstream data for analysis. Extensive downstream experiments demonstrate the superiority of our method, especially outperforming baseline (Point-MAE) by 5.16%, 5.86%, and 4.62% in three variants of ScanObjectNN while utilizing only 12% of its trainable parameters. Our code is released at https://github.com/chenchen1104/point_e2mae.

Abstract:
For lung surgery robots, the precise segmentation of pulmonary fissures is very important. Damaging the inter-lobar fissures during surgery can have serious consequences. Accurately segmenting weak and abnormal fissures commonly found in clinical CT scans remains a challenging task. To solve the above problem, we aimed to develop a novel Convolution Transformer for accurate fissure segmentation (SC-Former). The proposed SC-Former adopts an encoder, attention block, and decoder structure. First, we designed an encoder with a hybrid CNNs-transformer block that ingeniously amalgamates coordinate convolution and coordinate transformer to effectively capture both local and global feature information. Second, we introduced the long skip connections of our designed attention block at layers of the decoder-encoder structure to emphasize the field of view for fissures. Third, we added the distance map strategy to alleviate the challenge of training the network to segment the false positives from the complex textures in the lung. Fourth, we developed a multi-scale supervision strategy for independent prediction at various decoder levels, effectively integrating multi-scale semantic information to facilitate the segmentation of weak and abnormal fissures. Because of the lack of open-source inter-pulmonary fissure datasets, we collected 3D CT scans from 400 participants in the clinical trial and created a new high-quality dataset: BMI dataset. Extensive experiments on this dataset revealed the great superiority of our method over several state-of-the-art competitors. The ablation study also validated the effectiveness and robustness of each part of SC-Former.

Abstract:
A novel method for optimal motion planning in the context of a class of dynamical system is proposed in this work. Our approach is based on the design of a provably safe and convergent actor structure, which is optimized via a policy iteration method. The proposed actor has wide applications, from control of mechanical systems to providing acceleration commands for more complex robotic platforms. Extra care is taken to provide theoretical guarantees, and the scheme is validated against an existing sampling-based planner.

Abstract:
Point cloud maps with accurate color are crucial in robotics and mapping applications. Existing approaches for producing RGB-colorized maps are primarily based on realtime localization using filter-based estimation or sliding window optimization, which may lack accuracy and global consistency. In this work, we introduce a novel global LiDAR-Visual bundle adjustment (BA) named LVBA to improve the quality of RGB point cloud mapping beyond existing baselines. LVBA first optimizes LiDAR poses via a global LiDAR BA, followed by a photometric visual BA incorporating planar features from the LiDAR point cloud for camera pose optimization. Additionally, to address the challenge of map point occlusions in constructing optimization problems, we implement a novel LiDAR-assisted global visibility algorithm in LVBA. To evaluate the effectiveness of LVBA, we conducted extensive experiments by comparing its mapping quality against existing state-of-the-art baselines (i.e., \mathbfR^3 LIVE and FAST-LIVO). Our results prove that LVBA can proficiently reconstruct high-fidelity, accurate RGB point cloud maps, outperforming its counterparts.

Abstract:
Place recognition is an important technique for autonomous mobile robotic applications. While single-modal sensor-based approaches have shown satisfactory performance, cross-modal place recognition remains underexplored due to the challenge of bridging the cross-modal heterogeneity gap. In this work, we introduce an instance-aware cross-modal place recognition approach, named InsCMPR. We design a novel instance-aware modality alignment module, which aligns multi-modal data at both pixel-level and instance-level by leveraging a pre-trained vision foundation model SAM. Then a novel dual-branch hybrid Mamba-Transformer network is proposed to efficiently enhance the distinctiveness of the produced descriptors by integrating global features with local instance features. Experimental results on the KITTI, NCLT, and HAOMO datasets show that our proposed methods achieve state-of-the-art performance while operating in real time. We will open source the implementation of our method at: https://github.com/nubot-nudt/InsCMPR.

Abstract:
Implicit surface representations have emerged as a powerful tool for the task of 3D reconstruction due to their excellent performance. Yet, when the normal information cannot be available, the previous methods often lead to unsatisfactory reconstruction results, even failure. To this end, we propose a winding number—guided implicit surface reconstruction method, which mainly consists of a winding number—guided regularizer and a dynamic edge sampling strategy. Among them, the winding number-guided regularizer can effectively constrain the global normal consistency of the input raw data, as well as improve the unsatisfactory implicit surface reconstruction result caused by the unavailability of normal information. Meanwhile, in order to reduce the excessive smoothing at sharp edges of implicit surface, we proposed a dynamic edge sampling strategy for sampling near the sharp edge regions of 3D shape, which can effectively avoid the regularizer from smoothing all regions. Finally, we combine them with a simple data term for robust implicit surface reconstruction. Compared with the state-of-the-art methods, experimental results show that our method significantly improves the quality of 3D reconstruction results. In addition, since the winding number-guided regularizer effectively constraints the globally consistent normal of the input 3D raw data, our method can also receive an additional gift, namely the globally consistent normal estimation results of 3D raw data.

Abstract:
Trajectory generation in dynamic environments presents a significant challenge for quadrotors, particularly due to the non-convexity in the spatial-temporal domain. Many existing methods either assume simplified static environments or struggle to produce optimal solutions in real-time. In this work, we propose an efficient safe interval motion planning framework for navigation in dynamic environments. A safe interval refers to a time window during which a specific configuration is safe. Our approach addresses trajectory generation through a two-stage process: a front-end graph search step followed by a back-end gradient-based optimization. We ensure completeness and optimality by constructing a dynamic connected visibility graph and incorporating low-order dynamic bounds within safe intervals and temporal corridors. To avoid local minima, we propose a Uniform Temporal Visibility Deformation (UTVD) for the complete evaluation of spatial-temporal topological equivalence. We represent trajectories with B-Spline curves and apply gradient-based optimization to navigate around static and moving obstacles within spatial-temporal corridors. Through simulation and real-world experiments, we show that our method can achieve a success rate of over \mathbf9 5 % in environments with different density levels, exceeding the performance of other approaches, demonstrating its potential for practical deployment in highly dynamic environments.

Abstract:
This work addresses the problem of radially segregating heterogeneous robotic swarms. Such swarms are those composed of different groups of robots. Unlike other works on segregation in the literature, we propose a controller for Dubins-like robots, motivated by autonomous aerial, wheeled, and underwater vehicles. Our controller can drive the robots individually to converge to circles that are shared only by robots of the same group. We present a heuristic and a collision avoidance scheme in which the information required is locally acquired. We present several simulations widely varying the number of robots per group and the number of groups in which segregation is always reached and collisions between robots are always avoided.

Abstract:
6-Degree of Freedom (6DoF) motion estimation with a combination of visual and inertial sensors is a growing area with numerous real-world applications. However, precise calibration of the time offset between these two sensor types is a prerequisite for accurate and robust tracking. To address this, we propose a universal online temporal calibration strategy for optimization-based visual-inertial navigation systems. Technically, we incorporate the time offset t_d as a state parameter in the optimization residual model to align the IMU state to the corresponding image timestamp using t_d, angular velocity and translational velocity. This allows the temporal misalignment t_d to be optimized alongside other tracking states during the process. As our method only modifies the structure of the residual model, it can be applied to various optimization-based frameworks with different tracking frontends. We evaluate our calibration method with both EuRoC [1] and simulation data and extensive experiments demonstrate that our approach provides more accurate time offset estimation and faster convergence, particularly in the presence of noisy sensor data. The experimental code is available at https://github.com/bytedance/Ts_Online_Optimization.

Abstract:
This paper presents a novel vision-based approach for tracking deformable contour targets using Unmanned Aerial Vehicles (UAVs) through combining image moments descriptor and a Policy Iteration scheme ensuring stability and generalization of knowledge to new tasks. This computationally efficient and optimal control scheme is suitable for diverse dynamic environments such as the surveillance and tracking of targets with evolving features. Due to the ability of the proposed scheme to comprehend an optimization output, the generated control sequence, from an offline successively approximated policy, makes the process less challenging. The proposed methodology is validated through extensive simulations and real-word exper-iments of environmental target surveillance using an octorotor UAV.

Abstract:
Compared with real-time multi-object tracking (MOT), offline multi-object tracking (OMOT) has the advantages to perform 2D-3D detection fusion, erroneous link correction, and full track optimization but has to deal with the challenges from bounding box misalignment and track evaluation, editing, and refinement. This paper proposes “BiTrack”, a 3D OMOT framework that includes modules of 2D-3D detection fusion, initial trajectory generation, and bidirectional trajectory re-optimization to achieve optimal tracking results from camera-LiDAR data. The novelty of this paper includes threefold: (1) development of a point-level object registration technique that employs a density-based similarity metric to achieve accurate fusion of 2D-3D detection results; (2) development of a set of data association and track management skills that utilizes a vertex-based similarity metric as well as false alarm rejection and track recovery mechanisms to generate reliable bidirectional object trajectories; (3) development of a trajectory re-optimization scheme that re-organizes track fragments of different fidelities in a greedy fashion, as well as refines each trajectory with completion and smoothing techniques. The experiment results on the KITTI dataset demonstrate that BiTrack achieves the state-of-the-art performance for 3D OMOT tasks in terms of accuracy and efficiency.

Abstract:
Ensuring safety via safety filters in real-world robotics presents significant challenges, particularly when the system dynamics is complex or unavailable. To handle this issue, learning-based safety filters recently gained popularity, which can be classified as model-based and model-free methods. Existing model-based approaches requires various assumptions on system model (e.g., control-affine), which limits their application in complex systems, and existing model-free approaches need substantial modifications to standard RL algorithms and lack versatility. This paper proposes a simple, plugin-and-play, and effective model-free safety filter learning framework. We introduce a novel reward formulation and use Q-learning to learn Q-value functions to safeguard arbitrary task specific nominal policies via filtering out their potentially unsafe actions. Due to its model-free nature and simplicity, our framework can be seamlessly integrated with various RL algorithms. We validate the proposed approach through simulations on double integrator and Dubin's car systems and demonstrate its effectiveness in real-world experiments with a soft robotic limb.

Abstract:
Current advanced policy learning methodologies have demonstrated the ability to develop expert-level strategies when provided enough information. However, their requirements, including task-specific rewards, action-labeled expert trajectories, and huge environmental interactions, can be expensive or even unavailable in many scenarios. In contrast, humans can efficiently acquire skills within a few trials and errors by imitating easily accessible internet videos, in the absence of any other supervision. In this paper, we try to let machines replicate this efficient watching-and-learning process through Unsupervised Policy from Ensemble Self-supervised labeled Videos (UPESV), a novel framework to efficiently learn policies from action-free videos without rewards and any other expert supervision. UPESV trains a video labeling model to infer the expert actions in expert videos through several organically combined self-supervised tasks. Each task performs its duties, and they together enable the model to make full use of both action-free videos and reward-free interactions for robust dynamics understanding and advanced action prediction. Simultaneously, UPESV clones a policy from the labeled expert videos, in turn collecting environmental interactions for self-supervised tasks. After a sample-efficient, unsupervised, and iterative training process, UPESV obtains an advanced policy based on a robust video labeling model. Extensive experiments in sixteen challenging procedurally generated environments demonstrate that the proposed UPESV achieves state-of-the-art interaction-limited policy learning performance (outperforming five current advanced baselines on 12/16 tasks) without exposure to any other supervision except for videos.

Abstract:
We propose MAC-VO, a novel learning-based stereo visual odometry (VO) framework that trains a metrics-aware uncertainty model to serve two critical functions: selecting keypoints and weighting residuals in pose graph optimization. Unlike traditional geometric methods that favor texture-rich features like edges, our keypoint selector leverages this learned uncertainty model to eliminate low-quality features based on global inconsistency. In contrast to learning-based approaches that rely on scale-agnostic weight matrices for covariance, our metrics-aware covariance modelderived from the learned uncertaintycaptures spatial errors in keypoint registration and inter-axis correlations. By embedding this co-variance model into pose graph optimization, MAC-VO achieves superior robustness and accuracy in pose estimation, excelling in challenging environments with varying illumination, feature density, and motion patterns. Evaluations on public benchmark datasets demonstrate that MAC-VO surpasses existing VO algorithms and even some SLAM systems in difficult scenarios. Additionally, the uncertainty map offers valuable insights for decision-making.

Abstract:
Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods of visual affordance predictions often rely on manually annotated data or conditions only on a predefined set of tasks. We introduce Unsupervised Affordance Distillation (UAD), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes and to various human activities, despite only being trained on rendered objects in simulation. Using affordance provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations. Project website with Appendix: unsup-affordance.github.io/.

Abstract:
Deep learning-based object detection methods have shown significant success, particularly in robotic vision tasks like autonomous navigation and object manipulation. However, their performance drops sharply in low-light conditions, challenging robots in poorly lit environments. To address this, we propose Dark-DENet, a lightweight detection-driven enhancement network specifically designed for low-light conditions. Dark-DENet introduces an Improved Global Enhancement Module for low-frequency components to capture multiscale features, and a multi-layer convolutional structure in the Detail Enhancement Module to enhance high-frequency components. Additionally, the Scale-Aware Pooling Fusion Module enriches the semantic information of HF components. Dark-DENet is a plug-and-play network that can be easily integrated into the backbone of various detectors for joint training. Integrated with YOLOv5 as DD-YOLO, and combined with other models like YOLO series, RT-DETR, RetinaNet, and Faster R-CNN, experimental results show Dark-DENet consistently improves detection performance across all models. It effectively enhances latent features under limited runtime, making it a robust solution for robotic vision in low-light environments.

Abstract:
The sample inefficiency of reinforcement learning (RL) remains a significant challenge in robotics. RL requires large-scale simulation and can still cause long training times, slowing research and innovation. This issue is particularly pronounced in vision-based control tasks where reliable state estimates are not accessible Differentiable simulation offers an alternative by enabling gradient back-propagation through the dynamics model, providing low-variance analytical policy gradients and, hence, higher sample efficiency. However, its usage for real-world robotic tasks has yet been limited. This work demonstrates the great potential of differentiable simulation for learning quadrotor control. We show that training in differentiable simulation significantly outperforms model-free RL in terms of both sample efficiency and training time, allowing a policy to learn to recover a quadrotor in seconds when providing vehicle states and in minutes when relying solely on visual features. The key to our success is two-fold. First, the use of a simple surrogate model for gradient computation greatly accelerates training without sacrificing control performance. Second, combining state representation learning with policy learning enhances convergence speed in tasks where only visual features are observable. These findings highlight the potential of differentiable simulation for real-world robotics and offer a compelling alternative to conventional RL approaches. Video: https://youtu.be/LdgvGCLB9do Code: https://github.com/uzh-rpg/rpgflightning

Abstract:
In computer-assisted orthopedic surgery (CAOS), accurate pre-operative to intra-operative bone registration is an essential and critical requirement for providing navigational guidance. This registration process is challenging since the intra-operative 3D points are sparse, only partially overlapped with the pre-operative model, and disturbed by noise and outliers. The commonly used method in current state-of-the-art orthopedic robotic system is bony landmarks based registration, but it is very time-consuming for the surgeons. To address these issues, we propose a novel partial-to-full registration framework based on gradient-SDF for CAOS. The simulation experiments using bone models from publicly available datasets and the phantom experiments performed under both optical tracking and electromagnetic tracking systems demonstrate that the proposed method can provide more accurate results than standard benchmarks and be robust to 90% outliers. Importantly, our method achieves convergence in less than 1 second in real scenarios and mean target registration error values as low as 2.198 mm for the entire bone model. Finally, it only requires random acquisition of points for registration by moving a surgical probe over the bone surface without the need for correspondences, thus showing significant potential clinical value. The code of the framework is available.

Abstract:
Recent studies have made significant progress in addressing dexterous manipulation problems, particularly in inhand object reorientation. However, there are few existing works that explore the potential utilization of developed dexterous manipulation controllers for downstream tasks. In this study, we focus on constrained dexterous manipulation for food peeling. Food peeling presents various constraints on the reorientation controller, such as the requirement for the hand to securely hold the object after reorientation for peeling. We propose a simple system for learning a reorientation controller that facilitates the subsequent peeling task. Videos are available at: https://taochenshh.github.io/projects/veg-peeling.

Abstract:
Robot scooping is a challenging and important task in robotic tool manipulation research due to the complex relationship between the robot, the tool, and target objects/environment. Taking into account different tools, different target objects and varying environments, the required scooping manipulation strategy usually varies greatly. Even considering a specific type of spoon, the question of how to obtain a policy model that requires less demonstration data but shows better generalization capabilities deserves further exploration. In this paper, we propose a progressive learning framework for general robot scooping tasks, which requires a limited number of demonstrations but shows promising generalization capability. We first learn a scooping policy via human demonstrations with a specific setup. We then use this as a pre-train model for reinforcement learning in a curriculum manner to achieve a scooping strategy that is generalizable to different task setups. Finally, we evaluate the capabilities of the policy with a series of experiments both in simulation and on a real robot.

Abstract:
Reliable self-localization is a foundational skill for many intelligent mobile platforms. This paper explores the use of event cameras for motion tracking thereby providing a solution with inherent robustness under difficult dynamics and illumination. In order to circumvent the challenge of event camera-based mapping, the solution is framed in a cross-modal way. It tracks a map representation that comes directly from frame-based cameras. Specifically, the proposed method operates on top of gaussian splatting, a state-of-the-art representation that permits highly efficient and realistic novel view synthesis. The key of our approach consists of a novel pose parametrization that uses a reference pose plus first order dynamics for local differential image rendering. The latter is then compared against images of integrated events in a staggered coarse-to-fine optimization scheme. As demonstrated by our results, the realistic view rendering ability of gaussian splatting leads to stable and accurate tracking across a variety of both publicly available and newly recorded data sequences.

Abstract:
Multi-modal reinforcement learning (RL) has been brought into focus due to its ability to provide complementary information from different sensors, enriching observations of agents. However, the introduction of multi-modal highdimensional observations brings challenges to sample efficiency. There is a lack of research on how to efficiently obtain multi-modal latent states while encouraging them to generate complementary information. To address this, we propose a representation learning method, Multi-modal Joint Predictive Representation (MJPR), which utilizes multi-modal interactive information to predict future latent states. The joint prediction method achieves the representation training for modalities and promotes each modality to generate complementary information related to predictions of each other. In addition, we introduce multi-modal loss balancing to prompt training equilibrium and cross-modal contrastive learning (CMCL) to align the modalities for effective modal interaction. We establish the multi-modal environments in the Deepmind Control suite (DMC) and Webots and compare our method with current RL representation methods. Experimental results show that MJPR outperforms state-of-the-art methods by an average of 12.0% on six subtasks in DMC environments. It outperforms advanced methods by 16.7% and 55.4% in simple tasks and complex tasks of Webots environment, respectively. Moreover, ablation experiments are established in the DMC environment to verify the importance of each module to MJPR.

Abstract:
Recently, diffusion policy has shown impressive results in handling multi-modal tasks in robotic manipulation. However, it has fundamental limitations in out-of-distribution failures that persist due to compounding errors and its limited capability to extrapolate. One way to address these limitations is robot-gated DAgger, an interactive imitation learning with a robot query system to actively seek expert help during policy rollout. While robot-gated DAgger has high potential for learning at scale, existing methods like Ensemble-DAgger struggle with highly expressive policies: They often misinterpret policy disagreements as uncertainty at multi-modal decision points. To address this problem, we introduce Diff-DAgger, an efficient robot-gated DAgger algorithm that leverages the training objective of diffusion policy. We evaluate Diff-DAgger across different robot tasks including stacking, pushing, and plugging, and show that Diff-DAgger improves the task failure prediction by 39.0 %, the task completion rate by 20.6 %, and reduces the wall-clock time by a factor of 7.8. We hope that this work opens up a path for efficiently incorporating expressive yet data-hungry policies into interactive robot learning settings. The project website is available at: https://diffdagger.github.io.

Abstract:
This paper presents a simultaneous localization and mapping (SLAM) system to provide accurate pose estimation and dynamic scene reconstruction. Our approach proposes a Joint Point-Gaussian Splatting representation, which fully integrates the robustness of isotropic feature points in pose estimation and the flexibility of anisotropic 3D Gaussians in scene representation. This system does not need to suppress the anisotropic representation of Gaussian elements, which enables the mapping module to achieve finer scene representation with lower memory consumption. Additionally, in order to enhance the adaptability of the system in dynamic environments, we introduced a dynamic region recognition module and utilized 3D Gaussian Splatting and 4D Gaussian Splatting representations to represent static and dynamic regions respectively. Furthermore, we developed a local map management strategy for Gaussian Splatting mapping, effectively reducing the memory and computational resource usage in the mapping process. Experiments on public datasets demonstrate that our system achieves state-of-the-art tracking and mapping accuracy compared to existing baselines.

Abstract:
Keyframes are LiDAR scans saved for future reference in Simultaneous Localization And Mapping (SLAM), but despite their central importance most algorithms leave choices of which scans to save and how to use them to wasteful heuristics. This work proposes two novel keyframe selection strategies for localization and map summarization, as well as a novel approach to submap generation which selects keyframes that best constrain localization. Our results show that online keyframe selection and submap generation reduce the number of saved keyframes and improve per scan computation time without compromising localization performance. We also present a map summarization feature for quickly capturing environments under strict map size constraints.

Abstract:
Cable transmissions are commonly used in robotics for remote force transmission, offering a lightweight, compact, and efficient solution for transmitting high forces between input and output. However, cables in flexible compression housings (Bowden cables), exhibit high static friction, which increases exponentially with total bend angle. Alternatively, internally routed ball-bearing supported cable capstan transmissions are low friction, but complex and present challenges in routing multiple sets of cables. In this paper, we propose motion-decoupled cable transmission modules that address these challenges, occupying the middle ground, functioning as discrete-joint ball-bearing supported Bowden cables. Our rolling-plus-twist joint design decouples pairs of routed cables from changing significantly in tension, length, or friction during large angle motion of the linked transmission. Using sub-1 mm diameter high-strength synthetic cable, the transmission exhibits a maximum coupling motion of only 0.15 mm over the full range of motion of the cable-transmission mechanism, approximately 10% of pretension in combined hysteresis and friction, a transmission stiffness of 10 N/mm, weighing just 9 g per rolling joint and 5 g per twist joint. Two applications are demonstrated: cable routing alongside a robot arm for, say, gripper remote actuation, and remote needle advancement for an MRI-safe needle biopsy robot.

Abstract:
Video object detection (VOD) of motile cells (e.g., bacteria and sperm) under microscopy is challenging due to motion blur, sporadic out-of-focus, and pose variations. Compared with VOD in generic scenes, the lower contrast and smaller color space of microscopy imaging further introduce feature overlap between the foreground objects and the background objects (e.g., impurity cells and contaminants). Transformer-based methods have achieved great success in the VOD of generic scenes by utilizing object queries to model the inner-frame objects and the inter-frame objects. However, the appearance overlap problem in microscopy video frames significantly compromises the inter-frame query aggregation by introducing background features into the object query. To tackle this challenge, this paper reports a static-dynamic query-based VOD network that treats object queries of the current video frame and reference video frames differently. Specifically, a two-stage framework is implemented that first generates high-quality object queries of reference frames with a static Transformer decoder pre-trained on a still image dataset. The network is then trained on a per-frame annotated dataset using a dynamic Transformer decoder to model the object queries of the current frame. A Reference Query Relation Module is further proposed to enhance the reference queries for more effective aggregation with the current query. Experiments on clinically collected biopsied sperm datasets validated the effectiveness of the proposed method.

Abstract:
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate with lowlevel actions following natural language instructions in 3D environments. Most existing approaches utilize observation features from the current step to represent the viewpoint. However, these representations often conflate redundant and essential information for navigation, introducing ambiguity into the agent's action prediction. To address the problem of inadequate representation, we propose a Knowledge-andHistory Aware Visual Representation for Continuous Vision-and-Language Navigation (VLN-KHVR). The proposed approach constructs enriched visual representations tailored to navigation instructions, enhancing agents' navigation performance. Specifically, VLN-KHVR extracts image features from the current observation, retrieves relevant knowledge in the knowledge base, and obtains the history of the navigation episode. Subsequently, the knowledge and history features are filtered to eliminate the information irrelevant to navigation instruction. These refined features are integrated with the instruction for further interaction. Finally, the aggregated features are used to guide navigation. Our model outperforms previous methods on the VLN-CE benchmark, demonstrating the effectiveness of the proposed method.

Abstract:
Underwater autonomous robotic operations require online localization and 3D mapping. Because of the absence of absolute positioning underwater, these tasks strongly rely on embedded sensors, including proprioceptive or navigation sensors - which can be fused for an odometry, - and exteroceptive sensors. One of the most popular exteroceptive sensors for underwater is the imaging sonar, which emits a large fan-shaped acoustic signal and estimates the position of the surrounding obstacles from a measure of the reflected signal. This paper addresses underwater online localization and 3D mapping using a forward looking, wide-aperture imaging sonar and vehicle's intrinsic navigation estimates. We introduce 3DSSDF (3D Sonar Reconstruction Using Signed Distance Functions), a new localization and 3D mapping algorithm based on signed distance functions, which is evaluated in simulation and on real data, in man-made and natural environments. Comparisons to reference trajectories and maps demonstrate that, in our tests, 3DSSDF efficiently corrects navigation drift and that trajectory and map accuracy is always below 1 m and below 1% of the distanced travelled, which can be sufficient for the safe inspection of natural or artificial underwater structures.

Abstract:
This work presents an algorithm based on the Nondominated Sorting Genetic Algorithm II (NSGA-II) to solve multi-objective offline Multi-Robot Coverage Path Planning (MCPP) problems. The proposed algorithm embeds a donation-mutation operator and a multiple-parent crossover that generates solutions which maintain the longest path while minimizing the average path length. The algorithm also uses a library of elitism-selected high-fitness robot paths, and tournament-selected high min-max fitness paths, to construct high multi-objective fitness offspring. We evaluate the performance of our proposed algorithm against the state-of-the-art NSGA-II extended with an improved Heuristic Genetic Algorithm Crossover, and we demonstrate that for different instances of the MCPP problem, the Pareto-fronts of our proposed algorithm are not dominated by any of the points of the fronts generated by the state-of-the-art NSGA-II. A comparison has also been performed in a virtual environment simulating five drones inspecting three wind turbines. Results show that our approach exhibits a higher convergence rate for higher values of the ratio between the number of points to visit and the number of drones.

Abstract:
In this work, we consider the problem of multiagent informative path planning (IPP) for robots whose sensor visibility continuously changes as a consequence of a time-varying natural phenomenon. We leverage ergodic trajectory optimization (ETO), which generates paths such that the amount of time an agent spends in an area is proportional to the expected information in that area. We focus specifically on the problem of multi-agent drone search of a wildfire, where we use the time-varying environmental process of smoke diffusion to construct a sensor visibility model. This sensor visibility model is used to repeatedly calculate an expected information distribution (EID) to be used in the ETO algorithm. Our experiments show that our exploration method achieves improved information gathering over both baseline search methods and naive ergodic search formulations.

Abstract:
This paper presents a novel design concept of a pair of soft gripper hands that can establish a secure connection between them for bearing a large load with a low air pressure. The design was inspired by dovetail joints in carpentry that enable a tight, strong connection between two pieces of wood. We propose to mimic the dovetail joint mechanism by using soft robotic fingers that interlace to each other for secure connection. The work was motivated by the need for securing a connection between two soft robotic arms for holding a balance-impaired older adult in case of losing balance. First, the design principle of dovetail-like secure soft finger connection is presented, and its potential application to a portable fall prevention system is described. Details of the dovetail soft finger design, its rapid inflation method, and other implementation issues are then discussed. Through experiments of a proof-of-concept prototype, it is validated that the dovetail soft fingers can bear at least 18 kg of load with only 52 kPa of air chamber pressure filled in 250 ms of charging time. At the end, the proposed method is compared to alternative methods using a Pugh chart.

Abstract:
For objects with complex topological and geometrical features, stochastic topological grasping can be executed without the necessity for feedback or precise planning. However, this grasping method has two significant limitations. First, the technique's effectiveness is reduced when interacting with topologically and geometrically simple objects like spheres, cubes, and cylinders, due to the inherent variability in grasping patterns. Additionally, the method's low stiffness restricts its ability to securely handling heavier objects. To address these challenges, this paper proposes an entanglement soft robotic gripper with variable stiffness and two transformed grasping modes (entanglement and clamping modes). The gripper contains three filaments, which can enhance the stiffness through the mechanism of layer jamming. Furthermore, the entanglement mode and the clamping mode, can be transformed by adjusting the working length of the filaments. The grasping performance comparison with and without variable stiffness was carried out, and the results indicated that the implementation of variable stiffness led to a 149 % increase in payload weight. Through experimental validation, we successfully employed the gripper in variable stiffness and transformed modes to grasp items with various shapes and weights. Demonstration of grasping heavier objects and transforming between two grasping modes were also conducted to showcase the adaptability and versatility of the gripper.

Abstract:
Recently, egocentric action anticipation for wearable robotics cameras has gained considerable attention due to its capability to analyze nouns and verbs from a firstperson view. However, this field encounters challenges due to various uncertainties, such as action-irrelevant information and semantically fused representations of verbs and nouns. To overcome these issues, we introduce Ego- A^3, designed to improve the robustness and reliability of egocentric action anticipation systems. Ego- A^3 adaptively extracts actionrelevant data to efficiently utilize additional information beyond visual data. Additionally, Ego- A^3 produces effective disentangled representations for verbs and nouns by employing learnable verb and noun queries. Experiments on the EpicKitchens-100 and EGTEA Gaze+ datasets demonstrate that Ego- A^3 outperforms existing methods in top-1 accuracy and mean top- 5 recall. Our code is publicly available at https://github.com/alsgur0720/egocentricanticipation.

Abstract:
Tactile perception is essential for human interaction with the environment and is becoming increasingly crucial in robotics. Tactile sensors like the BioTac mimic human fingertips and provide detailed interaction data. Despite its utility in applications like slip detection and object identification, this sensor is now deprecated, making many valuable datasets obsolete. However, recreating similar datasets with newer sensor technologies is both tedious and time-consuming. Therefore, adapting these existing datasets for use with new setups and modalities is crucial. In response, we introduce ACROSS, a novel framework for translating data between tactile sensors by exploiting sensor deformation information. We demonstrate the approach by translating BioTac signals into the DIGIT sensor. Our framework consists of first converting the input signals into 3D deformation meshes. We then transition from the 3D deformation mesh of one sensor to the mesh of another, and finally convert the generated 3D deformation mesh into the corresponding output space. We demonstrate our approach to the most challenging problem of going from a low-dimensional tactile representation to a high-dimensional one. In particular, we transfer the tactile signals of a BioTac sensor to DIGIT tactile images. Our approach enables the continued use of valuable datasets and data exchange between groups with different setups.

Abstract:
Vision language models have played a key role in extracting meaningful features for various robotic applications. Among these, Contrastive Language-Image Pretraining (CLIP) is widely used in robotic tasks that require both vision and natural language understanding. However, CLIP was trained solely on static images paired with text prompts and has not yet been fully adapted for robotic tasks involving dynamic actions. In this paper, we introduce Robotic-CLIP to enhance robotic perception capabilities. We first gather and label large-scale action data, and then build our Robotic-CLIP by fine-tuning CLIP on 309,433 videos (≈ 7.4 million frames) of action data using contrastive learning. By leveraging action data, Robotic-CLIP inherits CLIP's strong image performance while gaining the ability to understand actions in robotic contexts. Intensive experiments show that our Robotic-CLIP outperforms other CLIP-based models across various language-driven robotic tasks. Additionally, we demonstrate the practical effectiveness of Robotic-CLIP in real-world grasping applications.

Abstract:
This paper introduces a novel cable-driven rodent ankle exoskeleton system designed for in-vivo research on the restoration and enhancement of sensorimotor abilities. The system features a lightweight, actuator-decoupled exoskeleton for shaping motion and providing kinesthetic feedback, along with a vision system and feedback-controlled treadmill for gait analysis. Experiments conducted under anesthesia and in awake conditions demonstrated effective control with minimal interference to natural gait. Dynamic time warping distance and Pearson correlation coefficients were calculated between joint angles from natural gait and those from rats wearing both passive and active exoskeleton component. The knee joint showed a low DTW distance and high correlation regardless of conditions, while all three joint displayed a greater maximum value from natural gait when the active component was engaged. These results provide valuable insights into the physiological impacts of wearable robotics in animal models, advancing sensorimotor rehabilitation technologies.

Abstract:
Ophthalmic surgical robots offer superior stability and precision by reducing the natural hand tremors of human surgeons, enabling delicate operations in confined surgical spaces. Despite the advancements in developing vision- and force-based control methods for surgical robots, preoperative navigation remains heavily reliant on manual operation, limiting the consistency and increasing the uncertainty. Existing eye gaze estimation techniques in the surgery, whether traditional or deep learning-based, face challenges including dependence on additional sensors, occlusion issues in surgical environments, and the requirement for facial detection. To address these limitations, this study proposes an innovative eye localization and tracking method that combines machine learning with traditional algorithms, eliminating the requirements of landmarks and maintaining stable iris detection and gaze estimation under varying lighting and shadow conditions. Extensive real-world experiment results show that our proposed method has an average estimation error of 0.58 degrees for eye orientation estimation and 2.08-degree average control error for the robotic arm's movement based on the calculated orientation.

Abstract:
Simultaneous Localization and Mapping (SLAM) allows mobile robots to navigate without external positioning systems or pre-existing maps. Radar is emerging as a valuable sensing tool, especially in vision-obstructed environments, as it is less affected by particles than lidars or cameras. Modern 4D imaging radars provide three-dimensional geometric information and relative velocity measurements, but they bring challenges, such as a small field of view and sparse, noisy point clouds. Detecting loop closures in SLAM is critical for reducing trajectory drift and maintaining map accuracy. However, the directional nature of 4D radar data makes identifying loop closures, especially from reverse viewpoints, difficult due to limited scan overlap. This article explores using 4D radar for loop closure in SLAM, focusing on similar and opposing viewpoints. We generate submaps for a denser environment representation and use introspective measures to reject false detections in feature-degenerate environments. Our experiments show accurate loop closure detection in geometrically diverse settings for both similar and opposing viewpoints, improving trajectory estimation with up to 82% improvement in ATE and rejecting false positives in self-similar environments.

Abstract:
Cable transmission enables motors of robotic arm to operate lightweight and low-inertia joints remotely in various environments, but it also creates issues with motion coupling and cable routing that can reduce arm's control precision and performance. In this paper, we present a novel motion decoupling mechanism with low-friction to align the cables and efficiently transmit the motor's power. By arranging these mechanisms at the joints, we fabricate a fully decoupled and lightweight cable-driven robotic arm called D3-Arm with all the electrical components be placed at the base. Its 776 mm length moving part boasts six degrees of freedom (DOF) and only 1.6 kg weights. To address the issue of cable slack, a cable-pretension mechanism is integrated to enhance the stability of long-distance cable transmission. Through a series of comprehensive tests, D3-Arm demonstrated 1.29 mm average positioning error and 2.0 kg payload capacity, proving the practicality of the proposed decoupling mechanisms in cable-driven robotic arm.

Abstract:
Articulation between body segments of small insects and animals is a three degree-of-freedom (DOF) motion. Implementing this kind of motion in a compact robot is usually not tractable due to limitations in small actuator technologies. In this work, we concede full 3-DOF control and instead select a one degree-of-freedom curve in SO(3) to articulate segments of a caterpillar robot. The curve is approximated with a spherical four-bar, which is synthesized through optimal rigid body guidance. We specify the desired SO(3) motion using discrete task positions, then solve for candidate mechanisms by computing all roots of the stationary conditions using numerical homotopy continuation. A caterpillar robot prototype demonstrates the utility of this approach. This synthesis procedure is also used to design prolegs for the caterpillar robot. Each segment contains two DC motors and a shape memory alloy, which is used for latching and unlatching between segments. The caterpillar robot is capable of walking, steering, object manipulation, body articulation, and climbing.

Abstract:
High-speed object tracking holds significant relevance across robotic domains, such as drones and autonomous driving. Compared to conventional cameras, event cameras are equipped with the ability to capture object motion information at exceptionally high temporal resolution with relatively low power consumption and remain immune from motion-blurring effects. Regrettably, many existing methods adopt a framebased approach by stacking events into Event Frame, which overlooks the sparsity and high temporal resolution of events. This approach is also reliant on the huge pre-training backbone and reaches a performance plateau but demands unrealistically large networks and high power consumption, rendering it impractical for real-time applications in battery-constrained robotic scenarios. In this paper, we propose an efficient and effective single-modality tracker using Point Cloud representation named E2B (Event to Box). By directly handling the raw output of event cameras without dataformat transformation, E2B leverages events' coordinate guidance to accurately map Event Cloud features to 2D bounding boxes. Moreover, E2B incorporates the pyramid structure into the multi-stage feature extraction architecture to effectively track objects across diverse scales. In the experiments, E2B performs outstandingly on two large-scale and one synthetic event-based tracking datasets, covering both indoor and outdoor environments, as well as rigid and non-rigid objects.

Abstract:
Imitation learning is a popular method for teaching robots new behaviors. However, most existing methods focus on teaching short, isolated skills rather than long, multistep tasks. To bridge this gap, imitation learning algorithms must not only learn individual skills but also an abstract understanding of how to sequence these skills to perform extended tasks effectively. This paper addresses this challenge by proposing a neuro-symbolic imitation learning framework. Using task demonstrations, the system first learns a symbolic representation that abstracts the low-level state-action space. The learned representation decomposes a task into easier subtasks and allows the system to leverage symbolic planning to generate abstract plans. Subsequently, the system utilizes this task decomposition to learn a set of neural skills capable of refining abstract plans into actionable robot commands. Experimental results in three simulated robotic environments demonstrate that, compared to baselines, our neuro-symbolic approach increases data efficiency, improves generalization capabilities, and facilitates interpretability.

Abstract:
We aim to develop a general multi-agent reinforcement learning (MARL) policy that enables a group of robots to efficiently explore large-scale, unknown environments with random pose initialization. Existing MARL-based multi-robot exploration methods face challenges in reliably mapping observations to actions in large-scale scenarios and lack of zero-shot generalization to unknown environments. To this end, we propose a generic multi-task pre-training algorithm (termed TaskExp) to enhance the generalization of learning-based policies. In particular, we design a decision-related task to guide the policy to focus on valuable subspaces of the action space, improving the reliability of policy mapping. Moreover, two perception-related tasks-Location Estimation and Map Prediction-are designed to enhance the zero-shot capability of the policy by guiding it to extract general invariant features from unknown environments. With TaskExp pre-training, our policy significantly outperforms state-of-the-art planning-based methods in large-scale scenarios and demonstrates strong zero-shot performance in unseen environments. Furthermore, TaskExp can also be easily integrated to improve the existing learning-based multi-robot exploration methods.

Abstract:
The Multi-Agent Combinatorial Path Finding (MCPF) problem is a generalized version of the Multi-Agent Path Finding (MAPF) problem, in which each agent must collectively visit multiple intermediate target locations on the way to its final destination. The state-of-the-art approach for addressing MCPF, known as Conflict-Based Steiner Search (CBSS) [1], leverages K-best joint sequences to create multiple search trees, and employs a CBS-like search to resolve collisions for each tree. Despite its optimality guarantee, CBSS is computationally burdensome due to the duplicated collision resolutions across multiple trees and the computation of the K best joint sequences. To address these challenges, we propose a novel algorithm called Improved Conflict-Based Steiner Search (ICBSS), aiming at expediting CBSS by replacing the multi trees with a single constraint tree (CT), which can be implemented by interleaving the time-dependent traveling salesman algorithm to compute the optimal joint path for agents under the newly generated constraints in each CT vertex. Additionally, we introduce a sub-optimal variant of ICBSS, which improves computational efficiency at the expense of solution optimality. Empirical results show that ICBSS outperforms state-of-the-art MCPF algorithms on a variety of MAPF instances.

Abstract:
This paper proposes a heterogeneous teaming solution to the problem of target discovery and monitoring in unknown, non-convex environments. The team consists of two types of agents: agile agents with sensors capable of mapping their surroundings and slower agents that are capable of monitoring or servicing discovered targets. We propose an exploration algorithm that utilizes the IRIS algorithm to generate a graph decomposition from collision free ellipses contained within the environment. This graph is passed to the monitoring agents who execute polynomial complexity assignment and touring algorithms to generate high quality path plans which service all discovered targets. Our algorithmic structure allows the team to solve the problems of exploration, target discovery, assignment, and monitoring within unknown, non-convex environments efficiently using limited information. The performance of our proposed method is verified through batch simulations and complexity analysis.

Abstract:
This paper introduces the Proactive Assistance through action-Completion Estimation (PACE) framework, designed to enhance human-robot collaboration through real-time monitoring of human progress. PACE incorporates a novel method that combines Dynamic Time Warping (DTW) with correlation analysis to track human task progression from hand movements. PACE trains a reinforcement learning policy from limited demonstrations to generate a proactive assistance policy that synchronizes robotic actions with human activities, minimizing idle time and enhancing collaboration efficiency. We validate the framework through user studies involving 12 participants, showing significant improvements in interaction fluency, reduced waiting times, and positive user feedback compared to traditional methods.

Abstract:
We present DemoStart, a novel auto-curriculum reinforcement learning method capable of learning complex manipulation behaviors on an arm equipped with a three- fingered robotic hand, from only a sparse reward and a handful of demonstrations in simulation. Learning from simulation drastically reduces the development cycle of behavior generation, and domain randomization techniques are leveraged to achieve successful zero-shot sim-to- real transfer. Transferred policies are learned directly from raw pixels from multiple cameras and robot proprioception. Our approach outperforms policies learned from demonstrations on the real robot and requires 100 times fewer demonstrations, collected in simulation. More details and videos in sites.google.com/view/demostart.

Abstract:
Spherical Modular Self-reconfigurable Robots (SMSRs) have been popular in recent years. Their Self-reconfigurable nature allows them to adapt to different en-vironments and tasks, and achieve what a single module could not achieve. To collaborate with each other, relative localization between each module and assembly is crucial. Existing relative localization methods either have low accuracy, which is unsuit-able for short-distance collaborations, or are designed for fixed-shape robots, whose visual features remain static over time. This paper proposes the first visual relative localization method for SMSRs. We first detect and identify individual modules of SMSRs, and adopt visual tracking to improve the detection and identification robustness. Using an optimization-based method, tracking result is then fused with odometry to estimate the relative pose between assemblies. To deal with the non-convexity of the optimization problem, we adopt semi-definite relaxation to transform it into a convex form. The proposed method is validated and analysed in real-world experiments. The overall localization performance and the performance under time-varying configuration are evaluated. The result shows that the relative position estimation accuracy reaches 2%, and the orientation estimation accuracy reaches 6.64°, and that our method surpasses the state-of-the-art methods.

Abstract:
Preference- based reinforcement learning (PbRL) has shown significant promise for personalization in human- robot interaction (HRI) by explicitly integrating human preferences into the robot learning process. However, existing practices often require training a personalized robot policy from scratch, resulting in inefficient use of human feedback. In this paper, we propose preference-based action representation learning (PbARL), an efficient fine-tuning method that decouples common task structure from preference by leveraging pre-trained robot policies. Instead of directly fine-tuning the pre-trained policy with human preference, PbARL uses it as a reference for an action representation learning task that maximizes the mutual information between the pre-trained source domain and the target user preference-aligned domain. This approach allows the robot to personalize its behaviors while preserving original task performance and eliminates the need for extensive prior information from the source domain, thereby enhancing efficiency and practicality in real-world HRI scenarios. Empirical results on the Assistive Gym benchmark and a real-world user study (N=8) demonstrate the benefits of our method compared to state-of-the-art approaches. Website at https://sites.google.com/view/pbarl.

Abstract:
Multi-sensor fusion is essential for autonomous vehicle localization, as it is capable of integrating data from various sources for enhanced accuracy and reliability. The accuracy of the integrated location and orientation depends on the precision of the uncertainty modeling. Traditional methods of uncertainty modeling typically assume a Gaussian distribution and involve manual heuristic parameter tuning. However, these methods struggle to scale effectively and address long-tail scenarios. To address these challenges, we propose a learning-based method that encodes sensor information using higher-order neural network features, thereby eliminating the need for uncertainty estimation. This method significantly eliminates the need for parameter fine-tuning by developing an end-to-end neural network that is specifically designed for multi-sensor fusion. In our experiments, we demonstrate the effectiveness of our approach in real-world autonomous driving scenarios. Results show that the proposed method outperforms existing multi-sensor fusion methods in terms of both accuracy and robustness. A video of the results can be viewed at https://youtu.be/q4iuobMbjME.

Abstract:
Aerial manipulation for safe physical interaction with their environments is gaining significant momentum in robotics research. In this paper, we present a disturbance-observer-based safety-critical control for a fully actuated aerial manipulator interacting with both static and dynamic structures. Our approach centers on a safety filter that dynamically adjusts the desired trajectory of the vehicle's pose, accounting for the aerial manipulator's dynamics, the disturbance observer's structure, and motor thrust limits. We provide rigorous proof that the proposed safety filter ensures the forward invariance of the safety set—representing motor thrust limits—even in the presence of disturbance estimation errors. To demonstrate the superiority of our method over existing control strategies for aerial physical interaction, we perform comparative experiments involving complex tasks, such as pushing against a static structure and pulling a plug firmly attached to an electric socket. Furthermore, to highlight its repeatability in scenarios with sudden dynamic changes, we perform repeated tests of pushing a movable cart and extracting a plug from a socket. These experiments confirm that our method not only outperforms existing methods but also excels in handling tasks with rapid dynamic variations.

Abstract:
This paper introduces a virtual gravity controller for underactuated biped robots. A bio-inspired model of passive bipedal walking is used as the basis for the controller's design. An analytical expression of the controller is obtained, allowing on-line implementations of the developed control scheme. Following a design modification tailored to the controller, the robot is able to reproduce its passive gait even on level-ground. The results are verified via independent high-fidelity physics simulations of the real robot's digital twin. The active robot demonstrates significant dynamic convergence to the passive model's dynamics, with only minor motorization efforts. The developed control scheme showcases robustness and energetic efficiency, and leads the way to a design-oriented approach in active biped locomotion.

Abstract:
Multi-Robot Path Planning (MRPP) on graphs, also known as Multi-Agent PathFinding (MAPF), is a well-established NP-hard problem with critically important applications. In (near)-optimally solving MRPP, as serial computation approaches its efficiency limits, parallelization offers a promising route to extend that limit further. As a single solution is unlikely to be successful in addressing all settings, e.g., in handling small/hard or large/sparse MRPP instances, in this study, we explore a targeted parallelization effort to boost the performance of conflict-based search for MRPP. Specifically, when instances are relatively small but robots are densely packed with strong interactions, we devise a decen-tralized parallel algorithm that concurrently explores multiple branches that leads to markedly enhanced solution discovery. On the other hand, for large problems with sparse robot-robot interactions, we find that prioritizing node expansion and conflict resolution more promising. Our innovative multi-threaded approach to parallelizing bounded-suboptimal conflict search-based algorithms demonstrates significant improvements over baseline serial methods in success rate or runtime. Our work furthers the understanding of MRPP and charts a promising path for elevating solution quality and computational efficiency through parallel algorithmic strategies.

Abstract:
We investigate the Combined Target-Assignment and Path-Finding (TAPF) problem that computes both task assignments and collision-free paths for multiple agents, that is, each agent is required to select a target from an underlying set, reaching which leads to a payoff. There is a cost closely related to the time required for each agent to reach the goal. The objective is to maximize the minimum gain generated by the agents. We proposed a Compilation-Based Approach with Heuristics (TA-CBWH) to approximate the optimal solution, behind which are two critical ideas: (i) for a specific task assignment, we formulate an integer linear programming (ILP) and create the iteration combined with large neighborhood search (LNS) to quickly improve the solution quality to near-optimal; (ii) regarding distinct task assignments, a switching mechanism is developed to determine the most promising iteration while progressively eliminating unnecessary task assignments. Comparative experiments demonstrate that TA-CBWH outperforms a wide range of existing approaches across various maps and different numbers of agents.

Abstract:
Robots must operate safely when deployed in novel and human-centered environments, like homes. Current safe control approaches typically assume that the safety constraints are known a priori, and thus, the robot can precompute a corresponding safety controller. While this may make sense for some safety constraints (e.g., avoiding collision with walls by analyzing a floor plan), other constraints are more complex (e.g., spills), inherently personal, context-dependent, and can only be identified at deployment time when the robot is interacting in a specific environment and with a specific person (e.g., fragile objects, expensive rugs). Here, language provides a flexible mechanism to communicate these evolving safety constraints to the robot. In this work, we use vision language models (VLMs) to interpret language feedback and the robot's image observations to continuously update the robot's representation of safety constraints. With these inferred constraints, we update a Hamilton-Jacobi reachability safety controller online via efficient warm-starting techniques. Through simulation and hardware experiments, we demonstrate the robot's ability to infer and respect language-based safety constraints with the proposed approach.

Abstract:
Autonomous navigation in crowded environments remains a significant challenge due to the highly dynamic and unpredictable nature of pedestrian movements. This paper presents a novel approach for socially-compliant crowd navigation by leveraging human pose tracking, trajectory prediction, and obstacle avoidance techniques. We introduce PoseTrajNet, an end-to-end autonomous agent navigation pipeline that integrates YOLOv8 for object detection, BlazePose for real-time human pose estimation, and a custom trajectory prediction model drawing on concepts from Social GANs. PoseTrajNet employs pose keypoints as socially-compliant features to anticipate pedestrian trajectories, enabling proactive path planning and dynamic safe radius adjustments for obstacle avoidance. Extensive evaluations on standard datasets demonstrate PoseTrajNet's effectiveness in seamless crowd navigation, outperforming baselines while adhering to social norms.

Abstract:
This study presents an innovative offset-trimmed helicoids (OTH) structure, featuring a tunable deformation center that emulates the flexibility of human fingers. This design significantly reduces the actuation force needed for larger elastic deformations, particularly when dealing with harder materials like thermoplastic polyurethane (TPU). The incorporation of two helically routed tendons within the finger enables both in- plane bending and lateral out-of-plane transitions, effectively expanding its workspace and allowing for variable curvature along its length. Compliance analysis indicates that the compliance at the fingertip can be fine-tuned by adjusting the mounting placement of the fingers. This customization enhances the gripper's adaptability to a diverse range of objects. By leveraging TPU's substantial elastic energy storage capacity, the gripper is capable of dynamically rotating objects at high speeds, achieving approximately 60° in just 15 milliseconds. The three-finger gripper, with its high dexterity across six degrees of freedom, has demonstrated the capability to successfully perform intricate tasks. One such example is the adept spinning of a rod within the gripper's grasp.

Abstract:
Unlike most human-engineered systems, many biological systems rely on emergent behaviors from low-level interactions, enabling greater diversity and superior adaptation to complex, dynamic environments. This study explores emergent decentralized rotation in the Loopy multicellular robot, composed of homogeneous, physically linked, 1-degree-of-freedom cells. Inspired by biological systems like sunflowers, Loopy uses simple local interactions-diffusion, reaction, and active transport of simulated chemicals, called morphogens-without centralized control or knowledge of its global morphology. Through these interactions, the robot self-organizes to achieve coordinated rotational motion and forms lobes-local protrusions created by clusters of motor cells. This study investigates how these interactions drive Loopy's rotation, the impact of its morphology, and its resilience to actuator failures. Our findings reveal two distinct behaviors: 1) inner valleys between lobes rotate faster than the outer peaks, contrasting with rigid body dynamics, and 2) cells rotate in the opposite direction of the overall morphology. The experiments show that while Loopy's morphology does not affect its angular velocity relative to its cells, larger lobes increase cellular rotation and decrease morphology rotation relative to the environment. Even with up to one-third of its actuators disabled and significant morphological changes, Loopy maintains its rotational abilities, highlighting the potential of decentralized, bio-inspired strategies for resilient and adaptable robotic systems.

Abstract:
Recent advances in AI have led to significant results in robotic learning, but skills like grasping remain partially solved. Many recent works exploit synthetic grasping datasets to learn to grasp unknown objects. However, those datasets were generated using simple grasp sampling methods using priors. Recently, Quality-Diversity (QD) algorithms have been proven to make grasp sampling significantly more efficient. In this work, we extend QDG-6DoF, a QD framework for generating object-centric grasps, to scale up the production of synthetic grasping datasets. We propose a data augmentation method that combines the transformation of object meshes with transfer learning from previous grasping repertoires. The conducted experiments show that this approach reduces the number of required evaluations per discovered robust grasp by up to 20 %. We used this approach to generate QDGset, a dataset of 6 DoF grasp poses that contains about 3.5 and 4.5 times more grasps and objects, respectively, than the previous state-of-the-art. Our method allows anyone to easily generate data, eventually contributing to a large-scale collaborative dataset of synthetic grasps.

Abstract:
We explore how intermediate policy representations can facilitate generalization by providing guidance on how to perform manipulation tasks. Existing representations such as language, goal images, and trajectory sketches have been shown to be helpful, but these representations either do not provide enough context or provide over-specified context that yields less robust policies. We propose conditioning policies on affordances, which capture the pose of the robot at key stages of the task. Affordances offer expressive yet lightweight abstractions, are easy for users to specify, and facilitate efficient learning by transferring knowledge from large internet datasets. Our method, RT-Affordance, is a hierarchical model that first proposes an affordance plan given the task language, and then conditions the policy on this affordance plan to perform manipulation. Our model can flexibly bridge heterogeneous sources of supervision including large web datasets and robot trajectories. We additionally train our model on cheap-to-collect in-domain affordance images, allowing us to learn new tasks without collecting any additional costly robot trajectories. We show on a diverse set of novel tasks how RT-Affordance exceeds the performance of existing methods by over 50 %, and we empirically demonstrate that affordances are robust to novel settings. Videos available at https://snasiriany. me/rt-affordance

Abstract:
Large Language Models (LLMs) have made substantial advancements in the field of robotic and autonomous driving. This study presents the first Occupancy-based Large Language Model (Occ-LLM), which represents a pioneering effort to integrate LLMs with an important representation. To effectively encode occupancy as input for the LLM and address the category imbalances associated with occupancy, we propose Motion Separation Variational Autoencoder (MS-VAE). This innovative approach utilizes prior knowledge to distinguish dynamic objects from static scenes before inputting them into a tailored Variational Autoencoder (VAE). This separation enhances the model's capacity to concentrate on dynamic trajectories while effectively reconstructing static scenes. The efficacy of Occ-LLM has been validated across key tasks, including 4D occupancy forecasting, self-ego planning, and occupancybased scene question answering. Comprehensive evaluations demonstrate that Occ-LLM significantly surpasses existing state-of-the-art methodologies, achieving gains of about 6% in Intersection over Union (IoU) and 4% in mean Intersection over Union (mIoU) for the task of 4D occupancy forecasting. These findings highlight the transformative potential of Occ-LLM in reshaping current paradigms within robotic and autonomous driving.

Abstract:
Model predictive control (MPC) algorithms can be sensitive to model mismatch when used in challenging nonlinear control tasks. In particular, the performance of MPC for vehicle control at the limits of handling suffers when the underlying model overestimates the vehicle's performance capabilities. In this work, we propose a risk-averse MPC framework that explicitly accounts for uncertainty over friction limits and tire parameters. Our approach leverages a sample-based approximation of an optimal control problem with a conditional value at risk (CVaR) constraint. This sample-based formulation enables planning with a set of expressive vehicle dynamics models using different tire parameters. Moreover, this formulation enables efficient numerical resolution via sequential quadratic programming and GPU parallelization. Experiments on a Lexus LC 500 show that risk-averse MPC unlocks reliable performance, while a deterministic baseline that plans using a single dynamics model may lose control of the vehicle in adverse road conditions.

Abstract:
Two-vehicle racing is natural example of a competitive dynamic game. As with most dynamic games, there are many ways in which the underlying solution concept can be structured, resulting in different equilibrium concepts. The assumed solution concept influences the behaviors of two interacting players in racing. For example, blocking behavior emerges naturally in leader-follower play, but to achieve this in Nash play the costs would have to be chosen specifically to trigger this behavior. In this work, we develop a novel model for competitive two-player vehicle racing, represented as an equilibrium problem, complete with simplified aerodynamic drag and drafting effects, as well as position-dependent collisionavoidance responsibility. We use our model to explore how different solution concepts affect competitiveness. We develop a solution for bilevel optimization problems, enabling a largescale empirical study comparing bilevel strategies (either as leader or follower), Nash equilibrium strategy and a singleplayer constant velocity baseline. We find the choice of strategies significantly affects competitive performance and safety.

Abstract:
Autonomous racing has rapidly gained research attention. Traditionally, racing cars rely on 2D LiDAR as their primary visual system. In this work, we explore the integration of an event camera with the existing system to provide enhanced temporal information. Our goal is to fuse the 2D LiDAR data with event data in an end-to-end learning framework for steering prediction, which is crucial for autonomous racing. To the best of our knowledge, this is the first study addressing this challenging research topic. We start by creating a multisensor dataset specifically for steering prediction. Using this dataset, we establish a benchmark by evaluating various SOTA fusion methods. Our observations reveal that existing methods often incur substantial computational costs. To address this, we apply low-rank techniques to propose a novel, efficient, and effective fusion design. We introduce a new fusion learning policy to guide the fusion process, enhancing robustness against misalignment. Our fusion architecture provides better steering prediction than LiDAR alone, significantly reducing the RMSE from 7.72 to 1.28. Compared to the second-best fusion method, our work represents only 11% of the learnable parameters while achieving better accuracy. The source code and dataset are publicly available at: https://github.com/ZZY-Zhou/F1Tenth-Steering.

Abstract:
This paper presents the first worldwide functional prototype omnidirectional multi-rotor aerial vehicle with fixed uni-directional thrusters, with an on-board power source. An optimization algorithm computes the positions and orientations of the propellers in the body frame of the prototype to achieve the omnidirectional capability, while minimizing the platform's weight and the required thrust to hover at any orientation, in addition to other construction requirements. The effect of the aerodynamic interaction between the different propellers is identified experimentally, and the ensuing results are included in the optimization algorithm to avoid such interactions during flight. The prototype's performance is assessed in real experiments demonstrating the decoupling between the forces and moments of the drone, its ability to track concurrently independent positions and orientations, and its ability to hover at a fixed position while rotating.

Abstract:
Recent progress in large language models and access to large-scale robotic datasets has sparked a paradigm shift in robotics models transforming them into generalists able to adapt to various tasks, scenes, and robot modalities. A large step for the community are open Vision Language Action models which showcase strong performance in a wide variety of tasks. In this work, we study the visual generalization capabilities of three existing robotic foundation models, and propose a corresponding evaluation framework. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. This is potentially caused by limited variations in the training data and/or catastrophic forgetting, leading to domain limitations in the vision foundation models. We further explore OpenVLA, which uses two pre-trained vision foundation models and is, therefore, expected to generalize to out-of-domain experiments. However, we showcase catastrophic forgetting by DINO-v2 in OpenVLA through its failure to fulfill the task of depth regression. To overcome the aforementioned issue of visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA - which requires the adaptation of the visual backbones during initial training - to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by a factor of 77% and 66% for grasping and lifting in visual OOD tasks. Comprehensive evaluations, episode rollouts and model weights are available on the ReVLA Page

Abstract:
Multimodal foundation models offer promising advancements for enhancing driving perception systems, but their high computational and financial costs pose challenges. We develop a method that leverages foundation models to refine predictions from existing driving perception modelssuch as enhancing object classification accuracy-while minimizing the frequency of using these resource-intensive models. The method quantitatively characterizes uncertainties in the perception model's predictions and engages the foundation model only when these uncertainties exceed a pre-specified threshold. Specifically, it characterizes uncertainty by calibrating the perception model's confidence scores into theoretical lower bounds on the probability of correct predictions using conformal prediction. Then, it sends images to the foundation model and queries for refining the predictions only if the theoretical bound of the perception model's outcome is below the threshold. Additionally, we propose a temporal inference mechanism that enhances prediction accuracy by integrating historical predictions, leading to tighter theoretical bounds. The method demonstrates a 10 to 15 percent improvement in prediction accuracy and reduces the number of queries to the foundation model by 50 percent, based on quantitative evaluations from driving datasets.

Abstract:
Semantic segmentation of 3D LiDAR data plays a pivotal role in autonomous driving. Traditional approaches rely on extensive annotated data for point cloud analysis, incurring high costs and time investments. In contrast, real-world image datasets offer abundant availability and substantial scale. To mitigate the burden of annotating 3D LiDAR point clouds, we propose two crossmodal knowledge distillation methods: Unsupervised Domain Adaptation Knowledge Distillation (UDAKD) and Feature and Semantic-based Knowledge Distillation (FSKD). Leveraging readily available spatio-temporally synchronized data from cameras and LiDARs in autonomous driving scenarios, we directly apply a pretrained 2D image model to unlabeled 2D data. Through crossmodal knowledge distillation with known 2D-3D correspondence, we actively align the output of the 3D network with the corresponding points of the 2D network, thereby obviating the necessity for 3D annotations. Our focus is on preserving modality-general information while filtering out modality-specific details during crossmodal distillation. To achieve this, we deploy self-calibrated convolution on 3D point clouds as the foundation of our domain adaptation module. Rigorous experimentation validates the effectiveness of our proposed methods, consistently surpassing the performance of state-of-the-art approaches in the field. Code is available at https://github.com/KangJialiang/DAKD.

Abstract:
“Robot factors” design, analogous to ergonomics for humans, seeks to create devices and equipment that can be readily operated by robots, by considering typical capabilities of current robots throughout the design process. While a number of principles and heuristics for robot factors design have been identified, the successful design of hardware operable by autonomous robots often depends in practice on the designer's intuition about robot capabilities, developed through personal experience working with robots. Here we present a tool we have developed to help evaluate a potential device design for usability by a robot, by allowing a designer to in effect teleoperate a virtual robot and attempt the operation of the device. The tool uses a 3D physics-based simulation built in Unity, and a Phantom Omni / Geomagic Touch haptic device that controls the virtual robot's end-effector and provides force feedback. Through user studies, we show that the use of this tool can significantly improve a user's estimation of the suitability of a design for robot operation, in two case studies involving replacing a unit in a modular hardware system and unzipping a canvas bag. By incorporating the use of such a tool early in the design cycle, designers can more effectively develop equipment to be used by autonomous robots without themselves needing direct robotics experience; as a result, robots will be able to take on more tasks in the nearer term with current robot technology.

Abstract:
In this paper, we propose a closed-loop sensitivity-based approach to enhance the robustness of robotic non-prehensile dynamic manipulation tasks. The proposed method aims at fulfilling the transportation of an object, that is free to move on a tray-shaped robot end-effector, in face of not perfectly known nominal dynamic parameters. The approach is built up on taking the parameterized reference trajectory to be tracked as the optimization variable minimizing a norm of the task closed-loop sensitivity. The resulting optimal reference trajectory is inherently more robust to the parametric variations of object dynamic properties compared to a baseline trajectory execution. The tracking performance is assessed and validated along hardware experiments and an extensive simulation campaign assessing the superior robustness of our approach.

Abstract:
We consider the problem of safe real-time navigation of a robot in a dynamic environment with moving obstacles of arbitrary smooth geometries and input saturation constraints. We assume that the robot detects and models nearby obstacle boundaries with a short-range sensor and that this detection is error-free. This problem presents three main challenges: i) input constraints, ii) safety, and iii) real-time computation. To tackle all three challenges, we present a layered control architecture (LCA) consisting of an offline path library generation layer, and an online path selection and safety layer. To overcome the limitations of reactive methods, our offline path library consists of feasible controllers, feedback gains, and reference trajectories. To handle computational burden and safety, we solve online path selection and generate safe inputs that run at 100 Hz. Through simulations on Gazebo and Fetch hardware in an indoor environment, we evaluate our approach against baselines that are layered, end - to-end, or reactive. Our experiments demonstrate that among all algorithms, only our proposed LCA is able to complete tasks such as reaching a goal, safely. When comparing metrics such as safety, input error, and success rate, we show that our approach generates safe and feasible inputs throughout the robot execution.

Abstract:
Koopman operator theory provides a powerful data-driven technique for modeling nonlinear dynamical systems in a linear framework, in comparison to computationally expensive and highly nonlinear physics-based simulations. However, Koopman operator-based models for soft robots are very high dimensional and require considerable amounts of data to properly resolve. Inspired by physics-informed techniques from machine learning, we present a novel physics-informed Koopman operator identification method that improves simulation accuracy for small dataset sizes. Through Strang splitting, the method takes advantage of both continuous and discrete Koopman operator approximation to obtain information both from trajectory and phase space data. The method is validated on a tendon-driven soft robotic arm, showing orders of magnitude improvement over standard methods in terms of the shape error. We envision this method can significantly reduce the data requirement of Koopman operators for systems with partially known physical models, and thus reduce the cost of obtaining data. More info: https://sunrobotics.lab.asu.edu/blog/2024/ristich-icra-2025/

Abstract:
A novel framework for training a robotic fish to learn how to swim, even in the presence of degradations or failures in actuators is developed. Robotic underwater robots, particularly soft fish-inspired designs have gained significant attention due to their distinct benefits, including superior maneuverability, energy efficiency, versatile applications, and seamless integration with marine environments. However, their material properties and actuators can degrade, leading to pre-mature system failures. In this paper, we introduce the concept of actuator drop-out during training, to enable the robot to learn how to swim even when one or more actuators are degraded or non-functional. A Soft Actor-Critic Deep Reinforcement Learning architecture is used to learn a policy, with actuator degradations/failures introduced during training. A four actuator koi fish is modeled and simulated using the FishGym environment. Navigation-based validation tests show little degradation with one actuator failure, and much more robust swimming behaviors and performance compared to training with no failures, even when two or three actuators fail. These results will improve long-term operational reliability, ensuring robot fish functionality even in challenging underwater conditions.

Abstract:
This article presents an interactive control approach allowing a human user to teleoperate a robotic manipulator located nearby. With this approach, the user keeps his/her hands free, as only head movements are exploited to control the robot. The controller maps the 6 Degrees of Freedom (DoF) user's head position and orientation into the 6 DoF robot endeffector position and orientation. The robot can reach a large workspace thanks to the combination of two features. Firstly, a virtual wand between the user's head and the robot end-effector converts user's head pantilt rotations into large displacements of the robot end-effector center perpendicularly to the wand axis (2 DoF). Secondly, for the remaining 4 DoF (robot end-effector center displacement along the wand axis and robot en-effector orientation), realtime deformation of the virtual wand is triggered when the user reaches uncomfortable configurations due to his/her head workspace limitations. Additionally, the user gets, through an Augmented Reality (AR) Headset, a non-delayed visual feedback of the current virtual wand geometry and location. The paper includes a description of the setup and the proposed controller, detailing how the robot position/orientation is coupled to the user's head position/orientation. A set of elementary experiments with a constant-geometry wand is first presented, showing workspace limitations for some DoF. Then the wand reconfiguration is introduced in the experiments, leading to full control of 6 DoF manipulation tasks throughout a large workspace.

Abstract:
As entertainment robots gain popularity, the demand for natural and expressive motion, particularly in dancing, continues to rise. Traditionally, dancing motions have been manually designed by artists, a process that is both labor-intensive and restricted to simple motion playback, lacking the flexibility to incorporate additional tasks such as locomotion or gaze control during dancing. To overcome these challenges, we introduce Deep Fourier Mimic (DFM), a novel method that combines advanced motion representation with Reinforcement Learning (RL) to enable smooth transitions between motions while concurrently managing auxiliary tasks during dance sequences. While previous frequency domain based motion representations have successfully encoded dance motions into latent parameters, they often impose overly rigid periodic assumptions at the local level, resulting in reduced tracking accuracy and motion expressiveness, which is a critical aspect for entertainment robots. By relaxing these locally periodic constraints, our approach not only enhances tracking precision but also facilitates smooth transitions between different motions. Furthermore, the learned RL policy that supports simultaneous base activities, such as locomotion and gaze control, allows entertainment robots to engage more dynamically and interactively with users rather than merely replaying static, predesigned dance routines.

Abstract:
Safety is one of the most crucial challenges of autonomous driving vehicles, and one solution to guarantee safety is to employ an additional control revision module after the planning backbone. Control Barrier Function (CBF) has been widely used because of its strong mathematical foundation on safety. However, the incompatibility with heterogeneous perception data and incomplete consideration of traffic scene elements make existing systems hard to be applied in dynamic and complex real-world scenarios. In this study, we introduce a generalized control revision method for autonomous driving safety, which adopts both vectorized perception and occupancy grid map as inputs and comprehensively models multiple types of traffic scene constraints based on a new proposed barrier function. Traffic elements are integrated into one unified framework, decoupled from specific scenario settings or rules. Experiments on CARLA, SUMO, and OnSite simulator prove that the proposed algorithm could realize safe control revision under complicated scenes, adapting to various planning backbones, road topologies, and risk types. Physical platform validation also verifies the real-world application feasibility.

Abstract:
A high-precision, high-efficiency, and lightweight panoptic driving perception system is an essential part of autonomous driving for optimal maneuver planning of the autonomous vehicle. We propose a simple, lightweight, and efficient SCAM-P multi-task learning network that accomplishes three crucial tasks simultaneously for panoptic driving: vehicle detection, drivable area segmentation, and lane segmentation. To increase the representation power of the shared backbone of our multi-task network, we designed a novel SCAM module with spatially localized channel attention and channel localized spatial attention blocks. SCAM is a lightweight module that can be plugged into any CNN architecture to enhance the semantic features with negligible computational overhead. We integrate our SCAM module and design the SCAM-P network, which has a shared backbone for feature extraction and three independent heads to handle three tasks at the same time. We also designed a nano variant of our SCAM-P network to make it deployment-friendly on edge devices. Our SCAM-P network obtains competitive results on the BDD100K dataset with 81.1 % mAP50 for object detection, 91.6 % mIoU for drivable area segmentation, and 28.8 % IoU for lane segmentation. Our model is robust in various adverse weather conditions, such as rainy, snowy, and at night. Our SCAM-P network not only achieves improved performance but also runs efficiently in real-time at 230.5 FPS on the RTX 4090 GPU and 112.1 FPS on the Jetson Orin edge device.

Abstract:
Visual pattern recognition usually plays important roles in robotics and automation society where the pattern recognition relies on representation learning. Existing representation learning often neglects two important issues, the diversity of intra-class representation and under-exploited label utilization, especially the negative feedback during training process. Fortunately, prototype learning potentially raises label utilization and encourages intra-class diversity. In this paper, we investigate the intra-class diversity and effective updates in prototype learning for enhanced visual pattern recognition. Specifically, we propose a Label-aware multi-Prototype learning, LamPro, by incorporating the label awareness into both prototype formation and update to improve the representation quality. Firstly, we design a supervised contrastive learning to achieve class-discriminative representations. Secondly, we randomly initialize multiple prototypes and update the nearest prototype upon the arrival of instance, to preserve intra-class diversity. Thirdly, we propose a novel Label-guided Adaptive Updating. We separate the prototype updates from the representation optimization and exploit the label indexes to directly implement the prediction feedback. To correct the model optimization directions, we identify the negative feedback, and correct the prototype updates via queries of labels. Finally, we design a memory-based counter to alternately update these deviated prototypes. Experiments verify the effectiveness of our label-aware and joint multi-prototype updating strategies.

Abstract:
Fleets of autonomous robots have been deployed for exploration of unknown scenes for features of interest, e.g., subterranean exploration, reconnaissance, search and rescue missions. During exploration, the robots may encounter un-identified targets, blocked passages, interactive objects, temporary failure, or other unexpected events, all of which require consistent human assistance with reliable communication for a time period. This however can be particularly challenging if the communication among the robots is severely restricted to only close-range exchange via ad-hoc networks, especially in extreme environments like caves and underground tunnels. This paper presents a novel human-centric interactive exploration and assistance framework called FlyKites, for multi-robot systems under limited communication. It consists of three interleaved components: (I) the distributed exploration and intermittent communication (called the “spread mode”), where the robots collaboratively explore the environment and exchange local data among the fleet and with the operator; (II) the simultaneous optimization of the relay topology, the operator path, and the assignment of robots to relay roles (called the”relay mode”), such that all requested assistance can be provided with minimum delay; (III) the human-in-the-loop online execution, where the robots switch between different roles and interact with the operator adaptively. Extensive human-in-the-loop simulations and hardware experiments are performed over numerous challenging scenes.

Abstract:
Capsulorhexis is challenging in cataract surgery, since the size, centering, and circularity of the capsule are important. Those indicators are closely related to the subsequent step of phacoemulsification and the postoperative position of the intraocular lens. It takes 3-5 years for a resident to practice, while the occurrence of deficient capsulorhexis is still inevitable. This paper proposes a robotic system to automate Continuous Curvilinear Capsulorhexis (CCC) in cataract surgery. A typical ophthalmic microscope system and a triaxial force sensor are utilized to guide the robot system with a force-vision method. The constraint of a Remote Center of Motion (RCM) is designed to perform the surgery route. The experimental results on exvivo porcine eyes show our autonomous method can achieve a satisfactory 6 mm capsule. With an average centering deviation below 7.6 % and circularity of 0.993, the consistency of the capsulorhexis is comparable to a surgeon-made one.

Abstract:
Learning to perform accurate and rich simulations of human driving behaviors from data for autonomous vehicle testing remains challenging due to human driving styles' high diversity and variance. We address this challenge by proposing a novel approach that leverages contrastive learning to extract a dictionary of driving styles from pre-existing human driving data. We discretize these styles with quantization, and the styles are used to learn a conditional diffusion policy for simulating human drivers. Our empirical evaluation confirms that the behaviors generated by our approach are both safer and more human-like than those of the machine-learning-based baseline methods. We believe this has the potential to enable higher realism and more effective techniques for evaluating and improving the performance of autonomous vehicles.

Abstract:
Steerable needles offer a minimally invasive method to deliver treatment to hard-to-reach tissue regions. We introduce a new class of tape-spring steerable needles capable of sharp turns ranging from 15 to 150 degrees with a turn radius as low as 3 mm, which minimizes surrounding tissue damage. In this work, we derive and experimentally validate a geometric model for our steerable needle design. We evaluate both manual and robotic steering of the needle along a Dubins path in 7 kPa and 13 kPa tissue phantoms, simulating our target clinical application in healthy and unhealthy liver tissue. We conduct experiments to measure needle robustness to stiffness transitions between non-homogeneous tissues. We demonstrate progress towards clinical use with needle tip tracking via ultrasound imaging, navigation around anatomical obstacles, and integration with a robotic autonomous steering system.

Abstract:
The increasing demand in e-commerce, combined with labor shortages and rising wages, is driving the rapid automation of warehouse operations. A critical aspect of this shift is bin packing, where diverse unknown items of varying sizes and shapes must be optimally arranged within a bin or container. Robot bin packing is receiving growing attention and presents unique challenges due to the broad range of objects, packing rules, and task-specific requirements. In response, we propose So-Pack, a generalist packing heuristic for irregularly shaped objects integrated into a flexible, weighted multi-heuristic planning system. The system demonstrates robust performance across general packing scenarios and flexibility to adapt to changing packing rules and specific end-user requirements. Experimental results show that the system outperforms state-of-the-art approaches in key metrics on a new challenging dataset of retail objects in real-world applications.

Abstract:
Traffic simulation, complementing real-world data with a long-tail distribution, allows for effective evaluation and enhancement of the ability of autonomous vehicles to handle accident-prone scenarios. Simulating such safety-critical scenarios is nontrivial, however, from log data that are typically regular scenarios, especially in consideration of dynamic adversarial interactions between the future motions of autonomous vehicles and surrounding traffic participants. To address it, this paper proposes an innovative and efficient strategy, termed IntSim, that explicitly decouples the driving intentions of surrounding actors from their motion planning for realistic and efficient safety-critical simulation. We formulate the adversarial transfer of driving intention as an optimization problem, facilitating extensive exploration of diverse attack behaviors and efficient solution convergence. Simultaneously, intention-conditioned motion planning benefits from powerful deep models and large-scale real-world data, permitting the simulation of realistic motion behaviors for actors. Specially, through adapting driving intentions based on environments, IntSim facilitates the flexible realization of dynamic adversarial interactions with autonomous vehicles. Finally, extensive open-loop and closed-loop experiments on real-world datasets, including nuScenes and Waymo, demonstrate that the proposed IntSim achieves state-of-the-art performance in simulating realistic safety-critical scenarios and further improves planners in handling such scenarios.

Abstract:
The robot-based tracking of highly dynamic end point motions of deformable linear objects (DLO) remains challenging due to its non-linear behavior. Since simple feedback control is infeasible, model-based control offers potential to account for the non-linear effects, but requires computation efficient and accurate models. Promising results have been achieved utilizing data-driven models that introduce a latent kinematic chain as model of the DLO and mapping measurements of the tip position in its latent joint space, in which the dynamic motion model is learned. So far, this approach has the limitation that it can not handle situations of incomplete sensory information, for instance if occlusion occurs. Consequently, this paper introduces a fusion network architecture capable of making predictions even if sensory information is incomplete. We achieve additional state estimation of the latent joint state by learning a data driven inverse kinematics with help of wrench measurements at the DLO base and evaluate our approach by simulating occlusion. We demonstrate the computational effectiveness of our approach for in the loop control tasks.

Abstract:
The 3D Gaussian Splatting (3DGS)-based SLAM system has garnered widespread attention due to its excellent performance in real-time high-fidelity rendering. However, in real-world environments with dynamic objects, existing 3DGS-based SLAM systems often face mapping errors and tracking drift issues. To address these problems, we propose GARAD-SLAM, a real-time 3DGS-based SLAM system tailored for dynamic scenes. In terms of tracking, unlike traditional methods, we directly perform dynamic segmentation on Gaussians and map them back to the front-end to obtain dynamic point labels through a Gaussian pyramid network, achieving precise dynamic removal and robust tracking. For mapping, we impose rendering penalties on dynamically labeled Gaussians, which are updated through the network, to avoid irreversible erroneous removal caused by simple pruning. Our results on real-world datasets demonstrate that our method is competitive in tracking compared to baseline methods, generating fewer artifacts and higher-quality reconstructions in rendering.

Abstract:
Multi-modal 3D semantic segmentation is vital for applications such as autonomous driving and virtual reality (VR). To effectively deploy these models in real-world scenarios, it is essential to employ cross-domain adaptation techniques that bridge the gap between training data and real-world data. Recently, self-training with pseudo-labels has emerged as a predominant method for cross-domain adaptation in multi-modal 3D semantic segmentation. However, generating reliable pseudo-labels necessitates stringent constraints, which often result in sparse pseudo-labels after pruning. This sparsity can potentially hinder performance improvement during the adaptation process. We propose an image-guided pseudo-label enhancement approach that leverages the complementary 2D prior knowledge from the Segment Anything Model (SAM) to introduce more reliable pseudo-labels, thereby boosting domain adaptation performance. Specifically, given a 3D point cloud and the SAM masks from its paired image data, we collect all 3D points covered by each SAM mask that potentially belong to the same object. Then our method refines the pseudo-labels within each SAM mask in two steps. First, we determine the class label for each mask using majority voting and employ various constraints to filter out unreliable mask labels. Next, we introduce Geometry-Aware Progressive Propagation (GAPP) which propagates the mask label to all 3D points within the SAM mask while avoiding outliers caused by 2D-3D misalignment. Experiments conducted across multiple datasets and domain adaptation scenarios demonstrate that our proposed method significantly increases the quantity of high-quality pseudo-labels and enhances the adaptation performance over baseline methods.

Abstract:
In autonomous driving, thermal image semantic segmentation has emerged as a critical research area, owing to its ability to provide robust scene understanding under adverse visual conditions. In particular, unsupervised domain adaptation (UDA) for thermal image segmentation can be an efficient solution to address the lack of labeled thermal datasets. Nevertheless, since these methods do not effectively utilize the complementary information between RGB and thermal images, they significantly decrease performance during domain adaptation. In this paper, we present a comprehensive study on cross-spectral UDA for thermal image semantic segmentation. We first propose a novel masked mutual learning strategy that promotes complementary information exchange by selectively transferring results between each spectral model while masking out uncertain regions. Additionally, we introduce a novel prototypical self-supervised loss designed to enhance the performance of the thermal segmentation model in nighttime scenarios. This approach addresses the limitations of RGB pre-trained networks, which cannot effectively transfer knowledge under low illumination due to the inherent constraints of RGB sensors. In experiments, our method achieves higher performance over previous UDA methods and comparable performance to state-of-the-art supervised methods.

Abstract:
In this paper, we propose a novel cross-platform fault-tolerant surfacing controller for underwater robots, based on reinforcement learning (RL). Unlike conventional approaches, which require explicit identification of malfunctioning actuators, our method allows the robot to surface using only the remaining operational actuators without needing to pinpoint the failures. The proposed controller learns a robust policy capable of handling diverse failure scenarios across different actuator configurations. Moreover, we introduce a transfer learning mechanism that shares a part of the control policy across various underwater robots with different actuators, thus improving learning efficiency and generalization across platforms. To validate our approach, we conduct simulations on three different types of underwater robots: a hoveringtype AUV, a torpedo shaped AUV, and a turtle-shaped robot (U-CAT). Additionally, real-world experiments are performed, successfully transferring the learned policy from simulation to a physical U-CAT in a controlled environment. Our RLbased controller demonstrates superior performance in terms of stability and success rate compared to a baseline controller, achieving an 85.7 percent success rate in real-world tests compared to \mathbf5 7. 1 percent with a baseline controller. This research provides a scalable and efficient solution for fault-tolerant control for diverse underwater platforms, with potential applications in real-world aquatic missions.

Abstract:
When navigating and interacting in challenging environments where sensory information is imperfect and incomplete, robots must make decisions that account for these shortcomings. We propose a novel method for quantifying and representing such perceptual uncertainty in 3D reconstruction through occupancy uncertainty estimation. We develop a framework to incorporate it into grasp selection for autonomous manipulation in underwater environments. Instead of treating each measurement equally when deciding which location to grasp from, we present a framework that propagates uncertainty inherent in the multi-view reconstruction process into the grasp selection. We evaluate our method with both simulated and the real world data, showing that by accounting for uncertainty, the grasp selection becomes robust against partial and noisy measurements. Code will be made available at https://onurbagoren.github.io/PUGS/

Abstract:
Flying through body-size narrow gaps in the environment is one of the most challenging moments for an underactuated multirotor. We explore a purely data-driven method to master this flight skill in simulation, where a neural network directly maps pixels and proprioception to continuous low-level control commands. This learned policy enables wholebody control through gaps with different geometries demanding sharp attitude changes (e.g., near-vertical roll angle). The policy is achieved by successive model-free reinforcement learning (RL) and online observation space distillation. The RL policy receives (virtual) point clouds of the gaps' edges for scalable simulation and is then distilled into a high-dimensional pixel space. However, this flight skill is fundamentally expensive to learn by exploring in RL due to restricted feasible solution space. We propose to reset the agent as states on the trajectories generated by a model-based trajectory optimizer to alleviate this problem. The presented training pipeline is compared with baseline methods, and ablation studies are conducted to identify the key ingredients of the method. The immediate next step is to demonstrate the sim-to-real transformation, which can be challenging due to the high precision demands by this extreme flight skill.

Abstract:
Reinforcement learning (RL) has achieved outstanding success in complex robot control tasks, such as drone racing, where the RL agents have outperformed human champions in a known racing track. However, these agents fail in unseen track configurations, always requiring complete retraining when presented with new track layouts. This work aims to develop RL agents that generalize effectively to novel track configurations without retraining. The naïve solution of training directly on a diverse set of track layouts can overburden the agent, resulting in suboptimal policy learning as the increased complexity of the environment impairs the agent's ability to learn to fly. To enhance the generalizability of the RL agent, we propose an adaptive environment-shaping framework that dynamically adjusts the training environment based on the agent's performance. We achieve this by leveraging a secondary RL policy to design environments that strike a balance between being challenging and achievable, allowing the agent to adapt and improve progressively. Using our adaptive environment shaping, one single racing policy efficiently learns to race in diverse challenging tracks. Experimental results validated in both simulation and the real world show that our method enables drones to successfully fly complex and unseen race tracks, outperforming existing environment-shaping techniques. Website: http://rpg.ifi.uzh.ch/env_as_policy.

Abstract:
Expressive robotic behavior is essential for the widespread acceptance of robots in social environments. Recent advancements in learned legged locomotion controllers have enabled more dynamic and versatile robot behaviors. However, determining the optimal behavior for interactions with different users across varied scenarios remains a challenge. Current methods either rely on natural language input, which is efficient but low-resolution, or learn from human preferences, which, although high-resolution, is sample inefficient. This paper introduces a novel approach that leverages priors generated by pre- trained LLMs alongside the precision of preference learning. Our method, termed Language-Guided Preference Learning (LGPL), uses LLMs to generate initial behavior samples, which are then refined through preference-based feedback to learn behaviors that closely align with human expectations. Our core insight is that LLMs can guide the sampling process for preference learning, leading to a substantial improvement in sample efficiency. We demonstrate that LGPL can quickly learn accurate and expressive behaviors with as few as four queries, outperforming both purely language-parameterized models and traditional preference learning approaches. Website with videos: this http url.

Abstract:
State-of-the-art hierarchical localisation pipelines (HLoc) employ image retrieval (IR) to establish 2D-3D correspondences by selecting the top-k most similar images from a reference database. While increasing k improves localisation robustness, it also linearly increases computational cost and runtime, creating a significant bottleneck. This paper investigates the relationship between global and local descriptors, showing that greater similarity between the global descriptors of query and database images increases the proportion of feature matches. Low similarity queries significantly benefit from increasing k, while high similarity queries rapidly experience diminishing returns. Building on these observations, we propose an adaptive strategy that adjusts k based on the similarity between the query's global descriptor and those in the database, effectively mitigating the feature-matching bottleneck. Our approach reduces computational costs and processing time without sacrificing accuracy. Experiments on three indoor and outdoor datasets show that AIR-HLoc reduces feature matching time by up to 30% while preserving state-of-the-art accuracy. The results demonstrate that AIR-HLoc facilitates a latency-sensitive localisation system.

Abstract:
Reliable automated driving technology is challenged by various sources of uncertainties, in particular, behavioral uncertainties of traffic agents. It is common for traffic agents to have intentions that are unknown to others, leaving an automated driving car to reason over multiple possible behaviors. This paper formalizes a behavior planning scheme in the presence of multiple possible futures with corresponding probabilities. We present a maximum entropy formulation and show how, under certain assumptions, this allows delayed decision-making to improve safety. The general formulation is then turned into a model predictive control formulation, which is solved as a quadratic program or a set of quadratic programs. We discuss implementation details for improving computation and verify operation in simulation and on a mobile robot.

Abstract:
To deploy safe and agile robots in cluttered environments, there is a need to develop fully decentralized controllers that guarantee safety, respect actuation limits, prevent deadlocks, and scale to thousands of agents. Current approaches fall short of meeting all these goals: optimization-based methods ensure safety but lack scalability, while learning-based methods scale but do not guarantee safety. We propose a novel algorithm to achieve safe and scalable control for multiple agents under limited actuation. Specifically, our approach includes: (i) learning a decentralized neural Integral Control Barrier function (neural ICBF) for scalable, input-constrained control, (ii) embedding a lightweight decentralized Model Predictive Control-based Integral Control Barrier Function (MPC-ICBF) into the neural network policy to ensure safety while maintaining scalability, and (iii) introducing a novel method to minimize deadlocks based on gradient-based optimization techniques from machine learning to address local minima in deadlocks. Our numerical simulations show that this approach outperforms state-of-the-art multi-agent control algorithms in terms of safety, input constraint satisfaction, and minimizing deadlocks. Additionally, we demonstrate strong generalization across scenarios with varying agent counts, scaling up to 1000 agents. Videos and code are available at https://maicbf.github.io/.

Abstract:
This paper presents the results of a within-subjects user study with 27 participants over the age of 60, comparing the use of two different user interfaces for an assistive robot scooter. The graphical user interface (GUI) shows a representation of the environment on a 10-inch touchscreen. The tangible user interface (TUI) consists of a joystick, a box of buttons, and a projector - designed to keep the user's attention in the real world. Trends suggest that the TUI could help mitigate difficulty caused by highly cluttered environments, as well as differences in individual spatial reasoning ability, but additional studies are needed.

Abstract:
Multi-view stereo (MVS) implicitly encodes photometric and geometric cues into the cost volume for multi-view correspondence matching, transferring insufficient geometric cues essential to depth estimation and reconstruction. This paper proposes GE-MVS, a novel multi-view stereo network with geometric encoding for more accurate and complete depth estimation and point cloud reconstruction. First, the cross-view adaptive cost volume aggregation module is proposed to strengthen multi-view geometric cues encoding during cost volume construction. Then, the depth consistency optimization is performed in the 3D point space during learning by invoking ground-truth depth cues from adjacent views. Finally, the surface normal geometries are explicitly encoded to refine the sampled depth hypotheses to be consistent in the local neighbor regions. Extensive experiments on the standard MVS benchmarks including DTU, Tanks and Temples, and BlendedMVS demonstrate the state-of-the-art depth estimation and point cloud reconstruction performance of GE-MVS. The GE-MVS is further deployed in real-world experiments for UAV-based large-scale reconstruction, where our method outperforms the prevalent industrial reconstruction solutions concerning reconstruction efficiency and efficacy. Our project page is: https://cuhk-usr-group.github.io/GE-MVS/

Abstract:
Task and motion planning (TAMP) frameworks address long and complex planning problems by integrating high-level task planners with low-level motion planners. However, existing TAMP methods rely heavily on the manual design of planning domains that specify the preconditions and postconditions of all high-level actions. This paper proposes a method to automate planning domain inference from a handful of test-time trajectory demonstrations, reducing the reliance on human design. Our approach incorporates a deep learning-based estimator that predicts the appropriate components of a domain for a new task and a search algorithm that refines this prediction, reducing the size and ensuring the utility of the inferred domain. Our method can generate new domains from minimal test time demonstrations, enabling robots to handle complex tasks more efficiently. We demonstrate that our approach outperforms behaviour cloning baselines, which directly imitate planner behaviour, in terms of planning performance and generalization across a variety of tasks. Additionally, our method reduces computational costs and data amount requirements at test time for inferring new planning domains.

Abstract:
Vision-based target motion estimation is a fundamental problem in many robotic tasks. The existing methods have the limitation of low observability and, hence, face challenges in tracking highly maneuverable targets. Motivated by the aerial target pursuit task where a target may maneuver in 3D space, this paper studies how to further enhance observability by incorporating the bearing rate information that has not been well explored in the literature. The main contribution of this paper is to propose a new cooperative estimator called STT-R (Spatial-Temporal Triangulation with bearing Rate), which is designed under the framework of distributed recursive least squares. This theoretical result is further verified by numerical simulation and real-world experiments. It is shown that the proposed STT-R algorithm can effectively generate more accurate estimations and effectively reduce the lag in velocity estimation, enabling tracking of more maneuverable targets.

Abstract:
For multi-robot teams with limited communication, the ability to rapidly recognize the intention of a teammate via its exhibited behavior is key to achieving effective collaboration. While current research on plan and goal recognition provide powerful tools, most of them rely on a high-level abstraction of the environment and of its dynamics. We propose online waypoint recognition (OWR) that incorporates knowledge about the dynamic models into the analysis of the observed agent behavior. Our algorithm takes the form of a Kalman filter and performs recognition of the agent's intended waypoint at high frequency. The approach is robust to uncertainties in dynamics and observations. Moreover, it does not require the agent to reach the next waypoint to perform recognition, which saves valuable time. Our empirical evaluation shows the ability of our proposed algorithm to expedite recognition of both simulated and real-world mobile robots.

Abstract:
The task of vision-based 3D occupancy prediction aims to reconstruct 3D geometry and estimate its semantic classes from 2D color images, where the 2D-to-3D view transformation is an indispensable step. Most previous methods conduct forward projection, such as BEVPooling and VoxelPooling, both of which map the 2D image features into 3D grids. However, the current grid representing features within a certain height range usually introduces many confusing features that belong to other height ranges. To address this challenge, we present Deep Height Decoupling (DHD), a novel framework that incorporates explicit height prior to filter out the confusing features. Specifically, DHD first predicts height maps via explicit supervision. Based on the height distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to adaptively decouple the height map into multiple binary masks. MGHS projects the 2D image features into multiple subspaces, where each grid contains features within reasonable height ranges. Finally, a Synergistic Feature Aggregation (SFA) module is deployed to enhance the feature representation through channel and spatial affinities, enabling further occupancy refinement. On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art performance even with minimal input frames. Source code is released at https://github.com/yanzq95/DHD.

Abstract:
Autonomous driving datasets are essential for validating the progress of intelligent vehicle algorithms, which include localization, perception, and prediction. However, existing datasets are predominantly focused on structured urban environments, which limits the exploration of unstructured and specialized scenarios, particularly those characterized by significant dust levels. This paper introduces the LiDARDustX dataset, which is specifically designed for perception tasks under high-dust conditions, such as those encountered in mining areas. The LiDARDustX dataset consists of 30,000 LiDAR frames captured by six different LiDAR sensors, each accompanied by 3D bounding box annotations and point cloud semantic segmentation. Notably, over 80% of the dataset comprises dust-affected scenes. By utilizing this dataset, we have established a benchmark for evaluating the performance of state-of-the-art 3D detection and segmentation algorithms. Additionally, we have analyzed the impact of dust on perception accuracy and delved into the causes of these effects. The data and further information can be accessed at: https://github.com/vincentweikey/LiDARDustX.

Abstract:
Semantic maps are fundamental for robotics tasks such as navigation and manipulation. They also enable yield prediction and phenotyping in agricultural settings. In this paper, we introduce an efficient and scalable approach for active semantic mapping in horticultural environments, employing a mobile robot manipulator equipped with an RGB-D camera. Our method leverages probabilistic semantic maps to detect semantic targets, generate candidate viewpoints, and compute the corresponding information gain. We present an efficient ray-casting strategy and a novel information utility function that accounts for both semantics and occlusions. The proposed approach reduces total runtime by 8 % compared to previous baselines. Furthermore, our information metric surpasses other metrics in reducing multiclass entropy and improving surface coverage, particularly in the presence of segmentation noise. Real-world experiments validate our method's effectiveness but also reveal challenges such as depth sensor noise and varying environmental conditions, requiring further research. https://github.com/jrcuaranv/nbv_planning.

Abstract:
Legged robots are physically capable of navigating a diverse variety of environments and overcoming a wide range of obstructions. For example, in a search and rescue mission, a legged robot could climb over debris, crawl through gaps, and navigate out of dead ends. However, the robot's controller needs to respond intelligently to such varied obstacles, and this requires handling unexpected and unusual scenarios successfully. This presents an open challenge to current learning methods, which often struggle with generalization to the long tail of unexpected situations without heavy human supervision. To address this issue, we investigate how to leverage the broad knowledge about the structure of the world and commonsense reasoning capabilities of vision-language models (VLMs) to aid legged robots in handling difficult, ambiguous situations. We propose a system, VLM-Predictive Control (VLM-PC), combining two key components that we find to be crucial for eliciting on-the-fly, adaptive behavior selection with VLMs: (1) in-context adaptation over previous robot interactions and (2) planning multiple skills into the future and replanning. We evaluate VLMPC on several challenging real-world obstacle courses, involving dead ends and climbing and crawling, on a Go1 quadruped robot. Our experiments show that by reasoning over the history of interactions and future plans, VLMs enable the robot to autonomously perceive, navigate, and act in a wide range of complex scenarios that would otherwise require environmentspecific engineering or human guidance.

Abstract:
Learning from demonstrations faces challenges in generalizing beyond the training data and often lacks collision awareness. This paper introduces Lan-o3dp, a language-guided object-centric diffusion policy framework that can adapt to unseen situations such as cluttered scenes, shifting camera views, and ambiguous similar objects while offering trainingfree collision avoidance and achieving a high success rate with few demonstrations. We train a diffusion model conditioned on 3D point clouds of task-relevant objects to predict the robot's end-effector trajectories, enabling it to complete the tasks. During inference, we incorporate cost optimization into denoising steps to guide the generated trajectory to be collisionfree. We leverage open-set segmentation to obtain the 3D point clouds of related objects. We use a large language model to identify the target objects and possible obstacles by interpreting the user's natural language instructions. To effectively guide the conditional diffusion model using a time-independent cost function, we proposed a novel guided generation mechanism based on the estimated clean trajectories. In the simulation, we showed that diffusion policy based on the object-centric 3D representation achieves a much higher success rate (68.7%) compared to baselines with simple 2D (39.3%) and 3D scene (43.6%) representations across 21 challenging RLBench tasks with only 40 demonstrations. In real-world experiments, we extensively evaluated the generalization in various unseen situations and validated the effectiveness of the proposed zeroshot cost-guided collision avoidance.

Abstract:
Microfluidic and pneumatic logic systems are valuable for applications such as lab-on-a-chip devices, soft robotics, and factory automation. These systems are particularly advantageous when metal or electronic components are impractical or when there are constraints on the control system volume or weight. This paper introduces a novel individual membrane valve that functions as a set-reset latch and can reduce the number of valves required for some pneumatic or microfluidic logic systems. An application of pneumatic logic systems in soft robotics is the access to multiple tethered pneumatic elements through a reduced number of pneumatic lines. To this end, this paper proposes two pneumatic logic systems capable of selecting among multiple distributed sets of pneumatic elements and operating the elements of the set simultaneously and independently through the different pneumatic lines. The selection is achieved via a sequence of pressure pulses applied on the same lines used afterwards for operation. Two prototypes of these pneumatic logic systems were built and successfully demonstrated, consisting primarily of set-reset membrane valves and powered by binary high/low pressure sources. The first prototype features a hierarchical network with four lines and five sets of three pneumatic elements each; the second prototype features a non-hierarchical network with five lines and twelve sets of four pneumatic elements each.

Abstract:
Deep reinforcement learning (DRL) methods have achieved remarkable success in solving static traveling salesman problems (TSP). However, dynamic TSP (DTSP), with the random appearance of new customers over time, introduces additional complexities that challenge DRL methods by the difficulty of obtaining optimized routing policy which lead to sub-optimal results and reduced training efficiency. To address these issues, we propose a decoupled training neural solver (DTNS) based on the encoder-decoder architecture, which is a novel approach that decouples the optimization of encoder and decoder, enhancing the model's ability to handle dynamic changes. Our method involves training under an Fore-Reveal condition first where the information of all customers nodes are known in advance to obtain optimized encoder and initialization for decoder and then fine-tuning the decoder in dynamic scenarios where dynamic customers are revealed over time. This training paradigm results in a flexible and globally optimized routing policy. Experimental results demonstrate that DTNS efficiently adapts to new customer requests in dynamic scenario, outperforming existing methods in dynamic routing environments.

Abstract:
Maximizing the endurance of unmanned aerial vehicles (UAVs) in large-scale monitoring missions spanning over large areas requires addressing their limited battery capacity. Deploying unmanned ground vehicles (UGVs) as mobile recharging stations offers a practical solution, extending UAVs' operational range. This introduces the challenge of optimizing UAV-UGV routes for efficient mission point coverage and seamless recharging coordination. In this paper, we present a risk-aware deep reinforcement learning (Ra-DRL) framework with a multi-head attention mechanism within an encoder-decoder transformer architecture to solve this cooperative routing problem for a UAV-UGV team. Our model minimizes mission time while accounting for the stochastic fuel consumption of the UAV, influenced by environmental factors like wind velocity, ensuring adherence to a risk threshold to avoid mid-mission energy depletion. Extensive evaluations on various problem sizes show that our method significantly outperforms nearest-neighbor heuristics in both solution quality and risk management. We validate the Ra-DRL policy in a Gazebo-ROS SITL environment with a PX4-based custom UAV and Clearpath Husky UGV. The results demonstrate the robustness and adaptability of our policy, making it highly effective for mission planning in dynamic, uncertain scenarios.

Abstract:
Frequency-modulated continuous-wave (FMCW) scanning radar has emerged as an alternative to spinning LiDAR for state estimation on mobile robots. Radar's longer wavelength is less affected by small particulates, providing operational advantages in challenging environments such as dust, smoke, and fog. This paper presents Radar Teach and Repeat (RT&R): a full-stack radar system for long-term off-road robot autonomy. RT&R can drive routes reliably in off-road cluttered areas without any GPS. We benchmark the radar system's closed-loop path-tracking performance and compare it to its 3D LiDAR counterpart. 11.8 km of autonomous driving was completed without interventions using only radar and gyro for navigation. RT&R was evaluated on four different routes with progressively less structured scene geometry. RT&R achieved lateral path-tracking root mean squared errors (RMSE) of 5.6 cm, 7.5 cm, and 12.1 cm as the routes became more challenging. These RMSE values are less than half of the width of one tire (24 cm) on our robot testing platform. These same routes have worst-case errors of 21.7 cm, 24.0 cm, and 43.8 cm. We conclude that radar is a viable alternative to LiDAR for long-term autonomy in challenging off-road scenarios. The implementation of RT&R is open-source and available at: https://github.com/utiasASRL/vtr3.

Abstract:
We present a novel framework for non-prehensile shape manipulation of deformable objects using Deep Reinforcement Learning. Unlike previous approaches that rely on grasping, our method employs a sequence of gentle pushing actions to deform objects into target shapes. We introduce a continuous parametrization of pushing actions that allows for precise control over pushing trajectories, enabling more flexible and efficient manipulation. The framework is applicable to a wide range of objects by representing them as sampled boundary coordinates, removing the need for predefined object partitions. Trained entirely in simulation, our controller demonstrates zero-shot transfer to real-world scenarios without additional training. Extensive evaluations show that our approach not only matches but substantially exceeds the performance of previous methods, while being more gentle and efficient. We demonstrate successful manipulation across various deformable objects and materials, including food items like salmon and pork loin. This work represents a significant advancement in robotic manipulation of deformable objects, with potential applications in food processing, manufacturing, and beyond.

Abstract:
Spiking neural networks (SNNs) show great potential in mapless navigation tasks due to their low power consumption, but the continuous representation of spatial information poses a challenge to SNN training. Neuroscience findings reveal that spatial cognition cells encode spatial information through population spike patterns. Inspired by this, we propose a navigation method based on SNNs, leveraging spatial cognition cells, which include grid cells (GCs), head direction cells (HDCs), and boundary vector cells (BVCs). Our method integrates spike-based information to achieve precise navigation goal encoding and egocentric environment perception, significantly improving SNN navigation capabilities in complex environments. Simulation and real-world experiments demonstrate that our method achieves significant improvements in navigation success rate and energy efficiency, showcasing superior adaptability across environments. Our work provides a novel approach to developing efficient brain-inspired navigation systems.

Abstract:
We present LuVo, an initialization-free stereo visual odometry (VO) method developed for the VIPER lunar rover. We provide a novel stereo registration method using LightGlue image feature matching in a warped, locally planar space that improves matching robustness to larger baseline stereo sequences and repetitive terrain that traditionally challenge odometry approaches. We additionally introduce methods that increase the usable image region for matching by estimating a horizon cutoff in image space and enhance robustness to stereo correspondence failures using a Manhattan distance search for valid stereo points during cloud alignment. We evaluate the performance of LuVo on a dataset of 155 simulated lunar stereo sequences and show that it significantly improves registration accuracy and success rates for clouds separated by both expected driving ranges below eight meters and longer distance translations of up to 16 meters. While LuVo is developed for VIPER, it can be used in other environments featuring slip-prone and repetitive terrain that limit rover travel.

Abstract:
This paper addresses rock hazard detection for in-situ resource utilization (ISRU) robotic navigation in the challenging visual environment of the lunar south pole (LSP). We evaluate three state-of-the-art instance segmentation mod-els-Mask R-CNN, YOLOv8, and SAM-using a novel, synthetically generated dataset that simulates LSP-specific illumination challenges at sun angles of 2.5^\circ, 5^\circ, and 7.5°. Additionally, we evaluate these approaches in both up and downsun driving with low solar angle light. This study highlights the potential of deep learning-based approaches for improving ISRU operations by reliably identifying visual surface hazards, such as rocks, which may impede robotic navigation and excavation in future lunar missions.

Abstract:
This paper introduces Quarry-Bot, a Reconfigurable Cable-Suspended Robot developed to support the NASA Artemis program's efforts in preparing for the long-term colonization of the Moon and Mars. Quarry-Bot autonomously clears debris on the lunar surface, a key step in site preparation for future habitats and infrastructure. The system utilizes active control strategies, combined with the Moon's lower gravity, to perform underhand rock tosses as a scalable approach to extraterrestrial site preparation. Its reconfigurable structure, including motorized anchor points and a lightweight tripod design, adjusts cable tensions to generate swing motions for debris displacement. The system is driven by two Dynamixel MX-106 motors for movement and steering, along with a NEMA 17 stepper motor for cable adjustments. Simulations and experiments conducted under both Earth and lunar gravity conditions demonstrate the effectiveness of Linear Quadratic Regulator (LQR) and Model Predictive Control (MPC) strategies in achieving rock throws. Quarry-Bot reaches swing angles and projects rocks over distances that may support lunar site clearing and overall engineering purposes. The paper concludes by discussing potential areas for further system refinement, including adjustments for different terrain conditions and improved actuation strategies for lunar missions.

Abstract:
In this paper, we present a novel, scalable approach for constructing open set, instance-level 3D scene representations, advancing open world understanding of 3D environments. Existing methods require pre-constructed 3D scenes and face scalability issues due to per-point feature representation, additionally struggle with contextual queries. Our method overcomes these limitations by incrementally building instancelevel 3D scene representations using 2D foundation models, and efficiently aggregating instance-level details such as masks, feature vectors, names, and captions. We introduce fusion schemes for feature vectors to enhance their contextual knowledge and performance on complex queries. Additionally, we explore large language models for robust automatic annotation and spatial reasoning tasks. We evaluate our proposed approach on multiple scenes from ScanNet [1] and Replica [2] datasets demonstrating zero-shot generalization capabilities, exceeding current state-of-the-art methods in open world 3D scene understanding. Project page: https://opensu3d.github.io/

Abstract:
With recent advancements in Large Language Models, task planning methods that interpret human commands have garnered significant attention. However, as home robots become more common, specifying every daily task could become impractical. This paper introduces a novel semantic map called the Task-Aware Semantic Map (TASMap), which enables robots to autonomously assign and propose necessary tasks in a scene without explicit human commands. The core innovation of this approach is the ability of TASMap to comprehend the context of objects within a scene and autonomously generate task proposals. This capability significantly advances autonomous robotic assistance, reducing the dependency on specific commands and enhancing interaction with environments. We present two key applications of TASMap: contextual task proposal and spatial task proposal. Our results, verified across 35 diverse and realistically disordered scenes, underscore the effectiveness of TASMap in both simulation and real-world environments.

Abstract:
For seamless robot navigation, it's vital to thoroughly understand multi-person scenes, which requires moving beyond simple tasks such as detection and tracking. Higherlevel tasks, such as understanding the interactions and social activities among individuals, are also crucial. Progress towards models that can fully understand scenes involving multiple people is hindered by a lack of sufficient annotated data for such high-level tasks. To address this challenge, we introduce Social-MAE, a simple yet effective transformer-based masked autoencoder framework for multi-person human motion data. The framework uses masked modeling to pre-train the encoder to reconstruct masked human joint trajectories, enabling it to learn generalizable representations of motion in human crowded scenes. Social-MAE comprises a transformer as the MAE encoder and a lighter-weight transformer as the MAE decoder which operates on multi-person joints' trajectory. After the reconstruction task, the MAE decoder is replaced with a task-specific decoder and the model is fine-tuned end-to-end for a variety of high-level social tasks. Our proposed model combined with our pre-training approach achieves the state-of-the-art results on various high-level social tasks, including multi-person pose forecasting, social grouping, and social action understanding. These improvements are demonstrated across four popular multi-person datasets encompassing both human 2D and 3D body pose.

Abstract:
Geo-localization is an essential component of Unmanned Aerial Vehicle (UAV) navigation systems to ensure precise absolute self-localization in outdoor environments. To address the challenges of GPS signal interruptions or low illumination, Thermal Geo-localization (TG) employs aerial thermal imagery to align with reference satellite maps to accurately determine the UAV's location. However, existing TG methods lack uncertainty measurement in their outputs, compromising system robustness in the presence of textureless or corrupted thermal images, self-similar or outdated satellite maps, geometric noises, or thermal images exceeding satellite maps. To overcome these limitations, this paper presents UASTHN, a novel approach for Uncertainty Estimation (UE) in Deep Homography Estimation (DHE) tasks for TG applications. Specifically, we introduce a novel Crop-based Test-Time Augmentation (CropTTA) strategy, which leverages the homography consensus of cropped image views to effectively measure data uncertainty. This approach is complemented by Deep Ensembles (DE) employed for model uncertainty, offering comparable performance with improved efficiency and seamless integration with any DHE model. Extensive experiments across multiple DHE models demonstrate the effectiveness and efficiency of CropTTA in TG applications. Analysis of detected failure cases underscores the improved reliability of CropTTA under challenging conditions. Finally, we demonstrate the capability of combining CropTTA and DE for a comprehensive assessment of both data and model uncertainty. Our research provides profound insights into the broader intersection of localization and uncertainty estimation. The code and models are publicly available.

Abstract:
Visual Odometry (VO) and Visual SLAM (VSLAM) systems often struggle in low-light and dark environments due to the lack of robust visual features. In this paper, we propose a novel active illumination framework to enhance the performance of VO and V-SLAM algorithms in these challenging conditions. The developed approach dynamically controls a moving light source to illuminate highly textured areas, thereby improving feature extraction and tracking. Specifically, a detector block, which incorporates a deep learning-based enhancing network, identifies regions with relevant features. Then, a pan-tilt controller is responsible for guiding the light beam toward these areas, so that to provide information-rich images to the ego-motion estimation algorithm. Experimental results on a real robotic platform demonstrate the effectiveness of the proposed method, showing a reduction in the pose estimation error up to 75 % with respect to a traditional fixed lighting technique.

Abstract:
Autonomous mobile robots (AMRs) equipped with high-quality cameras are revolutionizing the field of autonomous photography by delivering efficient and cost-effective methods for capturing dynamic visual content. As AMRs are deployed in increasingly diverse environments, the challenge of consistently producing high-quality photographic content remains. Traditional approaches often involve AMRs following a predetermined path while capturing data-intensive imagery, which can be suboptimal, especially in environments with limited connectivity or physical obstructions. These drawbacks necessitate intelligent decision-making to pinpoint optimal vantage points for image capture. Inspired by Next Best View studies, we propose a novel autonomous photography framework that enhances image quality and minimizes the number of photos needed. This framework incorporates a proposed evaluation metric that leverages ray-tracing and Gaussian process inter-polation, enabling the assessment of potential visual information from the target in partially known environments. A derivative-free optimization (DFO) method is then proposed to sample candidate views and identify the optimal viewpoint. The effectiveness of our approach is demonstrated by comparing it with existing methods and further validated through simulations and experiments with various vehicles. Note–Code and videos of the simulations and experiments are provided in the supplementary material and can be accessed at https://www.bezzorobotics.com/sg-lb-icra25.

Abstract:
3D human motion forecasting aims to enable autonomous applications. Estimating uncertainty for each prediction (i.e., confidence based on probability density or quantile) is essential for safety-critical contexts like human-robot collaboration to minimize risks. However, existing diverse motion fore-casting approaches struggle with uncertainty quantification due to implicit probabilistic representations hindering uncertainty modeling. We propose ProbHMI, which introduces invertible networks to parameterize poses in a disentangled latent space, enabling probabilistic dynamics modeling. A forecasting module then explicitly predicts future latent distributions, allowing effective uncertainty quantification. Evaluated on benchmarks, ProbHMI achieves strong performance for both deterministic and diverse prediction while validating uncertainty calibration, critical for risk-aware decision making.

Abstract:
The increasing integration of robots in close human environments necessitates robust safety measures that can adapt to evolving tasks and conditions. Current standards rely on task-specific safety evaluations that are often inflexible, requiring repeated assessments whenever task parameters change. This work proposes MonLog, a data-driven, probabilistic method to automatically derive safety curves (SCs) from recent injury protection data sets. By leveraging non-linear modeling techniques, our approach addresses the limitations of conventional linear SCs, which often result in overly conservative speed restrictions. We present a comprehensive test routine to validate our method, highlighting improvements in both compliance with safety constraints and operational efficiency. Our findings demonstrate that the proposed approach not only enhances safety but also optimizes robotic performance, making it suitable for a wide range of applications.

Abstract:
In robotic teleoperation, it is crucial to be able to dynamically adjust interactions with the environment. Drawing inspiration from human behavior during interactions, Variable Impedance Control (VIC) has been widely adopted to enhance robotic flexibility and adaptability. However, maintaining the passivity of such control systems remains a critical safety concern. This paper introduces an optimization-based framework for passive variable impedance control in bilateral teleoperation, combining the advantages of Passivity Filters (PFs), Time-Domain Passivity (TDP) control, and Passive-Set-Position-Modulation (PSPM). The method solves an optimization problem aimed at dissipating the energy that could lead to a lack of passivity. The proposed method is assessed through experiments, illustrating its ability to keep the teleoperation system passive and safe under a variable impedance profile.

Abstract:
Recent research on Simultaneous Localization and Mapping (SLAM) based on implicit representation has shown promising results in indoor environments. However, some challenges remain: the limited scene representation capability of implicit encoding, the uncertainty in the rendering process from implicit representations, and the disruption of consistency by dynamic objects. To address these challenges, we propose a dynamic visual SLAM system based on local-global fusion neural implicit representation, named DVN-SLAM. To improve the scene representation capability, we introduce a local-global fusion neural implicit representation that enables the construction of an implicit map while considering both global structure and local details. To tackle uncertainties arising from the rendering process, we design an information concentration loss for optimization, aiming to concentrate scene information on object surfaces. The proposed DVN-SLAM achieves competitive performance in localization and mapping across multiple datasets. More importantly, DVN-SLAM demonstrates robustness without semantic and optical flow prior in dynamic scenes, which sets it apart from other NeRF-based methods.

Abstract:
Current Simultaneous Localization and Mapping (SLAM) methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting excel in reconstructing static 3D scenes but struggle with tracking and reconstruction in dynamic environments, such as real-world scenes with moving elements. Existing NeRF-based SLAM approaches addressing dynamic challenges typically rely on RGB-D inputs, with few methods accommodating pure RGB input. To overcome these limitations, we propose Dy3DGS-SLAM, the first 3D Gaussian Splatting (3DGS) SLAM method for dynamic scenes using monocular RGB input. To address dynamic interference, we fuse optical flow masks and depth masks through a probabilistic model to obtain a fused dynamic mask. With only a single network iteration, this can constrain tracking scales and refine rendered geometry. Based on the fused dynamic mask, we designed a novel motion loss to constrain the pose estimation network for tracking. In mapping, we use the rendering loss of dynamic pixels, color, and depth to eliminate transient interference and occlusion caused by dynamic objects. Experimental results demonstrate that Dy3DGS-SLAM achieves state-of-the-art tracking and rendering in dynamic environments, outperforming or matching existing RGB-D methods.

Abstract:
Autonomous robots navigating in off-road terrain like forests open new opportunities for automation. While off-road navigation has been studied, existing work often relies on clearly delineated pathways. We present a method allowing for long-range planning, exploration and low-level control in unknown off-trail forest terrain, using vision and GPS only. We represent outdoor terrain with a topological map, which is a set of panoramic snapshots connected with edges containing traversability information. A novel traversability analysis method is demonstrated, predicting the existence of a safe path towards a target in an image. Navigating between nodes is done using goal-conditioned behavior cloning, leveraging the power of a pretrained vision transformer. An exploration planner is presented, efficiently covering an unknown off-road area with unknown traversability using a frontiers-based approach. The approach is successfully deployed to autonomously explore two 400 m2 forest sites unseen during training, in difficult conditions for navigation.

Abstract:
The rapid advancement of autonomous driving technology, particularly in autonomous trucking on highways, shows great value for enhancing efficiency and reducing costs in the logistics industry. In this work, we define the full-trip speed planning problem for autonomous trucks under delivery time and fuel consumption constraints, referred to as the Operational Speed Planning (OSP) problem. To support and accelerate research on the OSP problem, we have developed a comprehensive dataset using a fleet of over 400 trucks. The dataset contains rich, diverse information covering more than 22 million kilometers of real-world highway driving data. In addition to this static dataset, we have developed a closed-loop simulator that allows for the interactive evaluation of OSP solutions, enabling researchers to test speed planning strategies in a realistic environment. Furthermore, we provide an OSP baseline method based on dynamic programming to optimize speed planning, balancing the delivery time requirements and fuel consumption. Our extensive experiments demonstrate both the accuracy of the simulation and the effectiveness of the OSP baseline in planning optimal speeds, proving its capability to meet time constraints while improving fuel efficiency. The dataset, simulator, and baseline will be made publicly available to foster further research and innovation in this area.

Abstract:
Conventional human trajectory prediction models rely on clean curated data, requiring specialized equipment or manual labeling, which is often impractical for robotic applications. The existing predictors tend to overfit to clean observation affecting their robustness when used with noisy inputs. In this work, we propose MonoTransmotion (MT), a Transformerbased framework that uses only a monocular camera to jointly solve localization and prediction tasks. Our framework has two main modules: Bird's Eye View (BEV) localization and trajectory prediction. The BEV localization module estimates the position of a person using 2D human poses, enhanced by a novel directional loss for smoother sequential localizations. The trajectory prediction module predicts future motion from these estimates. We show that by jointly training both tasks with our unified framework, our method is more robust in real-world scenarios made of noisy inputs. We validate our MT network on both curated and non-curated datasets. On the curated dataset, MT achieves around 12% improvement over baseline models on BEV localization and trajectory prediction. On real-world non-curated dataset, experimental results indicate that MT maintains similar performance levels, highlighting its robustness and generalization capability. The code is available at https://github.com/vita-epfl/MonoTransmotion.

Abstract:
Depth estimation is a fundamental task in 3D geometry. While stereo depth estimation can be achieved through triangulation methods, it is not as straightforward for monocular methods, which require the integration of global and local information. The Depth from Defocus (DFD) method utilizes camera lens models and parameters to recover depth information from blurred images and has been proven to perform well. However, these methods rely on All-In-Focus (AIF) images for depth estimation, which is nearly impossible to obtain in real-world applications. To address this issue, we propose a self-supervised framework based on 3D Gaussian splatting and Siamese networks. By learning the blur levels at different focal distances of the same scene in the focal stack, the framework predicts the defocus map and Circle of Confusion (CoC) from a single defocused image, using the defocus map as input to DepthNet for monocular depth estimation. The 3D Gaussian splatting model renders defocused images using the predicted CoC, and the differences between these and the real defocused images provide additional supervision signals for the Siamese Defocus self-supervised network. This framework has been validated on both artificially synthesized and real blurred datasets. Subsequent quantitative and visualization experiments demonstrate that our proposed framework is highly effective as a DFD method.

Abstract:
The Robot Operating System (ROS) has become the de facto standard middleware in robotics, widely adopted across domains ranging from education to industrial applications. The RoboStack distribution, a conda-based packaging system for ROS, has extended ROS's accessibility by facilitating installation across all major operating systems and architectures, integrating seamlessly with scientific tools such as PyTorch and Open3D. This paper presents ROS2WASM, a novel integration of RoboStack with WebAssembly, enabling the execution of ROS 2 and its associated software directly within web browsers, without requiring local installations. ROS2WASM significantly enhances the reproducibility and shareability of research, lowers barriers to robotics education, and leverages WebAssembly's robust security framework to protect against malicious code. We detail our methodology for cross-compiling ROS 2 packages into WebAssembly, the development of a specialized middleware for ROS 2 communication within browsers, and the implementation of www.ros2wasm.dev, a web platform enabling users to interact with ROS 2 environments. Additionally, we extend support to the Robotics Toolbox for Python and adapt its Swift simulator for browser compatibility. Our work paves the way for unprecedented accessibility in robotics, offering scalable, secure, and reproducible environments that have the potential to transform educational and research paradigms.

Abstract:
The inherent flexibility and real-time deformation of breast tissue pose significant challenges for achieving full coverage and accurate lesion localization in autonomous breast ultrasound scanning. This paper introduces a robust finite state machine-based framework that mimics the decision-making process of an experienced physician, dynamically transitioning between the global breast scan and the fine lesion scan. An autonomous radial and anti-radial global scan pattern ensures comprehensive breast coverage. To avoid lesion misidentification caused by soft tissue movement, a real-time lesion fine scan method is proposed for lesion detection and localization. Experimental results demonstrate that the system in full coverage tests achieves 7 identified lesions out of 7 existing lesions and maintains a robust localization accuracy of \mathbf3. 2 3 ~ m m across phantoms with varying stiffnesses.

Abstract:
The small scale of urban farms and the commercial availability of low-cost robots (such as the FarmBot) that automate simple tending tasks enable an accessible platform for plant phenotyping. We have used a FarmBot with a custom camera end-effector to estimate strawberry plant flower pose (for robotic pollination) from acquired 3D point cloud models. We describe a novel algorithm that translates individual occupancy grids along orthogonal axes of a point cloud to obtain 2D images corresponding to the six viewpoints. For each image, 2D object detection models for flowers are used to identify 2D bounding boxes which can be converted into the 3D space to extract flower point clouds. Pose estimation is performed by fitting three shapes (superellipsoids, paraboloids and planes) to the flower point clouds and compared with manually labeled ground truth. Our method successfully finds approximately 80% of flowers scanned using our customized FarmBot platform and has a mean flower pose error of 7.7 degrees, which is sufficient for robotic pollination and rivals previous results. All code will be made available at https://github.com/harshmuriki/flowerPose.git.

Abstract:
Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self-recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor frame-work, which combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy as an actor to predict the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLbench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks and zero-shot unseen tasks, achieving superior performance in both simulated and real world environments. Videos and code are available at: https://rich-language-failure-recovery.github.io.

Abstract:
Semantic segmentation is a critical technique for effective scene understanding. Traditional RGB-T semantic segmentation models often struggle to generalize across diverse scenarios due to their reliance on pretrained models and predefined categories. Recent advancements in Visual Language Models (VLMs) have facilitated a shift from closedset to open-vocabulary semantic segmentation methods. However, these models face challenges in dealing with intricate scenes, primarily due to the heterogeneity between RGB and thermal modalities. To address this gap, we present Open-RGBT, a novel open-vocabulary RGB-T semantic segmentation model. Specifically, we obtain instance-level detection proposals by incorporating visual prompts to enhance category understanding. Additionally, we employ the CLIP model to assess image-text similarity, which helps correct semantic consistency and mitigates ambiguities in category identification. Empirical evaluations demonstrate that Open-RGBT achieves superior performance in diverse and challenging real-world scenarios, even in the wild, significantly advancing the field of RGB-T semantic segmentation. The project page of Open-RGBT is available at https://OpenRGBT.github.io/.

Abstract:
Online monocular 3D reconstruction has attracted widespread attention as it promotes the application of robots in interactive scenarios. Most existing methods focus on 1) real-time reconstruction, 2) accurate voxel featuring learning, and 3) effective voxel sparsification algorithm. To this end, 1) they adopt a coarse-to-fine pipeline, where all non-empty voxels are sent to the next level for refinement. However, this results in over-refinement of flat regions, leading to unnecessary computational overhead. Furthermore, 2) advanced methods focus on exploring view visibility but overlook the discriminability among visible views, which limits the representation of learned voxel features. Moreover, 3) existing sparsification algorithms struggle to distinguish detailed and empty voxels, resulting in either the loss of detailed voxels or the retention of empty voxels. To tackle these challenges, 1) we present Dynamic Detail Refinement (DDR) to allocate more voxels to detailed regions for refinement, which could alleviate the computational burden. Furthermore, 2) we propose Discriminability-Aware Fusion (DAF) to focus on discriminative views, which helps to capture accurate voxel features. In addition, 3) we propose Hierarchical Hybrid Sparsification (HHS) to balance global completeness and local refinement, which helps to preserve detailed voxels at hierarchical levels effectively. Extensive experiments conducted on the representative ScanNet (V2) and 7-Scenes datasets demonstrate the superiority of the proposed method.

Abstract:
Cooperative localization and target tracking are essential for multi-robot systems to implement high-level tasks. To this end, we propose a distributed invariant Kalman filter (KF) based on covariance intersection (CI) for effective multi-robot pose estimation. The paper utilizes the object-level measurement models, which have condensed information further reducing the communication burden. Besides, by modeling states on special Lie groups, and representing uncertainty in corresponding Lie algebras, better linearity and consistency are obtained under the invariant KF framework. We also combine CI and invariant KF to avoid overly confident or conservative estimates in multi-robot systems with intricate and unknown correlations, and some level of robot degradation is acceptable through multi-robot collaboration. The simulation and real data experiment validate the practicability and superiority of the proposed algorithm. The source code is publicly available 11https://github.com/LIAS-CUHKSZ/Distributed-object-based-SLAM.

Abstract:
Quadruped robots must exhibit robust walking capabilities in practical applications. In this work, we propose a novel approach that enables quadruped robots to pass various small obstacles, or “tiny traps”. Existing methods often rely on exteroceptive sensors, which can be unreliable for detecting such tiny traps. To overcome this limitation, our approach focuses solely on proprioceptive inputs. We introduce a two-stage training framework incorporating a contact encoder and a classification head to learn implicit representations of different traps. Additionally, we design a set of tailored reward functions to improve both the stability of training and the ease of deployment for goal-tracking tasks. To benefit further research, we design a new benchmark for tiny trap task. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness and robustness of our method. Videos and Appendix are on page: https://robust-robot-walker.github.io/

Abstract:
Multiple rotation averaging plays a crucial role in computer vision and robotics domains. The conventional optimization-based methods optimize a nonlinear cost function based on certain noise assumptions, while most previous learning-based methods require ground truth labels in the supervised training process. Recognizing the handcrafted noise assumption may not be reasonable in all real-world scenarios, this paper proposes an effective rotation averaging method for mining data patterns in a learning manner while avoiding the requirement of labels. Specifically, we apply deep matrix factorization to directly solve the multiple rotation averaging problem in free linear space. For deep matrix factorization, we design a neural network model, which is explicitly low-rank and symmetric to better suit the background of multiple rotation averaging. Meanwhile, we utilize a spanning tree-based edge filtering to suppress the influence of rotation outliers. What's more, we also adopt a reweighting scheme and dynamic depth selection strategy to further improve the robustness. Our method synthesizes the merit of both optimization-based and learning-based methods. Experimental results on various datasets validate the effectiveness of our proposed method.

Abstract:
Inertial tracking is vital for autonomous robots and has gained popularity with the ubiquity of low-cost Inertial Measurement Units (IMUs) and deep learning-powered tracking algorithms. Existing works, however, have not fully utilized IMU measurements, particularly magnetometers, nor maximized the potential of deep learning to achieve the desired accuracy. To bridge the gap, we introduce NeurIT, which employs a Time-Frequency Block-recurrent Transformer (TF-BRT) at its core, combining RNN and Transformer to learn both time-frequency representative features. To fully utilize IMU information, we strategically employ differentiation of body-frame magnetometers for orientation calibration in a sensor fusion manner. Experiments conducted in diverse environments show that NeurIT maintains a mere 1 -meter tracking error over a 300 - meter distance, surpassing state-of-the-art baselines by 48.21 % on unseen data. NeurIT also performs comparably to the visual-inertial approach (Tango Phone) in vision-favored conditions and surpasses it in plain environments. We share the code and data to promote further research: https://github.com/aiot-lab/NeurIT.

Abstract:
We propose an advanced method for controlling the motion of a manipulator robot with strict collision avoidance in dynamic environments, leveraging a velocity damper constraint. Unlike conventional distance-based constraints, which tend to saturate near obstacles to reach optimality, the velocity damper constraint considers both distance and relative velocity, ensuring a safer separation. This constraint is incorporated into a model predictive control framework and enforced as a hard constraint through analytical derivatives supplied to the numerical solver. The approach has been fully implemented on a Franka Emika Panda robot and validated through experimental trials, demonstrating effective collision avoidance during dynamic tasks and robustness to unmodeled disturbances. An efficient open-source implementation along examples are provided here: https://gepettoweb.laas.fr/articles/haffemayer2025.html.

Abstract:
Fast and efficient collision detection is essential for motion generation in robotics. In this paper, we propose an efficient collision detection framework based on the Signed Distance Field (SDF) of robots, seamlessly integrated with a self-collision detection module. Firstly, we decompose the robot's SDF using forward kinematics and leverage multiple extremely lightweight networks in parallel to efficiently approximate the SDF. Moreover, we introduce support vector machines to integrate the self-collision detection module into the framework, which we refer to as the SDF-SC framework. Using statistical features, our approach unifies the representation of collision distance for both SDF and self-collision detection. During this process, we maintain and utilize the differentiable properties of the framework to optimize collision-free robot trajectories. Finally, we develop a reactive motion controller based on our framework, enabling real-time avoidance of multiple dynamic obstacles. While maintaining high accuracy, our framework achieves inference speeds up to five times faster than previous methods. Experimental results on the Franka robotic arm demonstrate the effectiveness of our approach. Project page: https://sites.google.com/view/icra2025-sdfsc.

Abstract:
Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in interactive environments. A key challenge in EIF is compositional task planning, typically addressed through supervised learning or few-shot in-context learning with labeled data. To this end, we introduce the Socratic Planner, a self-QA-based zero-shot planning method that infers an appropriate plan without any further training. The Socratic Planner first facilitates self-questioning and answering by the Large Language Model (LLM), which in turn helps generate a sequence of subgoals. While executing the subgoals, an embodied agent may encounter unexpected situations, such as unforeseen obstacles. The Socratic Planner then adjusts plans based on dense visual feedback through a visuallygrounded re-planning mechanism. Experiments demonstrate the effectiveness of the Socratic Planner, outperforming current state-of-the-art planning models on the ALFRED benchmark across all metrics, particularly excelling in long-horizon tasks that demand complex inference. We further demonstrate its real-world applicability through deployment on a physical robot for long-horizon tasks.

Abstract:
When independently trained or designed robots are deployed in a shared environment, their combined actions can lead to unintended negative side effects (NSEs). To ensure safe and efficient operation, robots must optimize task performance while minimizing the penalties associated with NSEs, balancing individual objectives with collective impact. We model the problem of mitigating NSEs in a cooperative multi-agent system as a bi-objective lexicographic decentralized Markov decision process. We assume independence of transitions and rewards with respect to the robots' tasks, but the joint NSE penalty creates a form of dependence in this setting. To improve scalability, the joint NSE penalty is decomposed into individual penalties for each robot using credit assignment, which facilitates decentralized policy computation. We empirically demonstrate, using mobile robots and in simulation, the effectiveness and scalability of our approach in mitigating NSEs. Code: https://tinyurl.com/RECON-NSE-Mitigation

Abstract:
Transparent objects are common in daily life, while their optical properties pose challenges for RGB-D cameras to capture accurate depth information. This issue is further amplified when these objects are hand-held, as hand occlusions further complicate depth estimation. For assistant robots, however, accurately perceiving hand-held transparent objects is critical to effective human-robot interaction. This paper presents a Hand-Aware Depth Restoration (HADR) method based on creating an implicit neural representation function from a single RGB-D image. The proposed method utilizes hand posture as an important guidance to leverage semantic and geometric information of hand-object interaction. To train and evaluate the proposed method, we create a highfidelity synthetic dataset named TransHand- \mathbf1 4 K with a real-tosim data generation scheme. Experiments show that our method has better performance and generalization ability compared with existing methods. We further develop a real-world human-to-robot handover system based on HADR, demonstrating its potential in human-robot interaction applications.

Abstract:
Human-robot interaction (HRI) encompasses a wide range of collaborative tasks, with handover being one of the most fundamental. As robots become more integrated into human environments, the potential for service robots to assist in handing objects to humans is increasingly promising. In robot-to-human (R2H) handover, selecting the optimal grasp is crucial for success, as it requires avoiding interference with the human's preferred grasp region and minimizing intrusion into their workspace. Existing methods either inadequately consider geometric information or rely on data-driven approaches, which often struggle to generalize across diverse objects. To address these limitations, we propose a novel zero-shot system that combines semantic and geometric information to generate optimal handover grasps. Our method first identifies grasp regions using semantic knowledge from vision-language models (VLMs) and, by incorporating customized visual prompts, achieves finer granularity in region grounding. A grasp is then selected based on grasp distance and approach angle to maximize human ease and avoid interference. We validate our approach through ablation studies and real-world comparison experiments. Results demonstrate that our system improves handover success rates and provides a more user-preferred interaction experience. Videos, appendixes and more are available at https://sites.google.com/view/vlm-handover.

Abstract:
Ensuring the Safety of the Intended Functionality (SOTIF) for autonomous vehicles (AVs) is critical. Effective risk assessment helps AVs make decisions and avoid risks. However, existing methods face challenges due to environmental uncertainties, insufficient multi-dimensional risk quantification, and limited predictive accuracy. To address this challenge, we propose an uncertainty-aware probabilistic risk assessment framework that quantifies the risk of AVs violating safety constraints and calculates the expected average severity of such violations in uncertain environments. We first establish a general SOTIF risk model to characterize the static risk of the AV and surrounding traffic participants. Following this, we introduce a method for predicting dynamic uncertainty risks, resulting in probabilistic risk quantification. This framework accounts for multi-dimensional uncertainties and enhances safety under dynamic conditions. Extensive evaluations across typical traffic scenarios-including highways, intersections, and roundabouts-demonstrate that our method outperforms typical algorithms like Time Headway (THW) and Time-toCollision (TTC). Empirical studies in extreme scenarios further validate the framework's ability to reduce risks and improve system generalization. The related code is available at: https://github.com/idslab-autosec/risk_uncertainty.

Abstract:
Anomaly detection is critical in industrial manufacturing for ensuring product quality and improving efficiency in automated processes. The scarcity of anomalous samples limits traditional detection methods, making anomaly generation essential for expanding the data repository. However, recent generative models often produce unrealistic anomalies increasing false positives, or require real-world anomaly samples for training. In this work, we treat anomaly generation as a compositional problem and propose ComGEN, a component-aware and unsupervised framework that addresses the gap in logical anomaly generation. Our method comprises a multi-component learning strategy to disentangle visual components, followed by subsequent generation editing procedures. Disentangled text-to-component pairs, revealing intrinsic logical constraints, conduct attention-guided residual mapping and model training with iteratively matched references across multiple scales. Experiments on the MVTecLOCO dataset confirm the efficacy of ComGEN, achieving the best AUROC score of \mathbf9 1. 2 %. Additional experiments on the real-world scenario of Diesel Engine and widelyused MVTecAD dataset demonstrate significant performance improvements when integrating simulated anomalies generated by ComGEN into automated production workflows.

Abstract:
With the increasing demand for efficient and flexible robotic exploration solutions, Reinforcement Learning (RL) is becoming a promising approach in the field of autonomous robotic exploration. However, current RL-based exploration algorithms often face limited environmental reasoning capabilities, slow convergence rates, and substantial challenges in Sim-To-Real (S2R) transfer. To address these issues, we propose a Curriculum Learning-based Transformer Reinforcement Learning Algorithm (CTSAC) aimed at improving both exploration efficiency and transfer performance. To enhance the robot's reasoning ability, a Transformer is integrated into the perception network of the Soft Actor-Critic (SAC) framework, leveraging historical information to improve the farsightedness of the strategy. A periodic review-based curriculum learning is proposed, which enhances training efficiency while mitigating catastrophic forgetting during curriculum transitions. Training is conducted on the ROS-Gazebo continuous robotic simulation platform, with LiDAR clustering optimization to further reduce the S2R gap. Experimental results demonstrate the CTSAC algorithm outperforms the state-of-the-art non-learning and learning-based algorithms in terms of success rate and success rate-weighted exploration time. Moreover, real-world experiments validate the strong S2R transfer capabilities of CTSAC.

Abstract:
Safe navigation is essential for autonomous systems operating in hazardous environments. Traditional planning methods are effective for solving long-horizon tasks but depend on the availability of a graph representation with prede-fined distance metrics. In contrast, safe Reinforcement Learning (RL) is capable of learning complex behaviors without relying on manual heuristics but fails to solve long-horizon tasks, particularly in goal-conditioned and multi-agent scenarios. In this paper, we introduce a novel method that integrates the strengths of both planning and safe RL. Our method leverages goal-conditioned RL (GCRL) and safe RL to learn a goal-conditioned policy for navigation while concurrently estimating cumulative distance and safety levels using learned value functions via an automated self-training algorithm. By constructing a graph with states from the replay buffer, our method prunes unsafe edges and generates a waypoint-based plan that the agent then executes by following those waypoints sequentially until their goal locations are reached. This graph pruning and planning approach via the learned value functions allows our approach to flexibly balance the trade-off between faster and safer routes especially over extended horizons. Utilizing this unified high-level graph and a shared low-level safe GCRL policy, we extend this approach to address the multi-agent safe navigation problem. In particular, we leverage Conflict-Based Search (CBS) to create waypoint-based plans for multiple agents allowing for their safer navigation over extended horizons. This integration enhances the scalability of goal-conditioned safe RL in multi-agent scenarios, enabling efficient coordination among agents. Extensive benchmarking against state-of-the-art baselines demonstrates the effectiveness of our method in achieving distance goals safely for multiple agents in complex and hazardous environments. Our code and further details about or work is available at https://safe-visual-mapf-mers.csail.mit.edu/.

Abstract:
Improving data utilization, especially for imperfect data from task failures, is crucial for robotic manipulation due to the challenging, time-consuming, and expensive data collection process in the real world. Current imitation learning (IL) typically discards imperfect data, focusing solely on successful expert data. While reinforcement learning (RL) can learn from explorations and failures, the sim2real gap and its reliance on dense reward and online exploration make it difficult to apply effectively in real-world scenarios. In this work, we aim to conquer the challenge of leveraging imperfect data without the need for reward information to improve the model performance for robotic manipulation in an offline manner. Specifically, we introduce a Self-Supervised Data Filtering framework (SSDF) that combines expert and imperfect data to compute quality scores for failed trajectory segments. High-quality segments from the failed data are used to expand the training dataset. Then, the enhanced dataset can be used with any downstream policy learning method for robotic manipulation tasks. Extensive experiments on the ManiSkill2 benchmark built on the high-fidelity Sapien simulator and real-world robotic manipulation tasks using the Franka robot arm demonstrated that the SSDF can accurately expand the training dataset with high-quality imperfect data and improve the success rates for all robotic manipulation tasks.

Abstract:
In Advanced Driver Assistance Systems (ADAS), environmental perception and object detection are crucial for ensuring safe autonomous driving. Single-modality systems often struggle under adverse weather conditions, underscoring the need for multi-modal approaches. Current fusion methods typically rely on simplistic concatenation of multi-modal features, which neglects semantic alignment and does not fully exploit inter-modal correlations. This paper proposes a crossattention feature fusion specifically designed to enhance the global correlation between camera and radar features. By dynamically adjusting feature weights through cross-attention, our approach significantly improves feature integration. Furthermore, we propose a depth-weighted voting fusion strategy to select the most accurate sensor depth, thereby enhancing decision-making stability. Experimental results on the nuScenes dataset show substantial improvements, with mean Average Precision (mAP) of 0.399 and mean Average Translation Error (mATE) of 0.602, highlighting the effectiveness of our approach in enhancing the robustness and accuracy of multi-modal fusion.

Abstract:
Test-Time Adaptation (TTA) adjusts pre-trained models in unlabeled unseen environments during the test phase, making it more practical for robotic applications. However, the constant changes of the physical world create significant domain gaps between the received data during robot deployment and the source data used for training. In addition, existing methods mainly focus on a single modality, e.g., RGB images, limiting the application of these methods in multi-modality input scenarios. In this work, we propose a Deep Multi-modality Aggregation Test-time Adaptation (DMATA) method to address the above mentioned issues. To prevent the domain shifts from disrupting the adaptation process, we first propose a Momentum-based Teacher-Student (MTS) framework. Since the teacher model and the student model contain complementary information, we design an Uncertainty-Guide (UG) feature fusion block to fuse the representation of the teacher model and student model of each modality. Finally, we introduce a 3D-Guide-2D (3G2) feature fusion block to leverage spatial information for enhancing 2D feature representation. Extensive experiments across three scenarios, including sensor-to-sensor, day-to-night, and city-to-city, demonstrate the effectiveness of our method in TTA multi-modality semantic segmentation tasks. Notably, under the scenario of sensor-to-sensor adaptation, our proposed DMATA obtains an m IoU of 54.2%, which is superior to the state-of-the-art test-time adaptation method.

Abstract:
LiDAR point clouds 3D semantic segmentation enables efficient and accurate environmental sensing for intelligent vehicles and autonomous robots, greatly advancing these domains. Existing advanced methods that use 3D sparse convolutional often suffer from a small Effective Receptive Field (ERF), which limits context sensing and challenging highperformance segmentation. Building on this observation, we propose MDC-Seg for efficient ERF enlargement. We design Multi-directional Convolution (MDConv), which simultaneously performs sparse feature encoding on the Bird's Eye View (BEV) and Range View (RV) planes to enlarge the ERF of 3D sparse convolution. To enhance feature fusion in MDConv, we introduce an attention mechanism and design an efficient multifeature fusion (EMFF) module suitable for both 3D and 2D sparse features. To improve segmentation accuracy, we design a point-voxel constraint (PVC) module to handle edge voxels containing multiple point cloud categories, optimizing the final inference results. These modules add minimal memory and inference time but significantly improve performance compared to the baseline. Extensive experiments on the SemanticKITTI benchmark demonstrate MDC-Seg's excellent performance, with supplementary tests on nuScenes further confirming its superiority by yielding good results. The source code is available at https://github.com/OYgreat-river/MDC-Seg.

Abstract:
D object detection plays a critical role in advancing autonomous driving technology. To improve perception capabilities while maintaining low costs and ensuring performance in adverse weather conditions, 4D radar has emerged as a promising alternative for 3D object detection. However, current methods fail to fully exploit raw data and density information of 4 D radar point clouds to tackle challenges like sparse data and noise. To address these limitations and make use of the unique Doppler velocity information provided by 4D radar, we propose a novel approach called 4DRadDet, which uses cross-attention fusion with cluster-queried techniques for 3D object detection. The 4DRadDet model uses a specially designed incremental clustering method to cluster potential object point clouds, reducing measurement errors from limited radar angular resolution and signal multipath effects. The cross-attention feature fusion (CAFF) module enhances network performance by querying the clustered point cloud feature map, allowing the network to leverage reliable prior information from the clustered point cloud to better detect potential objects. Our experimental evaluations on the View-of-Delft (VoD) dataset demonstrate the effectiveness of 4DRadDet, showcasing state-of-the-art performance. Specifically, 4DRadDet achieves a 3D mean average precision (\textmAP_3 \mathrmD) of 51.44 % and a bird'seye view mean average precision (\mathbfm A P_\text BEV ) of \mathbf5 7. 0 7 %. Our proposed method demonstrates impressive inference times and achieves real-time detection capabilities.

Abstract:
Scalar field features such as extrema, contours, and saddle points are essential for applications in environmental monitoring, search and rescue, and resource exploration. Traditional navigation methods often rely on predefined trajectories, leading to inefficient and resource-intensive mapping. This paper introduces a new adaptive navigation framework that leverages learning techniques to enhance exploration efficiency and effectiveness in scalar fields, even under noisy data and obstacles. The framework employs Partial Differential Equations to model scalar fields and a Gaussian Process Regressor to estimate the fields and their gradients, enabling real-time path adjustments and obstacle avoidance. We provide a theoretical foundation for the approach and address several limitations found in existing methods. The effectiveness of our framework is demonstrated through simulation benchmarks and field experiments with an Autonomous Surface Vehicle, showing improved efficiency and adaptability compared to traditional methods and offering a robust solution for real-time environmental monitoring.

Abstract:
Tactile information is a critical tool for dexterous manipulation. As humans, we rely heavily on tactile information to understand objects in our environments and how to interact with them. We use touch not only to perform manipulation tasks but also to learn how to perform these tasks. Therefore, to create robotic agents that can learn to complete manipulation tasks at a human or super-human level of performance, we need to properly incorporate tactile information into both skill execution and skill learning. In this paper, we investigate how we can incorporate tactile information into imitation learning platforms to improve performance on manipulation tasks. We show that incorporating visuo-tactile pretraining improves imitation learning performance, not only for tactile agents (policies that use tactile information at inference), but also for non-tactile agents (policies that do not use tactile information at inference). For these non-tactile agents, pretraining with tactile information significantly improved performance (for example, improving the accuracy on USB plugging from 20% to 85%), reaching a level on par with visuo-tactile agents, and even surpassing them in some cases. For demonstration videos and access to our codebase, see the project website: https://sites.google.com/andrew.cmu.edu/visuo-tactile-pretraining

Abstract:
Hybrid aerial underwater vehicles (HAUVs) exhibit significant potential due to their ability to operate seamlessly in the air and water domains. However, balancing rapid maneuverability in both media and achieving stability during the cross-domain phase remains a significant challenge. Inspired by the retractable limbs of a tortoise, this paper presented a novel morphing HAUV, Nezha-MB. Nezha-MB utilizes linear actuators combined with a gear and rack system for arm transformation during the transition phase, replacing conventional servos. The transformation mechanism accounts for 11 % of the total weight. In aerial mode, Nezha-MB exhibits flight performance comparable to a quadrotor configuration. In underwater mode, NezhaMB retracts its quadrotor arms into a bullet-shaped shell, significantly reducing drag and energy consumption, while enabling passage through narrow gaps with diameters as small as 134 mm. Simulations and field tests conducted in both aerial and underwater domains demonstrate that NezhaMB combines the swift maneuverability of a streamlined underwater vehicle with the stability of a traditional multirotor vehicle, highlighting its robust and rapid cross-domain capabilities.

Abstract:
This work deals with the problem of unlocking perpetual deployment capabilities for small-UAS robotics across the diverse settings of the real world and their challenges, encompassing considerations for marine environments alongside the more common terrestrial ones. Via the progress made within this scope, a step towards truly ubiquitous and selfsustainable aerial robotics is accomplished. The work consists of the development of the Gannet Solar-VTOL, a waterproof small-UAS that is capable of resting on the surface of water for prolonged periods of time and over varying temperature ranges, while harvesting solar power to recharge itself. Equally importantly, it integrates a field-proven Self-Sustainable Autonomous System architecture that allows it to hibernate and sustain its battery charge overnight or during periods of solar illumination scarcity, as well as to assess mission-critical parameters (e.g., water surface turbulence, ambient temperature of battery compartment) on the low-power side of the Power Management Stack, and react appropriately. Finally, the robot is equipped with an onboard camera and a Neural Processing Unit that allows it to perform in-field environmental monitoring operations (e.g., wildfire detection). This paper experimentally demonstrates the aforementioned capabilities, and concludes with a presentation of the amphibious small-UAS' long-term deployment within a marine environment in the N. Nevada region, spanning over 3 consecutive days.

Abstract:
Collaborative perception, fusing information from multiple agents, can extend perception range so as to improve perception performance. However, temporal asynchrony in real-world environments, caused by communication delays, clock misalignment, or sampling configuration differences, can lead to information mismatches. If this is not well handled, then the collaborative performance is patchy, and what's worse safety accidents may occur. To tackle this challenge, we propose CoDynTrust, an uncertainty-encoded asynchronous fusion perception framework that is robust to the information mismatches caused by temporal asynchrony. CoDynTrust generates dynamic feature trust modulus (DFTM) for each region of interest by modeling aleatoric and epistemic uncertainty as well as selectively suppressing or retaining single-vehicle features, thereby mitigating information mismatches. We then design a multi-scale fusion module to handle multi-scale feature maps processed by DFTM. Compared to existing works that also consider asynchronous collaborative perception, CoDynTrust combats various low-quality information in temporally asynchronous scenarios and allows uncertainty to be propagated to downstream tasks such as planning and control. Experimental results demonstrate that CoDynTrust significantly reduces performance degradation caused by temporal asynchrony across multiple datasets, achieving state-of-the-art detection performance even with temporal asynchrony. The code is available at https://github.com/CrazyShout/CoDynTrust.

Abstract:
This paper presents a study on the infield self-calibration of two rigidly connected IMUs' intrinsic parameters, without the aid of any external sensors, equipment, or specialized procedures. Specifically, we consider the calibration of gyroscope biases, gyroscope scale factors, and accelerometer biases, using only IMU data and known extrinsics between the two IMUs. We focus on the observability analysis of this system, and show that all gyroscope intrinsic parameters and a portion of accelerometer biases are observable, with information from both IMUs and sufficient motion. Moreover, we identify the additional unobservable directions in the intrinsic parameters that arise from various degenerate motions. Finally, we validate our observability findings through numerical simulations, and assess our system's calibration accuracy using real-world data.

Abstract:
In this paper we address the simultaneous collision detection and force estimation problem for quadrupedal locomotion using joint encoder information and the robot dynamics only. We design an interacting multiple-model Kalman filter (IMM-KF) that estimates the external force exerted on the robot and multiple possible contact modes. The method is invariant to any gait pattern design. Our approach leverages pseudo-measurement information of the external forces based on the robot dynamics and encoder information. Based on the estimated contact mode and external force, we design a reflex motion and an admittance controller for the swing leg to avoid collisions by adjusting the leg's reference motion. Additionally, we implement a force-adaptive model predictive controller to enhance balancing. Simulation ablatation studies and experiments show the efficacy of the approach.

Abstract:
In critical applications, including search-and-rescue in degraded environments, blockages can be prevalent and prevent the effective deployment of certain sensing modalities, particularly vision, due to occlusion and the constrained range of view of onboard camera sensors. To enable robots to tackle these challenges, we propose a new approach, Proprioceptive Obstacle Detection and Estimation while navigating in clutter (PROBE), which instead relies only on the robot's proprioception to infer the presence or absence of occluded rectangular obstacles while predicting their dimensions and poses in SE (2). The proposed approach is a Transformer neural network that receives as input a history of applied torques and sensed whole-body movements of the robot and returns a parameterized representation of the obstacles in the environment. The effectiveness of PROBE is evaluated on simulated environments in Isaac Gym and with a real Unitree Go1 quadruped robot. The project webpage can be found at https://dhruvmetha.github.io/legged-probe/.

Abstract:
With the rapid advancements in autonomous driving, accurately predicting pedestrian behavior has become essential for ensuring safety in complex and unpredictable traffic conditions. The growing interest in this challenge highlights the need for comprehensive datasets that capture unstructured environments, enabling the development of more robust prediction models to enhance pedestrian safety and vehicle navigation. In this paper, we introduce an Indian driving pedestrian dataset designed to address the complexities of modeling pedestrian behavior in unstructured environments, such as illumination changes, occlusion of pedestrians, unsignalized scene types and vehicle-pedestrian interactions. The dataset provides high-level and detailed low-level comprehensive annotations focused on pedestrians requiring the ego-vehicle's attention. Evaluation of the state-of-the-art intention prediction methods on our dataset shows a significant performance drop of up to 15 %, while trajectory prediction methods underperform with an increase of up to 1208 MSE, defeating standard pedes-trian datasets. Additionally, we present exhaustive quantitative and qualitative analysis of intention and trajectory baselines. We believe that our dataset will open new challenges for the pedestrian behavior research community to build robust models. Project Page: https://cvit.iiit.ac.in/research/projects/cvit-projects/iddped

Abstract:
Craniosynostosis involves premature fusion of the cranial sutures resulting in abnormal skull morphology and elevated intracranial pressure. Surgical intervention is necessary to correct the skull shape and to allow for unrestricted brain growth. This study presents a novel snake robot designed for minimally invasive cranial osteotomies featuring two articulating bending segments. The end-effector comprises a bone-punch for bone-cutting, a dural and scalp retractor, as well as channels for an endoscope and an instrument. The robot's bending mechanism is driven by tendons and utilizes geared linkages to facilitate a smooth curved shape. Pre-tensioned antagonistic tendons allow the robot to modulate its stiffness to adapt to external loads. A follow-the-leader algorithm was implemented to guide the robot along a skull cutting path. Experimental results demonstrated that at maximum bending of 60^\circ for segment 1 and 90^\circ for segment 2 there was a 15.9^\circ and 11.5^\circ error, respectively. Position errors ranged from 2.5 to 21.5 mm when tracing a curved path. The tool increased stiffness with tendon pre-tensioning from 20–100 N during bent configurations q_1 and q_2 for segments 1 and 2, respectively, at [q_1,q_2]=[0^\mathrmo,30^\mathrmo] and [30^\circ,60^\circ]. Tip deflection reduced from 0.42 to 0.03 cm and 0.37 to 0.10 cm during axial loading and from 11.40 to 3.88 cm and 3.62 to 0.48 cm during radial loading for each configuration, respectively. Ex vitro trials demonstrated the robots ability to perform simulated osteotomies on skull models to 68–73% of desired path lengths with a maximum deviation of 8 mm.

Abstract:
An open problem in industrial automation is to reliably perform tasks requiring in-contact movements with complex workpieces, as current solutions lack the ability to seamlessly adapt to the workpiece geometry. In this paper, we propose a Learning from Demonstration approach that allows a robot manipulator to learn and generalise motions across complex surfaces by leveraging differential mathematical operators on discrete manifolds to embed information on the geometry of the workpiece extracted from triangular meshes, and extend the Dynamic Movement Primitives (DMPs) framework to generate motions on the mesh surfaces. We also propose an effective strategy to adapt the motion to different surfaces, by introducing an isometric transformation of the learned forcing term. The resulting approach, namely MeshDMP, is evaluated both in simulation and real experiments, showing promising results in typical industrial automation tasks like car surface polishing.

Abstract:
Current LiDAR-based Vehicle-to-Everything (V2X) multi-agent perception systems have shown the significant success on 3D object detection. While these models perform well in the trained clean weather, they struggle in unseen adverse weather conditions with the domain gap. In this paper, we propose a Domain Generalization based approach, named V2X-DGW, for LiDAR-based 3D object detection on multi-agent perception system under adverse weather conditions. Our research aims to not only maintain favorable multi-agent performance in the clean weather but also promote the performance in the unseen adverse weather conditions by learning only on the clean weather data. To realize the Domain Generalization, we first introduce the Adaptive Weather Augmentation (AWA) to mimic the unseen adverse weather conditions, and then propose two alignments for generalizable representation learning: Trust-region Weatherinvariant Alignment (TWA) and Agent-aware Contrastive Alignment (ACA). To evaluate this research, we add Fog, Rain, Snow conditions on two publicized multi-agent datasets based on physics-based models, resulting in two new datasets: OPV2V-w and V2XSet-w. Extensive experiments demonstrate that our V2X-DGW achieved significant improvements in the unseen adverse weathers. The code is available at https://github.com/Baolu1998/V2X-DGW.

Abstract:
In this paper, we propose a non-parametric method for estimating the posterior distribution of global positioning satellite systems (GNSS) integer ambiguity. It is difficult to estimate the posterior probability of discrete integer ambiguities directly from carrier phase observations due to the unclear domain definition. We thus introduce a positional likelihood field that accumulates the ambiguity function method values in the position space and then estimate the integer ambiguity distributions by marginalizing the likelihood over the entire position. Defining the positional likelihood field in the position space facilitates carrier phase likelihood accumulation. To correctly estimate the posterior distribution, however, a sufficient density of samples is required, which results in a large computational cost. The proposed method enables large-scale sampling by taking advantage of GPU parallel processing. Experimental results demonstrate that the proposed method enables accurate and robust estimation of integer ambiguity distributions, contributing to improved centimeter-level position estimation accuracy. In addition, the histograms provide quantitative evidence of events in urban environments where integer ambiguity is not uniquely determined.

Abstract:
Realistic crowd simulations are essential for immersive virtual environments, relying on both individual behaviors (microscopic dynamics) and overall crowd patterns (macroscopic characteristics). While recent data-driven methods like deep reinforcement learning improve microscopic realism, they often overlook critical macroscopic features such as crowd density and flow, which are governed by spatio-temporal spawn dynamics, namely, when and where agents enter a scene. Traditional methods, like random spawn rates, stochastic processes, or fixed schedules, are not guaranteed to capture the underlying complexity or lack diversity and realism. To address this issue, we propose a novel approach called nTPP-GMM that models spatio-temporal spawn dynamics using Neural Temporal Point Processes (nTPPs) that are coupled with a spawn-conditional Gaussian Mixture Model (GMM) for agent spawn and goal positions. We evaluate our approach by orchestrating crowd simulations of three diverse real-world datasets with nTPP-GMM. Our experiments demonstrate the orchestration with nTPP-GMM leads to realistic simulations that reflect real-world crowd scenarios and allow crowd analysis.

Abstract:
Visual-inertial odometry (VIO), which fuses noisy inertial readings and camera measurements to provide 3D motion tracking, is a foundational component in many autonomous applications. With the increasing use of next-generation edge devices (e.g., AR/VR devices, nano drones, and mobile robotics) that are constrained by limited power, resources, and multitasking demands, balancing computational efficiency and accuracy in VIO estimators has become more critical than ever. Historically, state estimation algorithms have been developed using either optimization or filtering-based methods, with the key distinction being the ability to relinearize measurements and correct state estimates iteratively. It has been widely claimed that iterative methods improve accuracy by allowing for the reduction of error through relinearization at a higher computational demand. Conversely, filtering methods are more efficient but may suffer from significant linearization errors. However, these trade-offs have not been thoroughly examined in the context of visual-inertial motion tracking. In this paper, we conduct the first comprehensive study on the impact of iterative algorithms in sliding-window VIO. We analyze the relinearization of IMU and camera measurements separately, providing insights into how each affects system performance. By considering key factors such as system observability and measurement processes, we offer a deeper understanding of VIO estimator behavior. Our findings, backed by real-world tests, offer practical guidelines for balancing accuracy and efficiency, helping practitioners determine when to prioritize iterative methods or simpler filtering approaches while encouraging researchers and engineers to rethink VIO design for optimal resource allocation.

Abstract:
Recent Vision-based Large Language Models (VisionLLMs) for autonomous driving have seen rapid advancements. However, such promotion is extremely dependent on large-scale high-quality annotated data, which is costly and labor-intensive. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner. Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions that create pseudo-answers for the unlabeled data based on a model trained with limited labeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-theart methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets. In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27 % when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark.

Abstract:
Connected autonomous vehicles (CAVs) can provide benefits over individual vehicles for precise navigation, especially in GNSS-denied environments. CAV collaboration can enhance estimation accuracy, but the safety of collaborative localization in the presence of undetected sensor faults remains underexplored. This paper introduces an integrity monitoring method for CAV collaborative localization in both centralized and decentralized implementations. Fault models for landmark and relative measurements are described, and the probability of hazardous misleading information, or integrity risk, is derived. Simulation and experimental results for notional two-CAV scenarios indicate that collaborative localization reduces integrity risk and enhances navigation safety.

Abstract:
This article presents a novel, ultralight tree planting mechanism for use on an aerial vehicle. Current tree planting operations are typically performed manually, and existing automated solutions use large land-based vehicles or excavators which cause significant site damage and are limited to open, clear-cut plots. Our device uses a high-pressure compressed air power system and a novel double-telescoping design to achieve a weight of only 8 kg: well within the payload capacity of medium to large drones. This article describes the functionality and key components of the device and validates its feasibility through experimental testing. We propose this mechanism as a cost-effective, highly scalable solution that avoids ground damage, produces minimal emissions, and can operate equally well on open clear-cut sites as in denser, selectively-harvested forests.

Abstract:
We propose a multi-sensor fusion pipeline for multiple object tracking in autonomous surface vessels using lidar and camera data. Our approach follows the tracking-by-detection paradigm, leveraging the precision of lidar for accurate state estimation and camera data for robust association. The method addresses issues with false tracks from lidar returns by suppressing non-moving objects on the basis of optical flow. We compare the proposed pipeline against prior work, particularly in the use of lidar and stereo cameras as depth modalities, demonstrating its effectiveness in improving tracking performance.

Abstract:
While most people associate LiDAR primarily with its ability to measure distances and provide geometric information about the environment (via point clouds), LiDAR also captures additional data, including reflectivity or intensity values. Unfortunately, when LiDAR is applied to Place Recognition (PR) in mobile robotics, most previous works on LiDAR-based PR rely only on geometric measurements, neglecting the additional reflectivity information that LiDAR provides. In this paper, we propose a novel descriptor for 3D PR, named RE-TRIP (REflectivity-instance augmented TRIangle descriPtor). This new descriptor leverages both geometric measurements and reflectivity to enhance robustness in challenging scenarios such as geometric degeneracy, high geometric similarity, and the presence of dynamic objects. To implement RE-TRIP in real-world applications, we further propose (1) keypoint extraction method, (2) key instance segmentation method, (3) RE-TRIP matching method, and (4) reflectivity combined loop verification method. Finally, we conduct a series of experiments to demonstrate the effectiveness of RE-TRIP. Applied to public datasets (i.e., HELIPR, FusionPortable) containing diverse scenarios-including long corridors, bridges, large-scale urban areas, and highly dynamic environments-our experimental results show that the proposed method outperforms existing state-of-the-art methods in terms of Scan Context, Intensity Scan Context and STD. Our code is available at: https://github.com/pycS714IRE-TRIP.

Abstract:
External ocean disturbances (EODs) and internal thruster loss-of-effectiveness faults (ITLEFs) are key factors influencing the accuracy of the autonomous tugboat's path following, as well as the stability and safety of the tugboat's hull during maritime operations. To achieve robust path following for the autonomous tugboat, this paper proposes an optimal fault-tolerant control scheme. Firstly, we formulate the robust path following of the tugboat as an optimal fault-tolerance control problem. A matrixed error system for the control scheme is constructed to uniformly consider both EODs and ITLEFs. Secondly, considering the time and economic costs associated with algorithm deployment and tuning process on tugboats in real world, we present an adaptive dynamic programming algorithm to solve the proposed optimal fault-tolerance problem, which is characterized by ease of tuning. Then, the stability of the control system is proven based on the Lyapunov criterion. Finally, the proposed control scheme is evaluated under practical conditions with EODs and ITLEFs. The comparative results with backstepping-based control scheme demonstrate that the proposed control scheme exhibits more robustness for path following under EODs and ITLEFs.

Abstract:
Common approaches to providing feedback in reinforcement learning are the use of hand-crafted rewards or full-trajectory expert demonstrations. Alternatively, one can use examples of completed tasks, but such an approach can be extremely sample inefficient. We introduce value-penalized auxiliary control from examples (VPACE), an algorithm that significantly improves exploration in example-based control by adding examples of simple auxiliary tasks and an above-success-level value penalty. Across both simulated and real robotic environments, we show that our approach substantially improves learning efficiency for challenging tasks, while maintaining bounded value estimates. Preliminary results also suggest that VPACE may learn more efficiently than the more common approaches of using full trajectories or true sparse rewards. Project site: https://papers.starslab.ca/vpace/.

Abstract:
Autonomous navigation in unstructured outdoor environments is inherently challenging due to the presence of asymmetric traversal costs, such as varying energy expenditures for uphill versus downhill movement. Traditional reinforcement learning methods often assume symmetric costs, which can lead to suboptimal navigation paths and increased safety risks in realworld scenarios. In this paper, we introduce QuasiNav, a novel reinforcement learning framework that integrates quasimetric embeddings to explicitly model asymmetric costs and guide efficient, safe navigation. QuasiNav formulates the navigation problem as a constrained Markov decision process (CMDP) and employs quasimetric embeddings to capture directionally dependent costs, allowing for a more accurate representation of the terrain. We combine this approach with adaptive constraint tightening. This ensures that safety constraints are dynamically enforced during learning. We validate QuasiNav on a Clearpath Jackal robot in three challenging navigation scenarios-undulating terrains, asymmetric hill traversal, and directionally dependent terrain traversal-demonstrating its effectiveness in both simulated and real-world environments. Experimental results show that QuasiNav significantly outperforms conventional methods, achieving higher success rates, improved energy efficiency (13.6 % reduction in energy consumption compared to baseline methods), and better adherence to safety constraints.

Abstract:
Visual navigation in robotics traditionally relies on globally-consistent 3D maps or learned controllers, which can be computationally expensive and difficult to generalize across diverse environments. In this work, we present a novel RGB-only, object-level topometric navigation pipeline that enables zero-shot, long-horizon robot navigation without requiring 3D maps or pre-trained controllers. Our approach integrates global topological path planning with local metric trajectory control, allowing the robot to navigate towards object-level sub-goals while avoiding obstacles. We address key limitations of previous methods by continuously predicting local trajectory using monocular depth and traversability estimation, and in-corporating an auto-switching mechanism that falls back to a baseline controller when necessary. The system operates using foundational models, ensuring open-set applicability without the need for domain-specific fine-tuning. We demonstrate the effectiveness of our method in both simulated environments and real-world tests, highlighting its robustness and deployability. Our approach outperforms existing state-of-the-art methods, offering a more adaptable and effective solution for visual navigation in open-set environments. The source code is made publicly available: https://github.com/podgorki/TANGO.

Abstract:
Multi-session visual SLAM systems enable 6-DoF camera localization along with long-term maintenance and expansion of the global map, by utilizing image data from different sessions. However, in large-scale environments, these systems often suffer from severe scale drift. While modern SLAM systems attempt to maintain global map consistency through loop detection and correction, they still face challenges in terms of convergence and accuracy. In this paper, we propose a robust large-scale multi-session SLAM system for long-term localization and mapping that achieves global consistency. Furthermore, to address the backend optimization problem in large-scale environments, we introduce a hierarchical optimization strategy based on the graph structure. More specifically, a subgraph structure is introduced to reduce the size of problem while effectively propagating scale correction information. In addition, a hierarchical strategy enables coarse-to-fine updates of the graph states. Experimental results not only demonstrate that our method efficiently optimizes the pose graph and maintains map consistency in large-scale environments, but also highlight the effectiveness and scalability of the proposed approach.

Abstract:
In recent years, Neural Radiance Fields (NeRF) has revolutionized three-dimensional (3D) reconstruction with its implicit representation. Building upon NeRF, 3D Gaussian Splatting (3D-GS) has departed from the implicit representation of neural networks and instead directly represents scenes as point clouds with Gaussian-shaped distributions. While this shift has notably elevated the rendering quality and speed of radiance fields but inevitably led to a significant increase in memory usage. Additionally, effectively rendering dynamic scenes in 3D-GS has emerged as a pressing challenge. To address these concerns, this paper proposes a refined 3D Gaussian representation for high-quality dynamic scene reconstruction. Firstly, we use a deformable multi-layer perceptron (MLP) network to capture the dynamic offset of Gaussian points and express the color features of points through hash encoding and a tiny MLP to reduce storage requirements. Subsequently, we introduce a learnable denoising mask coupled with denoising loss to eliminate noise points from the scene, thereby further compressing 3D Gaussian model. Finally, motion noise of points is mitigated through static constraints and motion consistency constraints. Experimental results demonstrate that our method surpasses existing approaches in rendering quality and speed, while significantly reducing the memory usage associated with 3D-GS, making it highly suitable for various tasks such as novel view synthesis, and dynamic mapping.

Abstract:
As 4D extensions of 3D Gaussian Splatting (4D-GS) emerge as groundbreaking techniques for dynamic scene reconstruction and novel view synthesis in robotics and computer vision, ensuring the security and trustworthiness of these assets becomes crucial. While steganography has advanced significantly in 2D and 3D media, existing methods are inadequate for the complex, dynamic nature of 4D-GS representations. To address this gap, we propose Hide-in-Motion, a novel 4D steganography method for hiding information through deformation in Gaussian splatting. Our approach introduces a composite attribute and a Decouple Feature Field for coarse-to-fine deformation modeling and embedding implicit information, along with an Opacity-Guided Adaptive strategy. Hide-in-Motion overcomes the limitations of previous techniques, enhancing both the robustness of embedded information and the quality of 4D reconstruction. Extensive evaluations demonstrate that our method successfully embeds and recovers implicit information across various modalities while maintaining high rendering quality in dynamic scenes. This work not only advances copyright protection and secure data transmission for 4D assets but also paves the way for enhancing the security and integrity of 4D digital assets. Code is available at https://github.com/CUHK-AIM-Group/Hide-in-Motion.

Abstract:
3D point cloud object detection plays an important role in autonomous driving. However, labeling 3D object boxes is expensive and time-consuming, limiting the number of annotated point clouds used in fully-supervised training. This has led to a rise in semi-supervised 3D object detection research, which aims to improve model performance by leveraging both labeled and unlabeled point clouds. Existing methods typically rely on the Mean Teacher (MT) paradigm, which uses unlabeled instances discovered by the teacher with confidence scores higher than certain thresholds to train the student. However, this leads to a loss of information as it overlooks ambiguous instances from the teacher that could also contain valuable knowledge. To address this issue, we propose a Bi-Stream Knowledge Transfer (BiKT) framework that fully exploits and transfers knowledge from both confident and ambiguous instances to the student network. Specifically, all pseudo labels are allocated into two knowledge streams, the deterministic stream and the noisy stream, and then subsequently guide the student network through bi-level supervision. We also introduce a Dynamic Stream Switching (DSS) algorithm that sets the stream boundary tailored for the current learning status. To further improve the quality of pseudo labels in the knowledge streams, we propose a Diffusive Label Denoising (DLD) module, which is trained by explicitly generating noised instances and then learning to denoise them, as in diffusion models. Experiments show the state-of-the-art performance of our BiKT on the ONCE validation and testing sets, as well as the robust generalization capability when confronted with diverse base detectors, increased amount of unlabeled data, and distinct datasets (e.g., Waymo), unveiling the power of semi-supervised learning in 3D object detection.

Abstract:
Automating surgical systems enhances precision and safety while reducing human involvement in high-risk environments. A major challenge in automating surgical procedures like suturing is accurately modeling the suture thread, a highly flexible and compliant component. Existing models either lack the accuracy needed for safety-critical procedures or are too computationally intensive for real-time execution. In this work, we introduce a novel approach for modeling suture thread dynamics using control barrier functions (CBFs), achieving both realism and computational efficiency. Thread-like behavior, collision avoidance, stiffness, and damping are all modeled within a unified CBF and control Lyapunov function (CLFs) framework. Our approach eliminates the need to calculate complex forces or solve differential equations, significantly reducing computational overhead while maintaining a realistic model suitable for both automation and virtual reality surgical training systems. The framework also allows visual cues to be provided based on the thread's interaction with the environment, enhancing user experience when performing suture or ligation tasks. The proposed model is tested on the MagnetoSuture system, a minimally invasive robotic surgical platform that uses magnetic fields to manipulate suture needles, offering a less invasive solution for surgical procedures.

Abstract:
Colorectal cancer is the third most commonly diagnosed cancer and the second leading cause of cancer-related deaths in the United States. Despite advancements in screening and treatment, there remains a critical need for more effective and minimally invasive methods to manage complex polyps and early-stage colorectal cancers. This study introduces a novel approach to magnetic tissue manipulation for Endoscopic Submucosal Dissection (ESD), leveraging visual feedback to enhance precision and control. We develop and evaluate the proposed system within a ROS Gazebo simulation environment, integrating a small magnetic endoscopic clip affixed to tissue, which is manipulated by an external large magnet mounted on a robotic arm. A key challenge in ESD is achieving adequate tissue exposure for precise cutting, particularly in the confined space of the colon where the endoscope is manually controlled. To address this, our system enables controlled manipulation of the magnetic clip to optimize tissue retraction. The robotic arm, guided by real-time visual feedback, dynamically adjusts the internal clip's orientation. Multiple virtual cameras were used to validate the proposed method. The simulation results demonstrated that the robot arm successfully manipulated the internal magnetic clip to the desired tilt angle within an average of 8.4 seconds (range 5.3 to \mathbf1 5. 2 ~ s). Our findings suggest that robotic-assisted magnetic tissue manipulation has the potential to improve ESD success rates while reducing procedure time, paving the way for further advancements in minimally invasive endoscopic surgery.

Abstract:
Maintaining Situational Awareness (SA) is critical in space exploration contexts, yet made particularly difficult due to the presence of communication latency. In order to increase human SA without inducing cognitive overload, researchers have proposed Performative Autonomy (PA), in which robots intentionally interact at a lower level of autonomy than they are capable of. While researchers have demonstrated positive impacts of PA on team performance even under high latency, previous work on PA has not examined how the benefits of PA might be mediated by latency. In this work, we thus evaluate the impact of latency and PA on trust, SA, and human perceptions of robot intelligence and autonomy. Our results suggest that lower performed autonomy leads to increased cognitive load, especially when robot communication happens frequently and latency is present. In addition, we observe no effect of the PA strategies used within our experimental paradigm on SA, and instead find evidence that operating under high latency leads to negative perceptions of robots regardless of choice of PA strategy.

Abstract:
In physical human-robot interaction, force feedback has been the most common sensing modality to convey the human intention to the robot. It is widely used in admittance control to allow the human to direct the robot. However, it cannot be used in scenarios where direct force feedback is not available since manipulated objects are not always equipped with a force sensor. In this work, we study one such scenario: the collaborative pushing and pulling of heavy objects on frictional surfaces, a prevalent task in industrial settings. When humans do it, they communicate through verbal and non-verbal cues, where body poses, and movements often convey more than words. We propose a novel context-aware approach using Directed Graph Neural Networks to analyze spatiotemporal human posture data to predict human motion intention for non-verbal collaborative physical manipulation. Our experiments demonstrate that robot assistance significantly reduces human effort and improves task efficiency. The results indicate that incorporating posture-based context recognition, either together with or as an alternative to force sensing, enhances robot decision-making and control efficiency.

Abstract:
Ultra-wideband (UWB) is gaining popularity with devices like AirTags for precise home item localization but faces significant challenges when scaled to large environments like seaports. The main challenges are calibration and localization under obstructed conditions, which are common in logistics environments. Traditional calibration methods, dependent on line-of-sight (LoS), are slow, costly, and unreliable in seaports and warehouses, making large-scale localization a significant pain point in the industry. To overcome these challenges, we propose a one-shot calibration and localization framework based on UWB-LiDAR fusion. Our method uses Gaussian processes to estimate the anchor position from continuous-time LiDAR Inertial Odometry with sampled UWB ranges. This approach ensures accurate and reliable calibration with only one round of sampling in large-scale areas, i.e., 600 × 450 ~\mathrmm^2. With LoS issues, UWB-only localization can be problematic, even when anchor positions are known. We demonstrate that by applying a UWB-range filter, the search range for LiDAR loop closure descriptors is significantly reduced, improving both accuracy and speed. This concept can be applied to other loop closure detection methods, enabling cost-effective localization in large-scale warehouses and seaports. It significantly improves precision in challenging environments where the UWB-only and LiDAR-Inertial methods fail, as shown in the video https://https://youtu.be/oY8jQKdM7lU. We will open-source our datasets and calibration codes for community use.

Abstract:
Skid-steered wheel mobile robots (SSWMRs) operate in a variety of outdoor environments exhibiting motion behaviors dominated by the effects of complex wheel-ground interactions. Characterizing these interactions is crucial from both the immediate robot autonomy perspective (for motion prediction and control) and a long-term predictive maintenance and diagnostics perspective. An ideal solution entails capturing precise state measurements for decisions and controls, which is considerably difficult, especially in increasingly unstructured outdoor regimes of operations for these robots. In this milieu, a framework to identify pre-determined discrete modes of operation can considerably simplify the motion model identification process. To this end, we propose an interactive multiple model (IMM) based filtering framework to probabilistically identify predefined robot operation modes that could arise due to traversal in different terrains or loss of wheel traction.

Abstract:
Recent advancements in Visual Language Models (VLMs) have significantly driven research in open-vocabulary 3D scene reconstruction, showcasing strong potential in open-set retrieval and semantic understanding. However, existing approaches face challenges in open-world environments: they either suffer from insufficient precision in semantic segmentation, leading to inadequate fine-grained scene understanding, or they are limited to object-level reconstruction, failing to capture intricate object details and lack applicability in open-world settings. To address these issues, we introduce LE-Object, an object-centric Neural Implicit Radiance Field (NeRF) method for open-world scenarios to achieve fine-grained scene understanding and high-fidelity object reconstruction. LE-Object integrates spatial features (SF) from object point clouds with visual features (VF) from VLMs to perform object association, ensuring spatiotemporal consistency in object mask segmentation, and extends VLM features from 2D images into 3D space, enabling precise open-world semantic inference and detailed object reconstruction. Experimental results demonstrate that LE-Object excels in zero-shot semantic segmentation and open-world object reconstruction, offering innovative solutions for global navigation and local object manipulation in open-world applications.

Abstract:
Transparent object manipulation remains a significant challenge in robotics due to the difficulty of acquiring accurate and dense depth measurements. Conventional depth sensors often fail with transparent objects, resulting in incomplete or erroneous depth data. Existing depth completion methods struggle with interframe consistency and incorrectly model transparent objects as Lambertian surfaces, leading to poor depth reconstruction. To address these challenges, we propose TranSplat, a surface embedding-guided 3D Gaussian Splatting method tailored for transparent objects. TranSplat uses a latent diffusion model to generate surface embeddings that provide consistent and continuous representations, making it robust to changes in viewpoint and lighting. By integrating these surface embeddings with input RGB images, TranSplat effectively captures the complexities of transparent surfaces, enhancing the splatting of 3D Gaussians and improving depth completion. Evaluations on synthetic and real-world transparent object benchmarks, as well as robot grasping tasks, show that TranSplat achieves accurate and dense depth completion, demonstrating its effectiveness in practical applications. We open-source synthetic dataset and model: https://github.com/jeongyun0609/TranSplat

Abstract:
In this paper, we present a novel algorithm for probabilistically updating and rasterizing semantic maps within 3D Gaussian Splatting (3D-GS). Although previous methods have introduced algorithms which learn to rasterize features in 3D-GS for enhanced scene understanding, 3D-GS can fail without warning which presents a challenge for safety-critical robotic applications. To address this gap, we propose a method which advances the literature of continuous semantic mapping from voxels to ellipsoids, combining the precise structure of 3D-GS with the ability to quantify uncertainty of probabilistic robotic maps. Given a set of images, our algorithm performs a probabilistic semantic update directly on the 3D ellipsoids to obtain an expectation and variance through the use of conjugate priors. We also propose a probabilistic rasterization which returns per-pixel segmentation predictions with quantifiable uncertainty. We compare our method with similar probabilistic voxel-based methods to verify our extension to 3D ellipsoids, and perform ablation studies on uncertainty quantification and temporal smoothing.

Abstract:
Training robots directly from human videos is an emerging area in robotics and computer vision. While there has been notable progress with two-fingered grippers, learning autonomous tasks without teleoperation remains a difficult problem for multi-fingered robot hands. A key reason for this difficulty is that a policy trained on human hands may not directly transfer to a robot hand with a different morphology. In this work, we present HUDOR, a technique that enables online fine-tuning of the policy by constructing a reward function from the human video. Importantly, this reward function is built using object-oriented rewards derived from off-the-shelf point trackers, which allows for meaningful learning signals even when the robot hand is in the visual observation, while the human hand is used to construct the reward. Given a single video of human solving a task, such as gently opening a music box, HUDOR allows our four-fingered Allegro hand to learn this task with just an hour of online interaction. Our experiments across four tasks, show that HUDOR outperforms alternatives with an average of 4 × improvement. Code and videos are available on our website https://object-rewards.github.io/.

Abstract:
Grasp planning and estimation have been a longstanding research problem in robotics, with two main approaches to find graspable poses on the objects: 1) geometric approach, which relies on 3D models of objects and the gripper to estimate valid grasp poses, and 2) data-driven, learning-based approach, with models trained to identify grasp poses from raw sensor observations. The latter assumes comprehensive geometric coverage during the training phase. However, the data-driven approach is typically biased toward tabletop scenarios and struggle to generalize to out-of-distribution scenarios with larger objects (e.g. chair). Additionally, raw sensor data (e.g. RGB-D data) from a single view of these larger objects is often incomplete and necessitates additional observations. In this paper, we take a geometric approach, leveraging advancements in object modeling (e.g. NeRF) to build an implicit model by taking RGB images from views around the target object. This model enables the extraction of explicit mesh model while also capturing the visual appearance from novel viewpoints that is useful for perception tasks like object detection and pose estimation. We further decompose the NeRFreconstructed 3D mesh into superquadrics (SQs) - parametric geometric primitives, each mapped to a set of precomputed grasp poses, allowing grasp composition on the target object based on these primitives. Our proposed pipeline overcomes the problems: a) noisy depth and incomplete view of the object, with a modeling step, and b) generalization to objects of any size. For more qualitative results, refer to the supplementary video and webpage https://rpm-lab-umn.github.io/superq-grasp-webpage/.

Abstract:
Endoscopic Submucosal Dissection (ESD) is an effective minimally invasive approach to removing colon cancer, yet it is underutilized, since it is challenging to learn and perform. To promote the adoption of ESD by making it easier, we propose a system in which two small, flexible robotic manipulators are delivered through a colonoscope. Our system differs from prior robotic systems aimed at this application in that our manipulators are small enough to fit through a clinically used colonoscope. By not re-engineering the colonoscope, we maintain overall system diameter at the current clinical gold standard, and streamline the path to eventual clinical deployment. Our concentric push-pull robot (CPPR) manipulators offer dexterity and simultaneously provide a conduit for grasper or cutting tool deployment. Each manipulator in our system consists of two push-pull tube pairs, and we describe how they are actuated. We describe for the first time our approach to compensating for undesirable CPPR tip motion induced by differences in the tubes' transmission stiffness. We also evaluate the workspace of the manipulators and demonstrate teleoperation in a point-touching experiment. Lastly, we demonstrate the ability of the system to resect tissue via ex vivo animal experiments.

Abstract:
Recent advancements in autonomous driving have seen a paradigm shift towards end-to-end learning paradigms, which map sensory inputs directly to driving actions, thereby enhancing the robustness and adaptability of autonomous vehicles. However, these models often sacrifice interpretability, posing significant challenges to trust, safety, and regulatory compliance. To address these issues, we introduce DRIVE – Dependable Robust Interpretable Visionary Ensemble Framework in Autonomous Driving, a comprehensive framework designed to improve the dependability and stability of explanations in end-to-end unsupervised autonomous driving models. Our work specifically targets the inherent instability problems observed in the Driving through the Concept Gridlock (DCG) model, which undermine the trustworthiness of its explanations and decisionmaking processes. We define four key attributes of DRIVE: consistent interpretability, stable interpretability, consistent output, and stable output. These attributes collectively ensure that explanations remain reliable and robust across different scenarios and perturbations. Through extensive empirical evaluations, we demonstrate the effectiveness of our framework in enhancing the stability and dependability of explanations, thereby addressing the limitations of current models. Our contributions include an in-depth analysis of the dependability issues within the DCG model, a rigorous definition of DRIVE with its fundamental properties, a framework to implement DRIVE, and novel metrics for evaluating the dependability of concept-based explainable autonomous driving models. These advancements lay the groundwork for the development of more reliable and trusted autonomous driving systems, paving the way for their broader acceptance and deployment in real-world applications. “We can only see a short distance ahead, but we can see plenty there that needs to be done.” – Alan Turing

Abstract:
3D single object tracking is essential in autonomous driving and robotics. Existing methods often struggle with sparse and incomplete point cloud scenarios. To address these limitations, we propose a Multimodal-guided Virtual Cues Projection (MVCP) scheme that generates virtual cues to enrich sparse point clouds. Additionally, we introduce an enhanced tracker MVCTrack based on the generated virtual cues. Specifically, the MVCP scheme seamlessly integrates RGB sensors into LiDAR-based systems, leveraging a set of 2D detections to create dense 3D virtual cues that significantly improve the sparsity of point clouds. These virtual cues can naturally integrate with existing LiDAR-based 3D trackers, yielding substantial performance gains. Extensive experiments demonstrate that our method achieves competitive performance on the NuScenes dataset. Code is available at code and video.

Affiliations: School of Software Engineering, Xi'an Jiaotong University, Xi'an, China; College of Computer Science and Technology, Jilin University, Changchun, China; BNRist, THUIBCS, KLISS, BLBCI, School of Software, Tsinghua University, Beijing, China; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, China

Abstract:
Low-light image enhancement aims to restore the under-exposure image captured in dark scenarios. Under such scenarios, traditional frame-based cameras may fail to capture the structure and color information due to the exposure time limitation. Event cameras are bio-inspired vision sensors that respond to pixel-wise brightness changes asynchronously. Event cameras' high dynamic range is pivotal for visual perception in extreme low-light scenarios, surpassing traditional cameras and enabling applications in challenging dark environments. In this paper, inspired by the success of the retinex theory for traditional frame-based low-light image restoration, we introduce the first methods that combine the retinex theory with event cameras and propose a novel retinex-based lowlight image restoration framework named ERetinex. Among our contributions, the first is developing a new approach that leverages the high temporal resolution data from event cameras with traditional image information to estimate scene illumination accurately. This method outperforms traditional image-only techniques, especially in low-light environments, by providing more precise lighting information. Additionally, we propose an effective fusion strategy that combines the high dynamic range data from event cameras with the color information of traditional images to enhance image quality. Through this fusion, we can generate clearer and more detailrich images, maintaining the integrity of visual information even under extreme lighting conditions. The experimental results indicate that our proposed method outperforms state-of-theart (SOTA) methods, achieving a gain of 1.0613 dB in PSNR while reducing FLOPS by 84.28 %. The code is available at https://github.com/lodew920/ERetinex.

Abstract:
People use tools to interact with and perceive the world, with multimodal sensory inputs forming the basis of how we understand our environment. For example, a blind person uses a walking cane to tap the road and detect obstacles, and a builder uses a hammer to strike a wall to assess its structural integrity. Using tools extends our sensory capabilities during exploratory behaviors, enabling us to perceive object properties that are otherwise inaccessible. Inspired by this cognitive process, we propose a framework in which a multisensory robot employs exploratory behaviors using various tools to recognize granular substances. Our framework effectively integrates multiple non-visual sensory inputs (e.g., audio, haptic, and tactile) gathered through multiple tools (e.g., spoon, fork) and behaviors (e.g., stirring, poking) to perceive object properties. The framework segments interactions into time windows and aligns different modalities, enhancing data efficiency and interactive perception. Additionally, we conducted tool-transfer experiments to evaluate similarities between tools. Our experiments demonstrate that combining multiple tools and behaviors outperforms single-tool and singlebehavior approaches. While the audio modality dominates the non-visual multimodal system, other modalities contribute. We further demonstrate that tool similarities vary depending on the behavior, and notably, the robot does not need to complete entire interactions to achieve optimal recognition accuracy.

Abstract:
Traffic sign is a critical map feature for navigation and traffic control. Nevertheless, current methods for traffic sign recognition rely on traditional deep learning models, which typically suffer from significant performance degradation considering the variations in data distribution across different regions. In this paper, we propose TSCLIP, a robust fine-tuning approach with the contrastive language-image pre-training (CLIP) model for worldwide cross-regional traffic sign recognition. We first curate a cross-regional traffic sign benchmark dataset by combining data from ten different sources. Then, we propose a prompt engineering scheme tailored to the characteristics of traffic signs, which involves specific scene descriptions and corresponding rules to generate targeted text descriptions. During the TSCLIP fine-tuning process, we implement adaptive dynamic weight ensembling (ADWE) to seamlessly incorporate outcomes from each training iteration with the zero-shot CLIP model. This approach ensures that the model retains its ability to generalize while acquiring new knowledge about traffic signs. To the best knowledge of authors, TSCLIP is the first contrastive language-image model used for the worldwide cross-regional traffic sign recognition task. The project website is available at: https://github.com/guoyangzhao/TSCLIP.

Abstract:
Autonomous navigation in mobile robots has made significant advancements; however, traditional methods often struggle to adapt in real-time to dynamic or unstructured environments. This paper presents the Online Planner's Parameter Adaptation (OPPA) framework, which enhances both adaptability and safety in mobile robot navigation by dynamically adjusting planner parameters. OPPA integrates a rule-based system for estimating tunnel width using 2D LiDAR and path data with a learning-based approach utilizing a shallow transformer model. By incorporating a human-in-the-loop process to refine training data, OPPA improves accuracy and reliability in complex environments. Designed for real-time efficiency on resource-constrained platforms, OPPA has been validated through simulation and real-world experiments, demonstrating its ability to enhance both safety and performance. These results highlight OPPA as a viable solution for dynamic and complex robotic applications.

Abstract:
In crowd navigation, the local goal plays a crucial role in trajectory initialization, optimization, and evaluation. Recognizing that when the global goal is distant, the robot's primary objective is avoiding collisions, making it less critical to pass through the exact local goal point, this work introduces the concept of goal lines, which extend the traditional local goal from a single point to multiple candidate lines. Coupled with a topological map construction strategy that groups obstacles to be as convex as possible, a goal-adaptive navigation framework is proposed to efficiently plan multiple candidate trajectories. Simulations and experiments demonstrate that the proposed GA-TEB framework effectively prevents deadlock situations, where the robot becomes frozen due to a lack of feasible trajectories in crowded environments. Additionally, the framework greatly increases planning frequency in scenarios with numerous non-convex obstacles, enhancing both robustness and safety.

Abstract:
In ophthalmic surgery, surgeons or robots manipulate a light probe and an instrument around two separated trocars following sclerotomy to achieve orbital control for eyeball pose adjustment and subsequent surgical tasks referring to microscope frames. However, current methods face significant challenges in directly extracting the eyeball pose from real-time microscope frames due to the limited microscope perspective and the darkened operating room (OR). This paper decomposes eyeball rotations only along the x and y axes. Then, a method of calculating eyeball poses using eyeball geometry and microscopic trocar positions is presented. This method is tested by simulation and a phantom system with current [2.0, 2.8] degree error, providing assistant intraoperative eyeball status in the dark OR with extended method discussions.

Abstract:
Developing generalizable robot policies that can robustly handle varied environmental conditions and object instances remains a fundamental challenge in robot learning. While considerable efforts have focused on collecting large robot datasets and developing policy architectures to learn from such data, naïvely learning from visual inputs often results in brittle policies that fail to transfer beyond the training data. This work presents Prescriptive Point Priors for Policies or P3-PO, a novel framework that constructs a unique state representation of the environment leveraging recent advances in computer vision and robot learning to achieve improved out-of-distribution generalization for robot manipulation. This representation is obtained through two steps. First, a human annotator prescribes a set of semantically meaningful points on a single demonstration frame. These points are then propagated through the dataset using off-the-shelf vision models. The derived points serve as an input to state-of-the-art policy architectures for policy learning. Our experiments across four real-world tasks demonstrate an overall 43% absolute improvement over prior methods when evaluated in identical settings as training. Further, P3-PO exhibits 58% and 80% gains across tasks for new object instances and more cluttered environments respectively. Videos illustrating the robot's performance are best viewed at point-priors.github.io.

Abstract:
The task of UAV-view geo-localization is to match a query image with database images to estimate the current geographic location of the query image. This is particularly useful in environments where GPS is not available or when the device fails. Although deep learning methods make sufficient progress in UAV-view geo-localization, they still face challenges in improving the distinguishability of features. For instance, some feature aggregation methods do not consider semantic integrity, and robust elements in the image are not given enough attention. This paper proposes a UAV-view geo-localization method (APA-BI) to tackle the above issues. Specifically, we propose an adaptive partition aggregation method to ensure feature integrity at the semantic level by increasing the receptive field of the classifier module. At the same time, we design a bidirectional integration module to further enhance feature distinguishability by extracting robust tubular topological structures from images. Experimental results on public datasets demonstrate that APA-BI achieves impressive retrieval accuracy and outperforms most state-of-the-art methods. Moreover, the test results of APA-BI in real-world scenarios also show excellent performance.

Abstract:
As industries increasingly adopt large robotic fleets, there is a pressing need for computationally efficient, practical, and optimal conflict-free path planning for multiple robots. Conflict-Based Search (CBS) is a popular method for multi-agent path finding (MAPF) due to its completeness and optimality; however, it is often impractical for real-world applications, as it is computationally intensive to solve and relies on assumptions about agents and operating environments that are difficult to realize. This article proposes a solution to overcome computational challenges and practicality issues of CBS by utilizing structural-semantic topometric maps. Instead of running CBS over large grid-based maps, the proposed solution runs CBS over a sparse topometric map containing structural-semantic cells representing intersections, pathways, and dead ends. This approach significantly accelerates the MAPF process and reduces the number of conflict resolutions handled by CBS while operating in continuous time. In the proposed method, robots are assigned time ranges to move between topometric regions, departing from the traditional CBS assumption that a robot can move to any connected cell in a single time step. The approach is validated through real-world multi-robot path-finding experiments and benchmarking simulations. The results demonstrate that the proposed MAPF method can be applied to real-world non-holonomic robots and yields significant improvement in computational efficiency compared to traditional CBS methods while improving conflict detection and resolution in cases of corridor symmetries.

Abstract:
Imagine a future when we can Zoom-call a robot to manage household chores remotely. This work takes one step in this direction. Robi Butler is a new household robot assistant that enables seamless multimodal remote interaction. It allows the human user to monitor its environment from a first-person view, issue voice or text commands, and specify target objects through hand-pointing gestures. At its core, a high-level behavior module, powered by Large Language Models (LLMs), interprets multimodal instructions to generate multistep action plans. Each plan consists of open-vocabulary primitives supported by vision-language models, enabling the robot to process both textual and gestural inputs. Zoom provides a convenient interface to implement remote interactions between the human and the robot. The integration of these components allows Robi Butler to ground remote multimodal instructions in real-world home environments in a zero-shot manner. We evaluated the system on various household tasks, demonstrating its ability to execute complex user commands with multimodal inputs. We also conducted a user study to examine how multimodal interaction influences user experiences in remote human-robot interaction. These results suggest that with the advances in robot foundation models, we are moving closer to the reality of remote household robot assistants.

Abstract:
We present admittance control and fingertip contact detection with a linkage gripper remotely driven by a pneumatic rolling diaphragm actuator. The gripper is driven by underactuated mechanisms sensorized by joint encoders in order to fully determine the gripper state. We present the modelling of the linkage and fluidic transmission, validate its ability to regulate pinch force via admittance control within an RMS error well under 0.5 Newtons, and show the ability to detect contact at targeted locations on the linkage. In addition, we demonstrate simple grasping behaviors: blindly searching for an unobstructed object and detecting object loss. Our results show that an integrative approach of instrumenting underactuated gripper mechanisms can result in a lightweight gripper that is not only mechanically adaptive but sensitive enough to react to contact events without distal sensors or vision.

Abstract:
Recent advances in Behavior Cloning (BC) have made it easy to teach robots new tasks. However, we find that the ease of teaching comes at the cost of unreliable performance that saturates with increasing data for tasks requiring precision. The performance saturation can be attributed to two critical factors: (a) distribution shift resulting from the use of offline data and (b) the lack of closed-loop corrective control caused by action chucking (predicting a set of future actions executed open-loop) critical for BC performance. Our key insight is that by predicting action chunks, BC policies function more like trajectory “planners” than closedloop controllers necessary for reliable execution. To address these challenges, we devise a simple yet effective method, Resip (Residual for Precise Manipulation), that overcomes the reliability problem while retaining BC's ease of teaching and long-horizon capabilities. Resip augments a frozen, chunked BC model with a fully closed-loop residual policy trained with reinforcement learning (RL) that addresses distribution shifts and introduces closed-loop corrections over open-loop execution of action chunks predicted by the BC trajectory planner. Videos, code, and data: residual-assembly.github.io.

Abstract:
Compliance plays a crucial role in manipulation, as it balances between the concurrent control of position and force under uncertainties. Yet compliance is often overlooked by today's visuomotor policies that solely focus on position control. This paper introduces Adaptive Compliance Policy (ACP), a novel framework that learns to dynamically adjust system com-pliance both spatially and temporally for given manipulation tasks from human demonstrations, improving upon previous approaches that rely on pre-selected compliance parameters or assume uniform constant stiffness. However, computing full compliance parameters from human demonstrations is an ill- defined problem. Instead, we estimate an approximate compli-ance profile with two useful properties: avoiding large contact forces and encouraging accurate tracking. Our approach en-ables robots to handle complex contact-rich manipulation tasks and achieves over 50% performance improvement compared to state-of-the-art visuomotor policy methods. Project website with result videos: adaptive-compliance.github.io.

Abstract:
Forklifts are used extensively in various industrial settings and are in high demand for automation. In particular, counterbalance forklifts are highly versatile and are employed in diverse scenarios. However, efforts to automate these processes are lacking, primarily owing to the absence of a safe and performance-verifiable development environment. This study proposes a learning system that combines a photorealistic digital learning environment with a 1 / 14-scale robotic forklift environment to address this challenge. Inspired by the training-based learning approach adopted by forklift operators, we employ an end-to-end vision-based deep reinforcement learning approach. The learning is conducted in a digitalized environment created from CAD data, making it safe and eliminating the need for real-world data. In addition, we safely validate the method in a physical setting using a 1 / 14-scale robotic forklift with a configuration similar to that of a real forklift. We achieved a 60% success rate in pallet loading tasks in real experiments using a robotic forklift. Our approach demonstrates zero-shot sim2real with a simple method that does not require heuristic additions. This learning-based approach is considered a first step towards the automation of counterbalance forklifts.

Abstract:
Due to high dimensionality and non-convexity, real-time optimal control using full-order dynamics models for legged robots is challenging. Therefore, Nonlinear Model Predictive Control (NMPC) approaches are often limited to reduced-order models or local approximations. Sampling-based MPC has shown potential in nonconvex even discontinuous problems, but often yields suboptimal solutions with high variance, which limits its applications in high-dimensional locomotion. This work introduces DIAL-MPC (Diffusion-Inspired Annealing for Legged MPC), a sampling-based MPC framework with a novel diffusion-style annealing process. Such a process is supported by the theoretical landscape analysis of Model Predictive Path Integral Control (MPPI) and the connection between MPPI and single-step diffusion. Algorithmically, DIALMPC iteratively refines solutions online and achieves both global coverage and local convergence. In quadrupedal torquelevel control tasks, DIAL-MPC reduces the tracking error of standard MPPI by 13.4 times and outperforms reinforcement learning (RL) policies by 50 % in challenging climbing tasks without any training. In particular, DIAL-MPC enables precise real-world quadrupedal jumping with payload. To the best of our knowledge, DIAL-MPC is the first training-free method that optimizes over full-order legged dynamics in real-time.

Abstract:
Fire poses significant threats to life and property, necessitating efficient inspection and accurate identification. Although aerial computer vision algorithms hold great promise, the deployment and computational limitations of onboard platforms prevent existing algorithms from meeting high standards of accuracy and real-time performance. To address these challenges, we propose an lightweight aerial fire detection model, LAFNET. This model incorporates the EffiDarknetLight backbone, optimized for both lightweight design and ease of deployment, integrates specially designed LightGhost(LG) block components within the LightGhost-Path Aggregation Network(LG-PAN) neck, resulting in a model Params of only 1.3 M. Experimental results demonstrate that our method attains a good trade-off between lightweight design and detection accuracy. Compared to the smallest standard YOLO series' model YOLOv5n, LAFNET improves MAP by \mathbf2. 1 %, while reducing Params and FLOPs by \mathbf2 7. 8 % and \mathbf2 9. 3 %, the inference speed on Nvidia Orin Nano edge computing side improves 24.8 %. These experiments indicate that LAFNET offers a highly efficient solution for aerial fire detection, combining speed and accuracy.

Affiliations: Maryland Robotics Center, University of Maryland, College Park, MD, USA; Center for Autonomous and Robotic Systems, University of Delaware, Newark, DE, USA; Automation and Control Institute, TU Wien, Vienna, Austria; School of Marine Science and Policy, University of Delaware, Lewes, DE, USA; Center for Environmental Science, Horn Point Laboratory, University of Maryland, Cambridge, MD, USA; Indepedendent Robotics, Montreal, QC, Canada; College of Engineering and Applied Science, University of Cincinnati, OH, USA

Abstract:
Oysters are a vital keystone species in coastal ecosystems, providing significant economic, environmental, and cultural benefits. As the importance of oysters grows, so does the relevance of autonomous systems for their detection and monitoring. However, current monitoring strategies often rely on destructive methods. While manual identification of oysters from video footage is non-destructive, it is time-consuming, requires expert input, and is further complicated by the challenges of the underwater environment. To address these challenges, we propose a novel pipeline using stable diffusion to augment a collected real dataset with photorealistic synthetic data. This method enhances the dataset used to train a YOLOv10-based vision model. The model is then deployed and tested on an edge platform; Aqua2, an Autonomous Underwater Vehicle (AUV), achieving a state-of-the-art 0.657 mAP@50 for oyster detection.

Abstract:
Hamilton-Jacobi (HJ) reachability analysis is a widely adopted verification tool to provide safety and performance guarantees for autonomous systems. However, it involves solving a partial differential equation (PDE) to compute a safety value function, whose computational and memory complexity scales exponentially with the state dimension, making its direct application to large-scale systems intractable. To overcome these challenges, DeepReach, a recently proposed learning-based approach, approximates high-dimensional reachable tubes using neural networks (NNs). While shown to be effective, the accuracy of the learned solution decreases with system complexity. One of the reasons for this degradation is a soft imposition of safety constraints during the learning process, which corresponds to the boundary conditions of the PDE, resulting in inaccurate value functions. In this work, we propose ExactBC, a variant of DeepReach that imposes safety constraints exactly during the learning process by restructuring the overall value function as a weighted sum of the boundary condition and the NN output. Moreover, the proposed variant no longer needs a boundary loss term during the training process, thus eliminating the need to balance different loss terms. We demonstrate the efficacy of the proposed approach in significantly improving the accuracy of the learned value function for four challenging reachability tasks: a rimless wheel system with state resets, collision avoidance in a cluttered environment, autonomous rocket landing, and multi-aircraft collision avoidance.

Abstract:
Picking diverse objects in the real world is a fundamental robotics skill. However, many objects in such settings are bulky, heavy, or irregularly shaped, making them ungraspable by conventional end effectors like suction grippers and parallel jaw grippers (PJGs). In this paper, we expand the range of pickable items without hardware modifications using bimanual nonprehensile manipulation. We focus on a grocery shopping scenario, where a bimanual mobile manipulator equipped with a suction gripper and a PJG is tasked with re-trieving ungraspable items from tightly packed grocery shelves. From visual observations, our method first identifies optimal grasp points based on force closure and friction constraints. If the grasp points are occluded, a series of nonprehensile nudging motions are performed to clear the obstruction. A bimanual grasp utilizing contacts on the side of the end effectors is then executed to grasp the target item. In our replica grocery store, we achieved a 90% success rate over 102 trials in uncluttered scenes, and a 67 % success rate over 45 trials in cluttered scenes. We also deployed our system to a real-world grocery store and successfully picked previously unseen items. Our results highlight the potential of bimanual nonprehensile manipulation for in-the-wild robotic picking tasks. A video summarizing this work can be found at youtu. be/gOhOrDuK8jM.

Abstract:
Achieving a real-time precise grasp of a specified target object in densely cluttered environments is an essential capability for autonomous robot operation. Recently, considerable investigations on planar and spatial grasp have been carried out, and significant results have been obtained. However, these point cloud-based grasp prediction methods often fail to ensure that the generated grasp configurations meet the precise requirements of the task. Additionally, some of the existing grasp pipelines are too time-consuming to meet the demand for real-time robot response. In more challenging cluttered scenes, the quality of pose and gripper jaw opening estimation in highdimensional space requires further improvement. Therefore, this paper introduces a data- and model-independent and efficient method to generate 7-DoF grasp configurations for arbitrary target objects from single-view point cloud data in dense cluttered scenes. In addition, this paper proposes a grasp framework that generates the grasp configuration for the target object while reducing the time consumed during the grasp process, to enable robots to efficiently grasp target objects for designated tasks. The grasp pipeline focuses on guided regions via target detection and rapidly adjusts grasp configurations through multi-region point cloud distribution perception. Extensive real-world robot experiments have demonstrated the effectiveness of the proposed method in grasping target objects in cluttered scenes, achieving higher success rates and reduced runtime compared to baseline methods. The realized code and video are available at https://github.com/L-tj/7DGCG.

Abstract:
Event cameras are bio-inspired sensors with some notable features, including high dynamic range and low latency, which makes them exceptionally suitable for perception in challenging scenarios such as high-speed motion and extreme lighting conditions. In this paper, we explore their potential for localization within pre-existing LiDAR maps, a critical task for applications that require precise navigation and mobile manipulation. Our framework follows a paradigm based on the refinement of an initial pose. Specifically, we first project LiDAR points into 2D space based on a rough initial pose to obtain depth maps, and then employ an optical flow estimation network to align events with LiDAR points in 2D space, followed by camera pose estimation using a PnP solver. To enhance geometric consistency between these two inherently different modalities, we develop a novel frame-based event representation that improves structural clarity. Additionally, given the varying degrees of bias observed in the ground truth poses, we design a module that predicts an auxiliary variable as a regularization term to mitigate the impact of this bias on network convergence. Experimental results on several public datasets demonstrate the effectiveness of our proposed method. To facilitate future research, both the code and the pre-trained models are made available online 11https://github.com/EasonChen99/EVLoc.

Abstract:
Heterogeneous robots can work together to accomplish a variety of complex tasks and have shown great potential in many fields. There are many efforts to make robot task orchestration more efficient. However, current methods still have some limitations, including the lack of a high-level abstraction for programming method and fault handling mechanism. In this paper, we design a state machine-based, fault-tolerant framework for heterogeneous multi-robot collaboration named HeRo, to effectively support the development of heterogeneous multi-robot systems. HeRo has three key techniques: (1) a state machine-based programming language to flexibly model robot behaviors and tasks; (2) a state synchronization mechanism to achieve information exchange and maintain the consistency among heterogeneous robots in distributed environments; (3) a fault detection and recovery mechanism to monitor the system's runtime states and use Large Language Model (LLM) combined with Planning Domain Definition Language (PDDL) to enable automated recovery. We evaluate the effectiveness and fault recovery capability of the framework by setting up manufacturing task and fault scenarios with varying difficulty in the ARIAC simulation environment, achieving a 100% task completion rate, with low system overhead and flexible scalability.

Abstract:
Differentiable simulation has become a powerful tool for system identification. While prior work has focused on identifying robot properties using robot-specific data or object properties using object-specific data, our approach calibrates object properties by using information from the robot, without relying on data from the object itself. Specifically, we utilize robot joint encoder information, which is commonly available in standard robotic systems. Our key observation is that by analyzing the robot's reactions to manipulated objects, we can infer properties of those objects, such as inertia and softness. Leveraging this insight, we develop differentiable simulations of robot-object interactions to inversely identify the properties of the manipulated objects. Our approach relies solely on proprioception – the robot's internal sensing capabilities – and does not require external measurement tools or vision-based tracking systems. This general method is applicable to any articulated robot and requires only joint position information. We demonstrate the effectiveness of our method on a low-cost robotic platform, achieving accurate mass and elastic modulus estimations of manipulated objects with just a few seconds of computation on a laptop.

Abstract:
A critical goal of adaptive control is enabling robots to rapidly adapt in dynamic environments. Recent studies have developed a meta-learning-based adaptive control scheme, which uses meta-learning to extract nonlinear features (represented by Deep Neural Networks (DNNs)) from offline data, and uses adaptive control to update linear coefficients online. However, such a scheme is fundamentally limited by the linear parameterization of uncertainties and does not fully unleash the capability of DNNs. This paper introduces a novel learning-based adaptive control framework that pretrains a DNN for adaptation via self-supervised meta-learning (SSML) from offline trajectories and online adapts the full DNN via composite adaptation. In particular, the offline SSML stage leverages the time consistency in trajectory data to train the DNN to predict future disturbances from history, in a selfsupervised manner without environment condition labels. The online stage carefully designs a control law and an adaptation law to update the full DNN with stability guarantees. Empirically, the proposed framework significantly outperforms (19-39%) various classic and learning-based adaptive control baselines, in challenging real-world quadrotor tracking problems under large dynamic wind disturbance11Experimental videos are in project website: https://sites.google.com/view/ssml-ac-project

Abstract:
First-order Policy Gradient (FoPG) algorithms such as Backpropagation through Time and Analytical Policy Gradients leverage local simulation physics to accelerate policy search, significantly improving sample efficiency in robot control compared to standard model-free reinforcement learning. However, FoPG algorithms can exhibit poor learning dynamics in contact-rich tasks like locomotion. Previous approaches address this issue by alleviating contact dynamics via algorithmic or simulation innovations. In contrast, we propose guiding the policy search by learning a residual over a simple baseline policy. For quadruped locomotion, we find that the role of residual policy learning in FoPG-based training (FoPG RPL) is primarily to improve asymptotic rewards, compared to improving sample efficiency for model-free RL. Additionally, we provide insights on applying FoPG's to pixel-based local navigation, training a point-mass robot to convergence within seconds. Finally, we showcase the versatility of FoPG RPL by using it to train locomotion and perceptive navigation end-toend on a quadruped in minutes.

Abstract:
Large real-world driving datasets have sparked significant research into various aspects of learning-based motion planners for autonomous driving. These include data augmentation, model architecture, reward design, training strategies, and planner pipelines. In this paper, we review and benchmark previous methods. Experiments show that many of these approaches have limited generalization abilities in planning performance due to overly complex designs or training paradigms. Experiments further reveal that as models are appropriately scaled, many designs become redundant. Therefore, we introduce StateTransformer-2 (STR2), a scalable, decoder-only motion planner. STR2uses a Vision Transformer (ViT) encoder and a mix-of-experts (MoE) causal transformer architecture. The MoE backbone addresses modality collapse and reward balancing by expert routing during training. Extensive experiments on the NuPlan dataset show that our method generalizes better than previous approaches across different test sets and closed-loop simulations. We evaluate its scalability on billions of real-world urban driving scenarios, demonstrating consistent accuracy improvements as both data and model size grow.

Abstract:
Robotic steering of magnetic guidewires has shown great potential in accelerating endovascular interventions, enhancing the success rate of time-sensitive surgeries such as stroke treatment. Incomplete state feedback of the guidewire from 2D perspective images and unknown interactions with the surrounding vessel wall raise challenges in modeling and steering control. These two factors, however, are commonly overlooked by existing works. In this paper, 2D perspective images of the guidewire, which comply with prevalent medical imaging modalities, are used as the only feedback. A model-based external force observer is proposed that allows the guidewire to perceive the unknown interactions, and a compliance controller is subsequently designed to handle the external force while steering the guidewire. Experiments conducted in a human-sized phantom demonstrate how the compliance controller preserves stability and safety by adapting to the estimated external force.

Abstract:
Chronic wounds, including diabetic ulcers, pressure ulcers, and ulcers secondary to venous hypertension, affects more than 6.5 million patients and a yearly cost of more than 25 billion in the United States alone. Chronic wound treatment is currently a manual process, and we envision a future where robotics and automation will aid in this treatment to reduce cost and improve patient care. In this work, we present the development of the first robotic system for wound dressing removal which is reported to be the worst aspect of living with chronic wounds. Our method leverages differentiable physics-based simulation to perform gradient-based trajectory optimization for peeling trajectory planning. By integrating fracture mechanics of adhesion, we are able to model the peeling effect inherent to dressing adhesion. The system is further guided by carefully designed objective functions that promote both efficient and safe control, reducing the risk of tissue damage. We validated the efficacy of our approach through a series of experiments conducted on both synthetic skin phantoms and real human subjects. Our results demonstrate the system's ability to achieve precise and safe dressing removal trajectories, offering a promising solution for automating this essential healthcare procedure.

Abstract:
Arboreal environments challenge current robots but are deftly traversed by many familiar animals such as squirrels. We present a small, 450 g robot “Pinto” developed for tree-jumping, a behavior seen in squirrels but rarely in legged robots: jumping from the ground onto a vertical tree trunk. We develop a powerful and lightweight latched series-elastic actuator using a twisted string and carbon fiber springs. We consider the effects of scaling down conventional quadrupeds and experimentally show how storing energy in a parallel-elastic fashion using a latch increases jump energy compared to series-elastic or springless strategies. By switching between series and parallel-elastic modes with our latched 5-bar leg mechanism, Pinto executes energetic jumps as well as maintains continuous control during shorter bounding motions. We also develop sprung 2-DoF arms equipped with spined grippers to grasp tree bark for high-speed perching following a jump.

Abstract:
Monocular depth estimation from RGB images plays a pivotal role in 3D vision. However, its accuracy can deteriorate in challenging environments such as nighttime or adverse weather conditions. While long-wave infrared cameras offer stable imaging in such challenging conditions, they are inherently low-resolution, lacking rich texture and semantics as delivered by the RGB image. Current methods focus solely on a single modality due to the difficulties to identify and integrate faithful depth cues from both sources. To address these issues, this paper presents a novel approach that identifies and integrates dominant cross-modality depth features with a learning-based framework. Concretely, we independently compute the coarse depth maps with separate networks by fully utilizing the individual depth cues from each modality. As the advantageous depth spreads across both modalities, we propose a novel confidence loss steering a confidence predictor network to yield a confidence map specifying latent potential depth areas. With the resulting confidence map, we propose a multi-modal fusion network that fuses the final depth in an end-to-end manner. Harnessing the proposed pipeline, our method demonstrates the ability of robust depth estimation in a variety of difficult scenarios. Experimental results on the challenging \textMS^2 and ViViD++ datasets demonstrate the effectiveness and robustness of our method.

Abstract:
Camera and LiDAR serve as informative sensors for accurate and robust autonomous driving systems. However, these sensors often exhibit heterogeneous natures, resulting in distributional modality gaps that present significant challenges for fusion. To address this, a robust fusion technique is crucial, particularly for enhancing 3D object detection. In this paper, we introduce a dynamic adjustment technology aimed at aligning modal distributions and learning effective modality representations to enhance the fusion process. Specifically, we propose a triphase domain aligning module. This module adjusts the feature distributions from both the camera and LiDAR, bringing them closer to the ground truth domain and minimizing differences. Additionally, we explore improved representation acquisition methods for dynamic fusion, which includes modal interaction and specialty enhancement. Finally, an adaptive learning technique that merges the semantics and geometry information for dynamical instance optimization. Extensive experiments in the nuScenes dataset present competitive performance with state-of-the-art approaches. Our code will be released in the future.

Abstract:
Efficient real-time trajectory planning and control for fixed-wing unmanned aerial vehicles is challenging due to their non-holonomic nature, complex dynamics, and the additional uncertainties introduced by unknown aerodynamic effects. In this paper, we present a fast and efficient real-time trajectory planning and control approach for fixed-wing unmanned aerial vehicles, leveraging the differential flatness property of fixed-wing aircraft in coordinated flight conditions to generate dynamically feasible trajectories. The approach provides the ability to continuously replan trajectories, which we show is useful to dynamically account for the curvature constraint as the aircraft advances along its path. Extensive simulations and real-world experiments validate our approach, showcasing its effectiveness in generating trajectories even in challenging conditions for small FW such as wind disturbances.

Abstract:
End-to-end autonomous driving aims to produce planning trajectories from raw sensors directly. Currently, most approaches integrate perception, prediction, and planning modules into a fully differentiable network, promising great scalability. However, these methods typically rely on deterministic modeling of online maps in the perception module for guiding or constraining vehicle planning, which may incorporate erroneous perception information and further compromise planning safety. To address this issue, we delve into the importance of online map uncertainty for enhancing autonomous driving safety and propose a novel paradigm named UncAD. Specifically, UncAD first estimates the uncertainty of the online map in the perception module. It then leverages the uncertainty to guide motion prediction and planning modules to produce multi-modal trajectories. Finally, to achieve safer autonomous driving, UncAD proposes an uncertainty-collision-aware planning selection strategy according to the online map uncertainty to evaluate and select the best trajectory. In this study, we incorporate UncAD into various state-of-the-art (SOTA) end-to-end methods. Experiments on the nuScenes dataset show that integrating UncAD, with only a 1.9% increase in parameters, can reduce collision rates by up to 26% and drivable area conflict rate by up to 42%. Codes, pre-trained models, and demo videos can be accessed at https://github.com/pengxuanyang/UncAD.

Abstract:
Most stereo matching networks assume that the stereo images are perfectly rectified, ignoring the perturbation of extrinsic parameters due to collisions, mechanical vibrations, and thermal expansion. This leads to poor rectification robustness in real-world stereo systems. That is, even minor rectification errors can lead to failure, making stereo systems unreliable for long-term autonomous operation in complex environments. In this paper, we are the first to propose a frequency filtering-based rectification robustness (\mathbfF^2 \mathbfR^2) method for stereo matching, which aims to enhance the robustness of existing stereo networks to rectification errors. Specifically, we propose a sensitive frequency filter (SFF) to remove components susceptible to rectification errors within the frequency domain. SFF achieves the filtering through the learning-based adaptive filtering mask (AFM) guided by the spatial-frequency mapping modulation mask (SFM). Moreover, we build the matching feature reconstruction module (MFRM) to recover the features lost during filtering to benefit cost aggregation. Comprehensive experiments on simulated datasets and self-collected data validate that our method can significantly enhance the rectification robustness of stereo matching networks.

Abstract:
Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the details in the human manipulation process, and enable robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain-of-Modality (CoM), a prompting strategy that enables Vision Language Models to reason about multimodal human demonstration data - videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with strong generalization to new task setups and objects in real-world robot experiments. Videos and code are available at chain-of-modality.github.io

Abstract:
Velocity Obstacles (VO) methods form a paradigm for collision avoidance strategies among moving obstacles and agents. While VO methods perform well in simple multi-agent environments, they do not guarantee safety and can show overly conservative behavior in common situations. In this paper, we propose to combine a VO strategy for guidance with a Control Barrier Function approach for safety, which overcomes the overly conservative behavior of VOs and formally guarantees safety. We validate our method in a baseline comparison study, using second-order integrator and car-like dynamics. Results support that our method outperforms the baselines with respect to path smoothness, collision avoidance, and success rates.

Abstract:
Cell aspiration is a common micro-manipulation technique for cell transfer, particularly in in vitro fertilization (IVF) procedures. The minuscule volume of a cell (pL) and limited damping provided by the medium make it challenging to accurately and quickly aspirate a cell to the desired position inside the micropipette. Experienced clinicians intentionally insert an air segment inside the micropipette in advance to make the aspiration easier. Nevertheless, the unclear damping effects and the varying initial length of the air segment in each aspiration pose difficulties for most operators. Inadequate judgment and response may lead to overshoot or even loss of the cell. This paper constructs a nonlinear dynamics model to elucidate the cell motion inside a micropipette containing an inserted air segment. The model reveals the impact of the air segment. A model-based controller is designed to facilitate the accurate aspiration of human sperm to a desired position, incorporating an estimated initial length of the air segment. Experiments were conducted to quantitatively evaluate the performance of both the model and the controller involving various initial air segment lengths. The results demonstrated a 100 % success rate in 50 sperm aspiration experiments, achieving an average positional accuracy within \pm 2 pixels and an average settling time of 5.89 seconds.

Abstract:
Visual observation of objects is essential for many robotic applications, such as object reconstruction and manipulation, navigation, and scene understanding. Machine learning algorithms constitute the state-of-the-art in many fields but require vast data sets, which are costly and time-intensive to collect. Automated strategies for observation and exploration are crucial to enhance the efficiency of data gathering. Therefore, a novel strategy utilizing the Next-Best-Trajectory principle is developed for a robot manipulator operating in dynamic environments. Local trajectories are generated to maximize the information gained from observations along the path while avoiding collisions. We employ a voxel map for environment modeling and utilize raycasting from perspectives around a point of interest to estimate the information gain. A global ergodic trajectory planner provides an optional reference trajectory to the local planner, improving exploration and helping to avoid local minima. To enhance computational efficiency, raycasting for estimating the information gain in the environment is executed in parallel on the graphics processing unit. Benchmark results confirm the efficiency of the parallelization, while real-world experiments demonstrate the strategy's effectiveness.

Abstract:
Special purpose compliant end-effectors are effective in realizing task-appropriate passive compliance. This paper presents a programmable, 3 -fingered, antagonistic, compliant hand (P3ACH) capable of realizing a desired compliant behavior within a large space of multidirectional compliant behaviors. Manipulation dexterity is demonstrated by performing different assembly tasks faster, more robustly, and with lower contact forces than an active system realizing the same compliant behavior.

Abstract:
Electroadhesion (EA), as an electrostatically driven, controllable adhesion technology, has unique attributes such as low noise, robust adaptability, and energy efficiency. However, its adhesion pressure is still low (0.1~10kPa) which may significantly limit its applications. This paper presents an innovative electroadhesion pad embedded with liquid and solid dielectrics. The experiments demonstrate that this liquid-solid electroadhesion pad (LSEAP) is capable of much larger adhesion pressure, compared to the traditional solid electroadhesion pad (SEAP). On one hand, the LSEAP can increase the dielectric contact with the substrate. On the other hand, the actuator can increase its dielectric strength. We also explore the application of this actuator to perching of a commercial Unmanned Aerial Vehicle (UAV), in order to promote the UAV's sustainable flight. Notably, the untethered LSEAP system, with an adhesion area as small as 4 cm2 and a self-weight as light as 8.7 g, can support an UAV of 249.7 g for stable adhesion on various surfaces. The adhesion pressure generated by our LSEAD can be 32.2kPa, significantly larger than those reported in the literature. The weight ratio of the UAV to the LSEAP system is 14.6, more than double those in previous studies. The integration of this EA system markedly prolongs the operational duration of UAVs, rendering them suitable for sustainable surveillance and reconnaissance missions. This LSEAP also marks a pivotal advancement towards adhesion-based applications such as grippers and wall-climbing robots.

Abstract:
Robotic weed removal in precision agriculture introduces a repetitive heterogeneous task planning (RHTP) challenge for a mobile manipulator. RHTP has two unique characteristics: 1) an observe-first-and-manipulate-later (OFML) temporal constraint that forces a unique ordering of two different tasks for each target and 2) energy savings from efficient task collocation to minimize unnecessary movements. RHTP can be framed as a stochastic renewal process. According to the Renewal Reward Theorem, the expected energy usage per task cycle is the long-run average. Traditional task and motion planning focuses on feasibility rather than optimality due to the unknown object and obstacle position prior to execution. However, the known target/obstacle distribution in precision agriculture allows minimizing the expected energy usage. For each instance in this renewal process, we first compute task space partition, a novel data structure that computes all possibilities of task multiplexing and its probabilities with robot reachability. Then we propose a region-based setcoverage problem to formulate the RHTP as a mixed-integer nonlinear programming. We have implemented and solved RHTP using Branch-and-Bound solver. Compared to a baseline in simulations based on real field data, the results suggest a significant improvement in path length, number of robot stops, overall energy usage, and number of replans.

Abstract:
This paper addresses the challenge of developing a multi-arm quadrupedal robot capable of efficiently harvesting fruit in complex, natural environments. To overcome the inherent limitations of traditional bimanual manipulation, we introduce the first three-arm quadrupedal robot LocoHarv3, that builds on top of the Spot quadruped, and propose a novel hierarchical tri-manual planning approach for automated fruit harvesting with collision-free trajectories between the built-in end-effector of Spot and our custom-made bimanual manipulator. Our comprehensive semi-autonomous framework integrates teleoperation, supported by LiDAR-based odometry and mapping, with learning-based visual perception for accurate fruit detection and pose estimation. Validation is conducted through a series of controlled indoor experiments using motion capture and extensive field tests in natural settings. Results demonstrate a 90 % success rate in in-lab settings with a single attempt, and field trials further verify the system's robustness and efficiency in more challenging real-world environments.

Abstract:
Multi-robot navigation in complex environments relies on inter-robot communication and mutual observation for situational awareness. This paper studies the multi-robot navigation problem in unknown environments with line-ofsight (LoS) connectivity constraints. While previous works are limited to known environment models to derive the LoS constraints between robots, this paper eliminates such requirements by directly formulating the LoS constraints from realtime LiDAR scans, adopting techniques in point cloud visibility analysis. Based on that, we propose a novel LoS-distance metric to quantify both the urgency and sensitivity of losing LoS between robots considering their potential movements. Moreover, to address the imbalanced urgency of losing LoS between two robots, we design a fusion function to capture the overall urgency while generating gradients that facilitate robots' collaborative behavior to maintain LoS. The team connectivity is guaranteed by encoding the LoS constraints into a potential function that preserves the positivity of the Fiedler eigenvalue of robots' underlying graph. Finally, we establish a LoS-constrained exploration framework integrating the proposed connectivity controller. We showcase its applications in multi-robot exploration in complex unknown environments, where robots can always maintain the LoS connectivity through distributed sensing and communication while collaboratively exploring unknown environments. Our implementations are available at https://github.com/bairuofei/LoS_constrained_navigation.

Abstract:
Aerial Manipulators (AMs) provide a versatile platform for various applications, including 3D printing, architecture, and aerial grasping missions. However, their operational speed is often sacrificed to uphold precision. Existing control strategies for AMs often regard the manipulator as a disturbance and employ robust control methods to mitigate its influence. This research focuses on elevating the precision of the end-effector and enhancing the agility of aerial manipulator movements. We present a composite control scheme to address these challenges. Initially, a Nonlinear Disturbance Observer (NDOB) is utilized to compensate for internal coupling effects and external disturbances. Subsequently, manipulator dynamics are processed through a high pass filter to facilitate agile movements. By integrating the proposed control method into a fully autonomous delta-arm-based AM system, we substantiate the controller's efficacy through extensive real-world experiments. The outcomes illustrate that the end-effector can achieve accuracy at the millimeter level.

Abstract:
The flapping-wing robotic birds were inspired by nature to present an alternative way of thrust and lift generation instead of conventional high-speed rotary propellers in unmanned aerial platforms. The advances in flapping technology recently led to the prototyping of leg-claw mechanisms for perching and occasionally very lightweight arms for sampling or tiny object aerial manipulation. A dual-arm manipulator on top of a robotic bird might not be bio-inspired and safe in case of a collision with the environment or human-robot interaction. Here in this work, the previously designed dual-arm scissors-type manipulator has been improved in terms of workspace, mechanism, vision system, and blade placement to present a more natural way of sampling. The new dual-arm, with 100.2(g) weight, is redesigned inside a beak to have protection against possible collisions and also secure the cutting blades within a protected shield. During the flight, the dual-arm system is inside the cover and invisible; the lower beak is opened before manipulation and sets out the arm in a proper place for sampling. This new safety cover (beak) along with the new blade mechanism enhanced the cutting power and the safety of the operation. The experimental results show the successful cutting of a series of plant samples.

Abstract:
While legged robots have achieved significant advancements in recent years, ensuring the robustness of their controllers on unstructured terrains remains challenging. It requires generating diverse and challenging unstructured terrains to test the robot and discover its vulnerabilities. This topic remains underexplored in the literature. This paper presents a Quality-Diversity framework to generate diverse and challenging terrains that uncover weaknesses in legged robot controllers. Our method, applied to both simulated bipedal and quadruped robots, produces an archive of terrains optimized to challenge the controller in different ways. Quantitative and qualitative analyses show that the generated archive effectively contains terrains that the robots struggled to traverse, presenting different failure modes. Interesting results were observed, including failure cases that were not necessarily expected. Experiments show that the generated terrains can also be used to improve RL-based controllers.

Abstract:
Imitation learning through a demonstration interface is expected to learn policies for robot automation from intuitive human demonstrations. However, due to the differences in human and robot movement characteristics, a human expert might unintentionally demonstrate an action that the robot cannot execute. We propose feasibility-aware behavior cloning from observation (FABCO). In the FABCO framework, the feasibility of each demonstration is assessed using the robot's pre-trained forward and inverse dynamics models. This feasibility information is provided as visual feedback to the demonstrators, encouraging them to refine their demonstrations. During policy learning, estimated feasibility serves as a weight for the demonstration data, improving both the data efficiency and the robustness of the learned policy. We experimentally validated FABCO's effectiveness by applying it to a pipette insertion task involving a pipette and a vial. Four participants assessed the impact of the feasibility feedback and the weighted policy learning in FABCO. Additionally, we used the NASA Task Load Index (NASA-TLX) to evaluate the workload induced by demonstrations with visual feedback.

Abstract:
This paper presents CoCube a tabletop modular robotics platform designed for robotics education and multirobot algorithm research. CoCube is characterized by its low cost low floors high ceilings and wide walls offering flexibility and broad applicability across various use cases. The platform comprises four key components: CoCube robots which integrate wireless communication movement and interaction; CoModules which provide versatile external functionality; CoMaps which enable high-precision localization via microdot patterns on regular printed paper; and CoTags for interaction. CoCube operates on MicroBlocks a blocks programming language for physical computing inspired by MIT Scratch a widely-used coding language with a simple visual interface that makes programming accessible to young learners. It offers users both flexibility and ease of use with advanced API support for more complex applications. This paper details the design of the CoCube platform and demonstrates its potential in both educational and research contexts.

Abstract:
Robotic catching has traditionally focused on single-handed systems, which are limited in their ability to handle larger or more complex objects. In contrast, bimanual catching offers significant potential for improved dexterity and object handling but introduces new challenges in coordination and control. In this paper, we propose a novel framework for learning dexterous bimanual catching skills using Heterogeneous-Agent Reinforcement Learning (HARL). Our approach introduces an adversarial reward scheme, where a throw agent increases the difficulty of throws-adjusting speed-while a catch agent learns to coordinate both hands to catch objects under these evolving conditions. We evaluate the framework in simulated environments using 15 different objects, demonstrating robustness and versatility in handling diverse objects. Our method achieved approximately a 2 x increase in catching reward compared to single-agent baselines across \mathbf1 5 diverse objects.

Abstract:
We present Flat'n'Fold, a novel large-scale dataset for garment manipulation that addresses critical gaps in existing datasets. Comprising 1,212 human and 887 robot demonstrations of flattening and folding 44 unique garments across 8 categories, Flat'n'Fold surpasses prior datasets in size, scope, and diversity. Our dataset uniquely captures the entire manipulation process from crumpled to folded states, providing synchronized multi-view RGB-D images, point clouds, and action data, including hand or gripper positions and rotations. We quantify the dataset's diversity and complexity compared to existing benchmarks and show that our dataset features natural and diverse manipulations of real-world demonstrations of human and robot demonstrations in terms of visual and action information. To showcase Flat'n'Fold's utility, we establish new benchmarks for grasping point prediction and subtask decomposition. Our evaluation of state-of-the-art models on these tasks reveals significant room for improvement. This underscores Flat'n'Fold's potential to drive advances in robotic perception and manipulation of deformable objects. Our dataset can be downloaded at https://cvas-ug.github.io/flat-n-fold

Abstract:
Training and preparing first responders and humanitarian robots for Mass Casualty Incidents (MCIs) often poses a challenge owing to the lack of realistic and easily accessible test facilities. While such facilities can offer realistic scenarios post an MCI that can serve training and educational purposes for first responders and humanitarian robots, they are often hard to access owing to logistical constraints. To overcome this challenge, we present HEROES- a versatile Unreal Engine-based simulator for designing novel training simulations for humans and emergency robots for such urban search and rescue operations. The proposed HEROES simulator is capable of generating synthetic datasets for machine learning pipelines that are used for training robot navigation. This work addresses the necessity for a comprehensive training platform in the robotics community, ensuring pragmatic and efficient preparation for real-world emergency scenarios. The strengths of our simulator lie in its adaptability, scalability, and ability to facilitate collaboration between robot developers and first responders, fostering synergy in developing effective strategies for search and rescue operations in MCIs. We conducted a preliminary user study with an average score of 8.1 out of 10 supporting the ability of HEROES to generate sufficiently varied environments and a score of 7.8 out of 10 affirming the usefulness of the simulation environment. HEROES has been integrated with ROS and has been used to train an RL model for a real robot as a proof of concept.

Abstract:
Robot autonomy is driving an ever-increasing demand for computational power, including on-board multi-core CPUs and accelerators such as GPUs, to enable fast perception, planning, control, and more. Careful scheduling of these computational tasks on the CPU cores and GPUs is important to prevent locking up the finite computational capacity in ways that hinder other critical workloads; delays in computing time-critical tasks like obstacle detection and control can have huge negative consequences for autonomous robots, potentially resulting in damage, substantial financial loss, or even loss of life. In this paper, we leverage recent advances from real-time systems research. We apply TimeWall, a component-based real-time framework, to the computational components of an autonomous drone and experimentally show that the timeliness and safe operation properties of a drone are preserved even in the presence of increasing interfering computational processes.

Abstract:
This paper introduces the optimal design manifold as a novel approach for understanding and optimizing the design of robotic systems. Existing optimization frameworks often jointly optimize design and behavior but lack insight into why specific designs are optimal for given tasks. Additionally, a functionally optimal design may not always be the most practical to build and practicality cannot always be captured by an objective function. By defining and learning the optimal design manifold, which represents the space of all optimal solutions, we provide a systematic method for exploring the design space and selecting the most practical optimal design. We apply the optimal design manifold to robot cell layout optimization, robot design optimization, and multi-camera placement and demonstrate its effectiveness in enhancing design choices by enabling a deeper understanding of what makes a design optimal.

Abstract:
Compact quadrupedal robots are proving increasingly suitable for deployment in real-world scenarios. Their smaller size fosters easy integration into human environments. Nevertheless, real-time locomotion on uneven terrains remains challenging, particularly due to the high computational demands of terrain perception. This paper presents a robust reinforcement learning-based exteroceptive locomotion controller for resource-constrained small-scale quadrupeds in challenging terrains, which exploits real-time elevation mapping, supported by a careful depth sensor selection. We concurrently train both a policy and a state estimator, which together provide an odometry source for elevation mapping, optionally fused with visual-inertial odometry (VIO). We demonstrate the importance of positioning an additional time-of-flight sensor for maintaining robustness even without VIO, thus having the potential to free up computational resources. We experimentally demonstrate that the proposed controller can flawlessly traverse steps up to 17.5 cm in height and achieve an 80% success rate on 22.5 cm steps, both with and without VIO. The proposed controller also achieves accurate forward and yaw velocity tracking of up to 1.0 m/s and 1.5 rad/s respectively. We open-source our training code at github.com/ETH-PBL/elmap-rl-controller.

Abstract:
Humans learn from observations and experiences to adjust their behaviours towards better performance. Interacting with such dynamic humans is challenging, as the robot needs to predict the humans accurately for safe and efficient operations. Long-term interactions with dynamic humans have not been extensively studied by prior works. We propose an adaptive human prediction model based on the Theory-of-Mind (ToM), a fundamental social-cognitive ability that enables humans to infer others' behaviours and intentions. We formulate the human internal belief about others using a game-theoretic model, which predicts the future motions of all agents in a navigation scenario. To estimate an evolving belief, we use an Unscented Kalman Filter to update the behavioural parameters in the human internal model. Our formulation provides unique interpretability to dynamic human behaviours by inferring how the human predicts the robot. We demonstrate through longterm experiments in both simulations and real-world settings that our prediction effectively promotes safety and efficiency in downstream robot planning. Code will be available at https://github.com/centiLinda/AToM-human-prediction.git.

Abstract:
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have been proven effective as a succinct representation for capturing essential object features, and for establishing a reference frame in action prediction, enabling data-efficient learning of robot skills. However, their manual design nature and reliance on additional human labels limit their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate taskrelevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifies them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels. Videos can be found at https://kalm-il.github.io/.

Abstract:
Accurately predicting 3D occupancy grids from visual inputs is critical for autonomous driving, but current discriminative methods struggle with noisy data, incomplete observations, and the complex structures inherent in 3D scenes. In this work, we reframe 3D occupancy prediction as a generative modeling task using diffusion models, which learn the underlying data distribution and incorporate 3D scene priors. This approach enhances prediction consistency, noise robustness, and better handles the intricacies of 3D spatial structures. Our extensive experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches, delivering more realistic and accurate occupancy predictions, especially in occluded or low-visibility regions. Moreover, the improved predictions significantly benefit downstream planning tasks, highlighting the practical advantages of our method for real-world autonomous driving applications.

Abstract:
Autonomous parking has become a critical application in automatic driving research and development. Parking operations often suffer from limited space and complex environments, requiring accurate perception and precise maneuvering. Traditional rule-based parking algorithms struggle to adapt to diverse and unpredictable conditions, while learning-based algorithms lack consistent and stable performance in various scenarios. Therefore, a hybrid approach is necessary that combines the stability of rule-based methods and the generalizability of learning-based methods. Recently, reinforcement learning (RL) based policy has shown robust capability in planning tasks. However, the simulation-to-reality (sim-to-real) transfer gap seriously blocks the real-world deployment. To address these problems, we employ a hybrid policy, consisting of a rule-based Reeds-Shepp (RS) planner and a learningbased reinforcement learning (RL) planner. A real-time LiDARbased Occupancy Grid Map (OGM) representation is adopted to bridge the sim-to-real gap, leading the hybrid policy can be applied to real-world systems seamlessly. We conducted extensive experiments both in the simulation environment and real-world scenarios, and the result demonstrates that the proposed method outperforms pure rule-based and learningbased methods. The real-world experiment further validates the feasibility and efficiency of the proposed method.

Abstract:
Inverse kinematics is a fundamental technique for motion and positioning control in robotics, typically applied to end-effectors. In this paper, we extend the concept of inverse kinematics to guiding vector fields for path following in autonomous mobile robots. The desired path is defined by its implicit equation, i.e., by a collection of points belonging to one or more zero-level sets. These level sets serve as a reference to construct an error signal that drives the guiding vector field toward the desired path, enabling the robot to converge and travel along the path by following such a vector field. We start with the formal exposition on how inverse kinematics can be applied to guiding vector fields for single-integrator robots in an m-dimensional Euclidean space. Then, we leverage inverse kinematics to ensure that the level-set error signal behaves as a linear system, facilitating control over the robot's transient motion toward the desired path and allowing for the injection of feed-forward signals to induce precise motion behavior along the path. We then propose solutions to the theoretical and practical challenges of applying this technique to unicycles with constant speeds to follow 2D paths with precise transient control. We finish by validating the predicted theoretical results through real flights with fixed-wing drones.

Abstract:
The development of unmanned aerial vehicles (UAVs) with extended maneuverability has unlocked new applications such as complex inspection tasks at height. In this work, we introduce the Dragonfly drone, a novel tilt-rotor body-morphing UAV, capable of altering its shape and orientation without compromising its position tracking. Unlike most existing UAV designs that only target at decoupling position and orientation control, Dragonfly can also perform unique body-morphing in flight, featuring all six degrees of freedom in every morphology. This enables navigation into tight gaps with irregular shapes, conforming to obstacles of varying geometries, and maintaining physical contact with uneven surfaces. Such capabilities make our design particularly effective for complex inspection tasks at height, such as pipe or bridge inspection. Our contributions include the mechanical design of the system, the modeling and control strategies employed, and the realrobot experiments with a prototype platform. See Dragonfly drone in action: https://youtu.be/YxoV_Qt_5XE.

Abstract:
Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at https://github.com/CNJianLiu/MonoDiff9D.

Abstract:
Learning visuomotor policy for multi-task robotic manipulation has been a long-standing challenge for the robotics community. The difficulty lies in the diversity of action space: typically, a goal can be accomplished in multiple ways, resulting in a multimodal action distribution for a single task. The complexity of action distribution escalates as the number of tasks increases. In this work, we propose Discrete Policy, a robot learning method for training universal agents capable of multi-task manipulation skills. Discrete Policy employs vector quantization to map action sequences into a discrete latent space, facilitating the learning of task-specific codes. These codes are then reconstructed into the action space conditioned on observations and language instruction. We evaluate our method on both simulation and multiple real-world embodiments, including both single-arm and bimanual robot settings. We demonstrate that our proposed Discrete Policy outperforms a well-established Diffusion Policy baseline and many state-of-the-art approaches, including ACT, Octo, and OpenVLA. For example, in a real-world multi-task training setting with five tasks, Discrete Policy achieves an average success rate that is \mathbf2 6 % higher than Diffusion Policy and 15% higher than OpenVLA. As the number of tasks increases to 12, the performance gap between Discrete Policy and Diffusion Policy widens to 32.5 %, further showcasing the advantages of our approach. Our work empirically demonstrates that learning multi-task policies within the latent space is a vital step toward achieving general-purpose agents. Our project is at https://discretepolicy.github.io.

Abstract:
Reinforcement learning-based mapless navigation holds significant potential. However, it faces challenges in indoor environments with local minima area. This paper introduces a safe mapless navigation framework utilizing hierarchical reinforcement learning (HRL) to enhance navigation through such areas. The high-level policy creates a sub-goal to direct the navigation process. Notably, we have developed a sub-goal update mechanism that considers environment congestion, efficiently avoiding the entrapment of the robot in local minimum areas. The low-level motion planning policy, trained through safe reinforcement learning, outputs real-time control instructions based on acquired sub-goal. Specifically, to enhance the robot's environmental perception, we introduce a new obstacle encoding method that evaluates the impact of obstacles on the robot's motion planning. To validate the performance of our HRL-based navigation framework, we conduct simulations in office, home, and restaurant environments. The findings demonstrate that our HRL-based navigation framework excels in both static and dynamic scenarios. Finally, we implement the HRL-based navigation framework on a TurtleBot3 robot for validation experiments, which exhibits its strong generalization capabilities.

Abstract:
In robot manipulation, Reinforcement Learning (RL) often suffers from low sample efficiency and uncertain convergence, especially in large observation and action spaces. Foundation Models (FMs) offer an alternative, demonstrating promise in zero-shot and few-shot settings. However, they can be unreliable due to limited physical and spatial understanding. We introduce ExploRLLM, a method that combines the strengths of both paradigms. In our approach, FMs improve RL convergence by generating policy code and efficient representations, while a residual RL agent compensates for the FMs' limited physical understanding. We show that Explorllm outperforms both policies derived from FMs and RL baselines in table-top manipulation tasks. Additionally, real-world experiments show that the policies exhibit promising zero-shot sim-to-real transfer. Supplementary material is available at https://explorllm.github.io.

Abstract:
For robots tasked with surveying the temporal dynamics of a changing environment, a choice must be made to observe novel regions of the environment or to re-survey previously visited regions, which may have changed. We present a novel multi-robot informative path planner (IPP) that combines an environmental and task kernel to direct mobile robots to gather samples from regions that would result in the greatest expected improvement in map accuracy. Our planner utilizes a multi-output Gaussian process to unify priors about the spatiotemporal environment along with priors about observational correlations between sensing vehicles. Additionally, we extend our analysis into an adaptive planning scenario and examine the performance under different planning configurations. We find that planning performance is largely driven by the choice of environmental priors, and that unrepresentative priors can be improved through adaptive planning.

Abstract:
Controlling articulated soft robots (ASRs) driven by variable stiffness actuators (VSAs) is challenging because they are highly nonlinear and difficult to model accurately. This paper proposes an efficient neural network (NN) learning control solution for ASRs driven by agonistic-antagonistic (AA)-VSAs to guarantee tracking performance without exact robot models. Composite learning resorts to memory regressor extension to enhance adaptive parameter estimation such that parameter convergence can be guaranteed without the stringent condition of persistent excitation. In the proposed method, an NN-based controller is constructed for the position tracking of AA-VSA-driven ASRs, and an NN weight update law based on composite learning is developed to enhance online modeling and control capabilities. Experiments are carried out on an ASR with three degrees of freedom and qbmove Advance actuators (a kind of AA-VSAs), which have validated the effectiveness and superiority of the proposed method in terms of modeling and tracking accuracy compared with existing control methods.

Abstract:
Soft robots, compared to regular rigid robots, as their multiple segments with soft materials bring flexibility and compliance, have the advantages of safe interaction and dexterous operation in the environment. However, due to its characteristics of high dimensional, nonlinearity, time-varying nature, and infinite degree of freedom, it has been challenges in achieving precise and dynamic control such as trajectory tracking and position reaching. To address these challenges, we propose a framework of Deep Koopman-based Model Predictive Control (DK-MPC) for handling multi-segment soft robots. We first employ a deep learning approach with sampling data to approximate the Koopman operator, which therefore linearizes the high-dimensional nonlinear dynamics of the soft robots into a finite-dimensional linear representation. Secondly, this linearized model is utilized within a model predictive control framework to compute optimal control inputs that minimize the tracking error between the desired and actual state trajectories. The real-world experiments on the soft robot “Chordata” demonstrate that DK-MPC could achieve highprecision control, showing the potential of DK-MPC for future applications to soft robots. More visualization results can be found at https://pinkmoon-io.github.io/DKMPC/.

Abstract:
This research reports VascularPilot3D, the first 3D fully autonomous endovascular robot navigation system. As an exploration toward autonomous guidewire navigation, VascularPilot3D is developed as a complete navigation system based on intra-operative imaging systems (fluoroscopic X-ray in this study) and typical endovascular robots. VascularPilot3D adopts previously researched fast 3D-2D vessel registration algorithms and guidewire segmentation methods as its perception modules. We additionally propose three modules: a topologyconstrained 2D-3D instrument end-point lifting method, a treebased fast path planning algorithm, and a prior-free endovascular navigation strategy. VascularPilot3D is compatible with most mainstream endovascular robots. Ex-vivo experiments validate that VascularPilot3D achieves 100 % success rate among 25 trials. It reduces the human surgeon's overall control loops by 18.38 %. VascularPilot3D is promising for general clinical autonomous endovascular navigation.

Abstract:
Motor imagery (MI) classification in rehabilitation brain-computer interfaces (RBCIs) faces significant challenges due to the variability of electroencephalography (EEG) signals across subjects. Existing methods typically require extensive EEG data collection from each new subject, which is time-consuming and results in poor user experience. To address this issue, this paper decompose MI-EEG into subject-specific private components and shared components common across all subjects, and propose a plug-and-play domain fusion adaptive method (PPMDFA) to handle variability between subjects. In the training phase, PPMDFA introduces a Multi-Domain Fusion Graph Convolutional Network (MDFGCN) module to extract shared and private features from the MI processes of source domain subjects. In the calibration phase, the method constructs private classifiers for the target new subject using the extracted shared features combined with a small amount of labeled data. During testing, PPMDFA leverages the similarity of private components to utilize knowledge from source subjects, thereby enhancing classification accuracy for target subjects' MI. We validated the proposed method on the PhysioNet and LLMBCImotion datasets. Experimental results show that PPMDFA achieves state-of-the-art classification accuracy on both datasets, with rapid adaptation to new subjects using only 20% of the data, reaching accuracies of 73.33% and 61.62%, demonstrating strong generalization ability and robustness.

Abstract:
Building generalist robotic systems involves effectively endowing robots with the capabilities to handle novel objects in an open-world setting. Inspired by the advances of large pre-trained models, we propose Keypoint Affordance Learning from Imagined Environments (KALIE), which adapts pre-trained Vision Language Models (VLMs) for robotic control in a scalable manner. Instead of directly producing motor commands, KALIE controls the robot by predicting pointbased affordance representations based on natural language instructions and visual observations of the scene. The VLM is trained on 2D images with affordances labeled by humans, bypassing the need for training data collected on robotic systems. Through an affordance-aware data synthesis pipeline, KALIE automatically creates massive high-quality training data based on limited example data manually collected by humans. We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points. Compared to baselines using pre-trained VLMs, our approach consistently achieves superior performance.

Abstract:
Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models are limited to reasoning over objects and actions currently visible on the image plane. We present a spatial extension to the VLM, which leverages spatially-localized egocentric video demonstrations to augment VLMs in two ways - through understanding spatial task-affordances, i.e. where an agent must be for the task to physically take place, and the localization of that task relative to the egocentric viewer. We show our approach outperforms the baseline of using a VLM to map similarity of a task's description over a set of location-tagged images. Our approach has less error both on predicting where a task may take place and on predicting what tasks are likely to happen at the current location. The resulting representation will enable robots to use egocentric sensing to navigate to, or around, physical regions of interest for novel tasks specified in natural language.

Abstract:
This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLM) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM model, dubbed QUARTOnline, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference at 50 Hz in sync with the underlying controller frequency, significantly boosting the success rate across various tasks by 65 %. Our project page is https://quart-online.github.io.

Abstract:
We demonstrate the ability of large language models (LLMs) to perform iterative self-improvement of robot policies. An important insight of this paper is that LLMs have a built-in ability to perform (stochastic) numerical optimization and that this property can be leveraged for explainable robot policy search. Based on this insight, we introduce the SAS Prompt (Summarize, Analyze, Synthesize) – a single prompt that enables iterative learning and adaptation of robot behavior by combining the LLM's ability to retrieve, reason and optimize over previous robot traces in order to synthesize new, unseen behavior. Our approach can be regarded as an early example of a new family of explainable policy search methods that are entirely implemented within an LLM. We evaluate our approach both in simulation and on a real-robot table tennis task. Project website: sites.google.com/asu.edu/sas-llm/

Abstract:
In the domain of robotics, achieving Lifelong Open-ended Learning Autonomy (LOLA) represents a significant milestone, especially in contexts where autonomous agents must adapt to unforeseen environmental variations and evolving objectives. This paper introduces VISOR (VisionSimilarity for Open-ended Robotic exploration), a vision-based framework designed to assist robotic agents in autonomously exploring and learning from new environments and objects, whether through guided or random exploration, without reliance on predefined design considerations. In that direction, VISOR acts as a perception mediator, classifying everything a robot encounters in a scene as either known or unknown. It further identifies potential distractors (e.g., background elements), known categories, or objects specified through text seeds. By leveraging recent advancements in vision foundation models, VISOR operates in a training-free manner. It begins by segmenting a scene into its constituent entities, regardless of familiarity, and then extracts robust visual representations for each one. These representations are compared against an adaptive memory system that evolves over time; unknown objects are assigned unique IDs and added to this memory as new classes, enriching the robot's understanding of its environment. We argue that this evolving memory can facilitate guided exploration through prior knowledge, enhancing the efficiency of robotic exploration, and validate this by designing two exploration scenarios and running both simulated and real-world experiments.

Abstract:
We describe a robust planning method for autonomous driving that mixes normal and adversarial agent predictions output by a diffusion model trained for motion prediction. We first train a diffusion model to learn an unbiased distribution of normal agent behaviors. We then generate a distribution of adversarial predictions by biasing the diffusion model at test time to generate predictions that are likely to collide with a candidate plan. We score plans using expected cost with respect to a mixture distribution of normal and adversarial predictions, leading to a planner that is robust against adversarial behaviors but not overly conservative when agents behave normally. Unlike current approaches, we do not use risk measures that over-weight adversarial behaviors while placing little to no weight on low-cost normal behaviors or use hard safety constraints that may not be appropriate for all driving scenarios. We show the effectiveness of our method on single-agent and multi-agent jaywalking scenarios as well as a red light violation scenario.

Abstract:
Multi-agent systems can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. Coordination of such teams often involves two aspects: (i) selecting appropriate subteams for different tasks in various areas; (ii) coordinating agents in the subteams to execute the associated subtasks. Existing work often assumes that the tasks are static and known beforehand, where an integer program can be formulated and solved offline. However, in many applications, the team-wise tasks are generated online continually by external requests; and the amount of subtasks within each task is uncertain (e.g., the number of packages to deliver, and victims to rescue). The aforementioned offline solution becomes inadequate as it would require constant re-computation for the whole team and global communication to broadcast the results. Thus, this work tackles the large-scale coordination problem under continual and uncertain temporal tasks, specified as temporal logic formulas over collaborative actions. The proposed hierarchical framework (HULK) consists of two interleaved layers: the rolling assignment of currently-known tasks to subteams within a certain horizon, and the dynamic coordination within a sub-team given the detected subtasks during online execution. Thus, the coordination is performed hierarchically at different granularities and triggering conditions, to improve the computational efficiency and robustness. It is validated rigorously over large-scale heterogeneous systems under various temporal tasks and environment uncertainties.

Abstract:
Leveraging the powerful reasoning capabilities of large language models (LLMs), recent LLM-based robot task planning methods yield promising results. However, they mainly focus on single or multiple homogeneous robots on simple tasks. Practically, complex long-horizon tasks always require collaboration among multiple heterogeneous robots especially with more complex action spaces, which makes these tasks more challenging. To this end, we propose COHERENT, a novel LLM-based task planning framework for collaboration of heterogeneous multi-robot systems including quadrotors, robotic dogs, and robotic arms. Specifically, a Proposal-Execution-Feedback-Adjustment (PEFA) mechanism is designed to decompose and assign actions for individual robots, where a centralized task assigner makes a task planning proposal to decompose the complex task into subtasks, and then assigns subtasks to robot executors. Each robot executor selects a feasible action to implement the assigned subtask and reports self-reflection feedback to the task assigner for plan adjustment. The PEFA loops until the task is completed. Moreover, we create a challenging heterogeneous multi-robot task planning benchmark encompassing 100 complex long-horizon tasks. The experimental results show that our work surpasses the previous methods by a large margin in terms of success rate and execution efficiency. The experimental videos, code, and benchmark are released at https://github.com/MrKeee/COHERENT.

Abstract:
The successful deployment of deep learning-based techniques for autonomous systems is highly dependent on the data availability for the respective system in its deployment environment. Especially for unstructured outdoor environments, very few datasets exist for even fewer robotic platforms and scenarios. In an earlier work, we presented the German Outdoor and Offroad Dataset (GOOSE) framework along with 10000 multimodal frames from an offroad vehicle to enhance the perception capabilities in unstructured environments. In this work, we address the generalizability of the GOOSE framework. To accomplish this, we open-source the GOOSE-Ex dataset, which contains additional 5000 labeled multimodal frames from various completely different environments, recorded on a robotic excavator and a quadruped platform. We perform a comprehensive analysis of the semantic segmentation performance on different platforms and sensor modalities in unseen environments. In addition, we demonstrate how the combined datasets can be utilized for different downstream applications or competitions such as offroad navigation, object manipulation or scene completion. The dataset, its platform documentation and pre-trained state-of-the-art models for offroad perception will be made available on https://goose-dataset.de/.

Abstract:
This paper studies sand profile grading, a manipulation task to obtain a desired geometric curve in sand. Manipulating sand is challenging because like other amorphous materials, its properties are difficult to estimate and emergent effects such as collapses may occur which both influence the manipulation outcome. To tackle these challenges, humans iterate and adapt their manual actions to the observed material states. In this paper, we propose to replicate this adaptive and iterative approach on a robotic profile grading task. Our results demonstrate that (1) tool insertion adaptation reduces force limit violations during tool-material interactions, (2) grading angle adaptation ensures no undercutting or collisions while allowing for cutting or smoothing the sand profile, and (3) adapting progress speed to task evolution provides a balance between grading precision and execution time. This paper's findings pave the way for generalized and transferable robotic systems manipulating various amorphous materials and automating a larger set of construction tasks and beyond.

Abstract:
As autonomous robots become integrated into society, they must socially navigate around humans. We propose that effective social robot navigation relies on three key principles: social norms, perceived safety, and legibility. Our framework, Overlapping Social Navigation Principles, suggests that the strength of each principle is influenced by the presence of other principles. To test our framework, we implemented SRN behaviors on an autonomous robot in a passing scenario and conducted an online study where participants ranked videos of different SRN behavior combinations. Our findings show that incorporating all three principles enhances SRN, with social norms having the greatest impact.

Abstract:
Existing sensor-based global localization methods limit the miniaturization potential of magnetically-actuated endoscopes (MAE) while localization based on external medical imaging demands accurate registration and imposes a variety of modality-specific challenges during continuous image acquisition. This work proposes a novel self-sufficient method for discrete (one-time) global localization of an MAE based solely on inherent endoscopic images without any prior MAE pose information. More specifically, it adopts a model-free control approach to determine five different external magnet (EM) poses (corresponding to five independent nonlinear equations) that can align the MAE image center with the lumen center while the MAE maintains the same pose. The five degree-of-freedom (DoF) global pose of the MAE can then be estimated by minimizing the root mean square of MAE's torque balance residuals under these EM poses. Our proposed method achieves similar accuracy as other sensor-based methods for permanent magnet-driven MAE with \mathbf6.7 \pm \mathbf2.1 mm position error and \mathbf9.5 \pm \mathbf2.9^\circ orientation error in the experiments. Compared to existing methods, our approach does not require physical sensor integration, enabling a more compact endoscope design for exploration in narrower respiratory tracts. It also offers a critical step toward achieving sensorless and continuous global localization of the permanent magnet-driven MAE during its autonomous navigation.

Abstract:
The inherent stochasticity of the agents' behavior presents a challenge to trajectory prediction models, which are required to generate multiple plausible future trajectories. Recently, diffusion models have been applied to implement multimodal trajectory prediction. Existing approaches typically employ a standard diffusion process, denoising from a sample drawn from a Gaussian distribution. However, we identify that most agents exhibit an obvious movement trend, rendering many initial denoising steps redundant-primarily transitioning from pure noise to an initial coarse trajectory. To conquer this challenge, this paper innovatively proposes a diffusion refiner that can be used along with existing multi-agent trajectory prediction models to improve their performance. Specifically, we first leverage a baseline model for predicting the coarse future trajectory. Then, the diffusion model is applied as a refiner to reduce the prediction error. Moreover, our method is naturally plug-and-play, allowing convenient integration with existing models. To achieve this, we improve the traditional diffusion process to not only converge towards noise but also the coarse predictions from the baseline model. In such a case, standard step-skipping sampling techniques is inapplicable and we further propose an ordinary differential equation (ODE)-based fast sampling method. Extensive experiments with selected baseline models demonstrate the effectiveness of our approach.

Abstract:
We consider the setting where a robot must complete a sequence of tasks in a persistent large-scale environment, given one at a time. Existing task planners often operate myopically, focusing solely on immediate goals without considering the impact of current actions on future tasks. Anticipatory planning, which reduces the joint objective of the immediate planning cost of the current task and the expected cost associated with future subsequent tasks, offers an approach for improving long-lived task planning. However, applying anticipatory planning in large-scale environments presents significant challenges due to the sheer number of assets involved, which strains the scalability of learning and planning. In this research, we introduce a model-based anticipatory task planning framework designed to scale to large-scale realistic environments. Our framework uses a graph neural network (GNN) in particular via a representation inspired by a 3D scene graph to learn the essential properties of the environment crucial to estimating the state's expected cost and a samplingbased procedure for practical large-scale anticipatory planning. Our experimental results show that our planner reduces the cost of task sequence by \mathbf5. 3 8 % in home and \mathbf3 1. 5 % in restaurant settings. If given time to prepare in advance using our model reduces task sequence costs by \mathbf4 0. 6 % and \mathbf4 2. 5 %, respectively.

Abstract:
3D semantic occupancy prediction networks have demonstrated remarkable capabilities in reconstructing the geometric and semantic structure of 3D scenes, providing crucial information for robot navigation and autonomous driving systems. However, due to their large overhead from dense network structure designs, existing networks face challenges balancing accuracy and latency. In this paper, we introduce OccRWKV, an efficient semantic occupancy network inspired by Receptance Weighted Key Value (RWKV). OccRWKV separates semantics, occupancy prediction, and feature fusion into distinct branches, each incorporating Sem-RWKV and Geo-RWKV blocks. These blocks are designed to capture long-range dependencies, enabling the network to learn domain-specific representation (i.e., semantics and geometry), which enhances prediction accuracy. Leveraging the sparse nature of real-world 3D occupancy, we reduce computational overhead by projecting features into the bird's-eye view (BEV) space and propose a BEV-RWKV block for efficient feature enhancement and fusion. This enables real-time inference at 22.2 FPS without compromising performance. Experiments demonstrate that OccRWKV outperforms the state-of-the-art methods on the SemanticKITTI dataset, achieving a mIoU of 25.1 while being 20 times faster than the best baseline, Co-Occ, making it suitable for real-time deployment on robots to enhance autonomous navigation efficiency. Code and video are available on our project page: https://jmwang0117.github.io/OccRWKV/.

Abstract:
Monocular video depth estimation is a key challenge in computer vision, highlighting its importance in visual understanding. Monocular depth estimation models trained on single images achieve impressive results on individual frames but often lack temporal consistency when applied to videos, leading to flickering and artifacts. Current video depth estimation methods often rely on additional optical flow or camera poses, which are limited by their accuracy, complex design, and lack robustness. Specially, we propose a plug-and-play method that seamlessly transfers the robustness of image depth estimation to video depth estimation. By leveraging powerful priors from image depth estimation, our method enhances the performance of video depth estimation without requiring additional conditional inputs or extensive pretraining on large and expensive video datasets. We introduce the Temporal Depth Stabilization Module (TDSM), which can seamlessly inflate an image monocular depth estimation model into a video depth estimation model, enabling unified modeling of depth across video sequences and capturing the temporal cues in video. We validate the effectiveness and efficiency of our method across various datasets (e.g., normal and challenging conditions) and different backbones. Extensive experiments demonstrate that our simple and effective method significantly improves monocular depth estimation networks, achieving new state-of-the-art accuracy in both spatial and temporal dimensions.

Abstract:
Contact-rich manipulation tasks with stiff frictional elements, like connector insertion, are difficult to model with rigid-body simulators. In this work, we propose a new approach for modeling these environments by learning a quasistatic contact force model instead of a full simulator. Using a feature vector that contains information about the configuration and control, we find a linear mapping adequately captures the relationship between this feature vector and the sensed contact forces. A novel Linear Model Learning (LML) algorithm is used to solve for the globally optimal mapping in real time without any matrix inversions, resulting in an algorithm that runs in nearly constant time on a GPU as the model size increases. We validate the proposed approach for connector insertion in both simulation and hardware experiments, where the learned model is combined with an optimizationbased impedance controller to achieve smooth insertions in the presence of misalignments and uncertainty. Our website featuring videos, code, and more materials is available at https://model-based-plugging.github.io/.

Abstract:
Flying quadrotors in tight formations is a challenging problem. It is known that in the near-field airflow of a quadrotor, the aerodynamic effects induced by the propellers are complex and difficult to characterize. Although machine learning tools can potentially be used to derive models that capture these effects, these data-driven approaches can be sample inefficient and the resulting models often do not generalize as well as their first-principles counterparts. In this work, we propose a framework that combines the benefits of first-principles modeling and data-driven approaches to construct an accurate and sample efficient representation of the complex aerodynamic effects resulting from quadrotors flying in formation. The data-driven component within our model is lightweight, making it amenable for optimization-based control design. Through simulations and physical experiments, we show that incorporating the model into a novel learning-based nonlinear model predictive control (MPC) framework results in substantial performance improvements in terms of trajectory tracking and disturbance rejection. In particular, our framework significantly outperforms nominal MPC in physical experiments, achieving a 40.1% improvement in the average trajectory tracking errors and a 57.5% reduction in the maximum vertical separation errors. Our framework also achieves exceptional sample efficiency, using only a total of 46 seconds of flight data for training across both simulations and physical experiments. Furthermore, with our proposed framework, the quadrotors achieve an exceptionally tight formation, flying with an average separation of less than 1.5 body lengths throughout the flight.

Abstract:
Recent advances in large language models (LLMs) have demonstrated exceptional reasoning capabilities in natural language processing, sparking interest in applying LLMs to task planning problems in robotics. Most studies focused on task planning for clear natural language commands that specify target objects and their locations. However, for more user-friendly task execution, it is crucial for robots to autonomously plan and carry out tasks based on abstract natural language commands that may not explicitly mention target objects or locations, such as ‘Put the food ingredients in the same place.’ In this study, we propose an LLM-based autonomous task planning framework that generates task plans for abstract natural language commands. This framework consists of two phases: an environment recognition phase and a task planning phase. In the environment recognition phase, a large vision-language model generates a hierarchical scene graph that captures the relationships between objects and spaces in the environment surrounding a robot agent. During the task planning phase, an LLM uses the scene graph and the abstract user command to formulate a plan for the given task. We validate the effectiveness of the proposed framework in the AI2THOR simulation environment, demonstrating its superior performance in task execution when handling abstract commands.

Abstract:
In robot task planning, large language models (LLMs) have shown significant promise in generating complex and long-horizon action sequences. However, it is observed that LLMs often produce responses that sound plausible but are not accurate. To address these problems, existing methods typically employ predefined error sets or external knowledge sources, requiring human efforts and computation resources. Recently, self-correction approaches have emerged, where LLM generates and refines plans, identifying errors by itself. Despite their effectiveness, they are more prone to failures in correction due to insufficient reasoning. In this paper, we introduce InversePrompt, a novel self-corrective task planning approach that leverages inverse prompting to enhance interpretability. Our method incorporates reasoning steps to provide clear, interpretable feedback. It generates inverse actions corresponding to the initially generated actions and verifies whether these inverse actions can restore the system to its original state, explicitly validating the logical coherence of the generated plans. The results on benchmark datasets show an average 16.3% higher success rate over existing LLM-based task planning methods. Our approach offers clearer justifications for feedback in real-world environments, resulting in more successful task completion than existing self-correction approaches across various scenarios.

Abstract:
This paper introduces and tests a framework integrating traffic regulation compliance into automated driving systems (ADS). The framework enables ADS to follow traffic laws and make informed decisions based on the driving environment. Using RGB camera inputs and a vision-language model (VLM), the system generates descriptive text to support a regulation-aware decision-making process, ensuring legal and safe driving practices. This information is combined with a machine-readable ADS regulation database to guide future driving plans within legal constraints. Key features include: 1) a regulation database supporting ADS decision-making, 2) an automated process using sensor input for regulation-aware path planning, and 3) validation in both simulated and real-world environments. Particularly, the real-world vehicle tests not only assess the framework's performance but also evaluate the potential and challenges of VLMs to solve complex driving problems by integrating detection, reasoning, and planning. This work enhances the legality, safety, and public trust in ADS, representing a significant step forward in the field.

Abstract:
Real-time SLAM with dense 3D mapping is computationally challenging, especially on resource-limited devices. The recent development of 3D Gaussian Splatting (3DGS) offers a promising approach for real-time dense 3D reconstruction. However, existing 3DGS-based SLAM systems struggle to balance hardware simplicity, speed, and map quality. Most systems excel in one or two of the aforementioned aspects but rarely achieve all. A key issue is the difficulty of initializing 3D Gaussians while concurrently conducting SLAM. To address these challenges, we present Monocular GSO (MGSO), a novel real-time SLAM system that integrates photometric SLAM with 3DGS. Photometric SLAM provides dense structured point clouds for 3DGS initialization, accelerating optimization and producing more efficient maps with fewer Gaussians. As a result, experiments show that our system generates reconstructions with a balance of quality, memory efficiency, and speed that outperforms the state-of-the-art. Furthermore, our system achieves all results using RGB inputs. We evaluate the Replica, TUM-RGBD, and EuRoC datasets against current live dense reconstruction systems. Not only do we surpass contemporary systems, but experiments also show that we maintain our performance on laptop hardware, making it a practical solution for robotics, A / R, and other real-time applications.

Abstract:
Visual SLAM has regained attention due to its ability to provide perceptual capabilities and simulation test data for Embodied AI. However, traditional SLAM methods struggle to meet the demands of high-quality scene reconstruction, and Gaussian SLAM systems, despite their rapid rendering and high-quality mapping capabilities, lack effective pose optimization methods and face challenges in geometric reconstruction. To address these issues, we introduce FGO-SLAM, a Gaussian SLAM system that employs an opacity radiance field as the scene representation to enhance geometric mapping performance. After initial pose estimation, we apply global adjustment to optimize camera poses and sparse point cloud, ensuring robust tracking of our approach. Additionally, we maintain a globally consistent opacity radiance field based on 3D Gaussians and introduce depth distortion and normal consistency terms to refine the scene representation. Further-more, after constructing tetrahedral grids, we identify level sets to directly extract surfaces from 3D Gaussians. Results across various real-world and large-scale synthetic datasets demonstrate that our method achieves state-of-the-art tracking accuracy and mapping performance.

Abstract:
Developing versatile quadruped robots that can smoothly perform various actions and tasks in real-world environments remains a significant challenge. This paper introduces a novel vision-language-action (VLA) model, mixture of robotic experts (MoRE), for quadruped robots that aim to introduce reinforcement learning (RL) for fine-tuning large-scale VLA models with a large amount of mixed-quality data. MoRE integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model (MLLM), forming a sparse-activated mixture-of-experts model. This design enables the model to effectively adapt to a wide array of downstream tasks. Moreover, we employ a reinforcement learning-based training objective to train our model as a Q-function after deeply exploring the structural properties of our tasks. Effective learning from automatically collected mixed-quality data enhances data efficiency and model performance. Extensive experiments demonstrate that MoRE outperforms all baselines across six different skills and exhibits superior generalization capabilities in out-of-distribution scenarios. We further validate our method in real-world scenarios, confirming the practicality of our approach and laying a solid foundation for future research on multi-task learning in quadruped robots.

Abstract:
Scaling social robot studies is constrained due to the need for human interaction, making large participant recruitment impractical. Robotics simulators help mitigate this limitation but generally lack the realism to accurately simulate social cues. We introduce a cognitive robotic simulation scheme to evaluate social attention models in physical environments. By projecting ground-truth priority maps to a simulated environment, we can directly compare predicted maps using common saliency metrics. Using the iCub robot, we assess a dynamic scanpath model that predicts attention targets, simulating human scanpaths. Evaluations with the FindWho and MVVA datasets show strong correlations between robotcaptured metrics and direct-streamed video metrics. Our results indicate robustness of the social attention model to noise and real-world conditions, suggesting its practical usability for predicting personalized scanpaths in real settings. This approach reduces the need for extensive human-robot interaction studies in the early stages of study design, enabling the scalability and reproducibility of social robot evaluations.

Abstract:
Identifying a brain signal that enables the detection of incorrect execution in human-robot interaction (HRI) is considered a holy grail for real-time systems. A major challenge in achieving this is the inherent imbalance caused by the sparsity of error-related potential (ErrP) events in streaming electroencephalogram (EEG) data, which often leads models to learn irrelevant features and perform poorly. Thus, while deep learning-based ErrP detection has seen considerable advancements, the variability in individual user reaction times introduces labeling errors, complicating model adaptation to new subjects. Moreover, most deep learning methods are developed and validated on discrete, offline experiments using pre-defined windows, which fail to translate effectively to continuous, real-time HRI. Addressing these challenges is crucial to improving the robustness and adaptability of real-time ErrP detection in practical HRI applications. Here, we develop a causal EEG conformer framework, combining a convolutional neural network (CNN) encoder and a transformer with causal attention for real-time prediction of ErrP signals during HRI. We evaluated our ErrP model in a pseudo-online environment in both inter-session and inter-subject cross-validation settings for exoskeleton assistive robotics. Our model demonstrated superior performance in decoding accuracy and efficiency, showcasing better generalization for real-world dynamic HRI applications.

Abstract:
We present VisFly, a quadrotor simulator designed to efficiently train vision-based flight policies using reinforcement learning algorithms. VisFly offers a user-friendly framework and interfaces, leveraging Habitat-Sim's rendering engines to achieve frame rates exceeding 10,000 frames per second for rendering motion and sensor data. The simulator incorporates differentiable physics and is seamlessly wrapped with the Gym environment, facilitating the straightforward implementation of various learning algorithms. It supports the directly importing open-source scene datasets compatible with Habitat-Sim, enabling training on diverse real-world environments simultaneously. To validate our simulator, we also make three reinforcement learning examples for typical flight tasks relying on visual observations. The simulator is now available at [https://github.com/SJTU-ViSYS-team/VisFly].

Abstract:
This paper presents a unified approach to realize versatile distributed maneuvering with generalized formations. Specifically, we decompose the robots' maneuvers into two independent components, i.e., interception and enclosing, which are parameterized by two independent virtual coordinates. Treating these two virtual coordinates as dimensions of an abstract manifold, we derive the corresponding singularity-free guiding vector field (GVF), which, along with a distributed coordination mechanism based on the consensus theory, guides robots to achieve various motions (i.e., versatile maneuvering), including (a) formation tracking, (b) target enclosing, and (c) circumnavigation. Additional motion parameters can generate more complex cooperative robot motions. Based on GVFs, we design a controller for a nonholonomic robot model. Besides the theoretical results, extensive simulations and experiments are performed to validate the effectiveness of the approach.

Abstract:
Coordinated multi-agent robotic construction provides a means to build infrastructure in extreme environments and improve efficiency in high performance applications. Planning methods are key to understanding and achieving the scope of such applications, and are typically tailored to specific models of construction material and a consideration of passivity or activity thereof. Here, we focus on the NASA Automated Reconfigurable Mission Adaptive Digital Assembly Systems (ARMADAS) model, which includes passive lightweight structural modules and small robots that traverse the structure. We present an algorithm for calculating a build plan for robots under the constraints of this type of system. We then evaluate the quality of this plan experimentally. Many of the techniques we use can be applied to any robotic assembly system whose robots perform locomotion over the structure that they are building.

Abstract:
Research into robotics applications of deep reinforcement learning (DRL) has increasingly been focussed on learning precise object manipulation and trajectory planning. Extending these tasks to continuous robot-object interactions with the surface of complex geometries remains an open problem. In this paper we investigate end-to-end DRL solutions for depowdering tasks that work by directing a pressurized air stream onto the object's surfaces using a blast nozzle head mounted on a robotic arm. We develop a GPU accelerated vectorized cleaning effect for integration into RL training and consider ways to expose vision-less trajectory synthesis for surface treatment applications to the RL agent based on UV mapping. Our experimental evaluation demonstrates that DRL has the potential to be used for generating object-specific agents for depowdering tasks on a variety of 3D objects without requiring intermediate path planners even in a full 3D motion setup. Finally, we show that DRL-generated trajectories can be transferred to a real-world setup. Our task formulation lends itself to approximate a wide range of surface treatment applications (e.g., cleaning and spray painting) with various effects.

Abstract:
Self-balancing lower limb rehabilitation exoskeletons (SLLREs) allow individuals with lower limb dysfunction to walk without the use of crutches. Stable and human-like walking motions are crucial for SLLREs because achieving a close imitation of healthy human walking is a key goal in rehabilitation therapy. Existing SLLREs can realize stable walking but lack human-like features such as knee-stretched, heel-strike and toe-off. This paper designs a walking motion generator based on hierarchical optimization to generate a human-like walking motion with variable hip height, heelstrike, toe-off, and knee-stretched features. This generator consists of a knee-stretched optimizer and a stabilizing filter. Specifically, the knee-stretched optimizer realizes the stretched knee feature by optimizing the hip trajectory with varying heights. And the stabilizing filter realizes stable walking by optimizing the hip trajectory in the sagittal plane direction. To validate the effectiveness of the proposed human-like walking motion generator, walking experiments were conducted on SLLRE AutoLEE-G3 both in a simulation environment and the real world. The experimental results show that the humanlike walking motions look more natural and reduce the required torque for the knee joint compared with knee-bent walking.

Abstract:
Articulated continuum robots (ACRs) are characterized by flexibility, controllability, and adaptability and perform excellently in complex and constrained environments. However, the large number of motor drives limit the ACRs' portability and make them cumbersome to control. This paper presents a novel tendon-driven ACR composed of stabilized self-locking joints (SLJs) connected in series. After triggering the mechanical constraints with shape memory alloy coils, each joint can be maintained in either a self-locking or release state with zero power consumption. Consequently, even with a single set of drive units, the ACR can operate in multiple modes, enabling variable motion performance and workspace adaptability, effectively reducing the number of motors. The ACR's stiffness also varies with the locking state of its SLJs, and no motor drive is required to maintain its shape when all SLJs are self-locking. The performance and reliability of the SLJ prototype were validated. The workspace of the ACR prototype model was analyzed, and its partial motion performance, motion error, and variable stiffness were verified.

Abstract:
Tensegrity structures have been widely explored for their lightweight, high-stiffness, and foldable properties. These unique characteristics have enabled their application in various fields including robotics. Tensegrity robots have demonstrated diverse locomotion modes offering versatile solutions for navigation in complex environments. Recent efforts in bio-inspired robotics have led to designs mimicking the movement of natural organisms, such as earthworms. However, existing designs, particularly those utilizing motor-pulley mechanisms for robot body contraction, face significant challenges due to their bulky actuation systems that reduce locomotion efficiency. This paper introduces a novel tensegrity robot, “Tensiworm,” inspired by the peristaltic locomotion of an earthworm. Composed of three icosahedron tensegrity unit cells connected in series, the Tensiworm robot employs a sequential contraction and relaxation mechanism driven by active cable members made of shape memory actuators. This innovative design achieves a 59.13% folding ratio and weighs only 46.9 grams. The robot can travel a distance equal to its body length in approximately ten cycles with an average speed of 10.01 mm per minute. Furthermore, the use of thinner, flexible structural members broadens possibilities for development of millimeter-scale tensegrity robots, which hold significant potential for biomedical applications, including in-vivo testing and targeted drug delivery.

Abstract:
The capability of autonomous exploration in complex, unknown environments is important in many robotic applications. While recent research on autonomous exploration have achieved much progress, there are still limitations, e.g., existing methods relying on greedy heuristics or optimal path planning are often hindered by repetitive paths and high computational demands. To address such limitations, we propose a novel exploration framework that utilizes the global topology information of observed environment to improve exploration efficiency while reducing computational overhead. Specifically, global information is utilized based on a skeletal topological graph representation of the environment geometry. We first propose an incremental skeleton extraction method based on wavefront propagation, based on which we then design an approach to generate a lightweight topological graph that can effectively capture the environment's structural characteristics. Building upon this, we introduce a finite state machine that leverages the topological structure to efficiently plan coverage paths, which can substantially mitigate the back-and-forth maneuvers (BFMs) problem. Experimental results demonstrate the superiority of our method in comparison with state-of-theart methods. The source code will be made publicly available at: https://github.com/Haochen-Niu/STGPlanner.

Abstract:
Stable and robust robotic grasping is essential for current and future robot applications. In recent works, the use of large datasets and supervised learning has enhanced speed and precision in antipodal grasping. However, these methods struggle with perception and calibration errors due to large planning horizons. To obtain more robust and reactive grasping motions, leveraging reinforcement learning combined with tactile sensing is a promising direction. Yet, there is no systematic evaluation of how the complexity of force-based tactile sensing affects the learning behavior for grasping tasks. This paper compares various tactile and environmental setups using two model-free reinforcement learning approaches for antipodal grasping. Our findings suggest that under imperfect visual perception, various tactile features improve learning outcomes, while complex tactile inputs complicate training.

Abstract:
Visual and tactile pretraining have been extensively studied in dexterous robot manipulation tasks. However, existing methods typically require the simultaneous acquisition of visual and tactile data, making it difficult to utilize low-cost, unpaired visual-tactile datasets. Moreover, these methods often rely on tactile sensors to provide input data for reinforcement learning (RL) during the physical deployment of robotic dexterous hands, which highly increases deployment costs. To address these challenges, we propose UpViTaL, an unpaired visualtactile self-supervised representation learning method for RLbased robot dexterous manipulation. Specifically, we collect low-cost unpaired visual and tactile datasets for manipulation skill learning using a camera and tactile gloves on three robot manipulation tasks. The temporal tactile self-supervised representation learning module of UpViTaL is used to explore efficient tactile representations from time-series tactile data. In parallel, the visual pretraining module of UpViTaL helps to extract efficient visual representations from visual data. In addition, we fuse unpaired visual-tactile representations through an RL reward mechanism, which does not require robotic dexterous hands tactile sensors for practical deployment. We validate our approach on three dexterous robot manipulation tasks. Experimental results demonstrate that UpViTaL can efficiently learn robot manipulation skills. Compared to existing approaches for visual pretraining, our method significantly improves the success rate by more than 30%.

Abstract:
When humans collaborate, they form positive or negative experiences with each other. These experiences depend on various factors such as the individual's skills, abilities, and agency. In this paper, we consider human-robot collaborations and present a novel model of an autonomous robot's trust in humans based on the probability of the robot having a positive experience with the human. The model defines a dynamic trust-building process that translates into a computationallyaccessible implementation. We hypothesize predictors of a positive experience with human teammates and derive trust in individual humans. As the interactions continue, team members develop an affinity toward each other. The robot's affinity towards humans can be viewed as kinship, and we also investigate how kinship affects trust and distrust. We present an algorithm for how the robot may use kinship-mediated trust in its decision-making, and demonstrate its use in simulated missions truly requiring human-robot collaboration.

Abstract:
Robots powered by large language model (LLM) demonstrate significant research and application potential by effectively interpreting scene information to respond to human commands. However, when robots rely on static scene information during task execution, they face difficulties in adapting to changes in the environment, posing a major challenge for dynamic scene perception. To address the above issues, we propose an innovative interaction-driven approach to enhance robots' ability to perceive dynamic scene information. This approach consists of two contributions, the observation point selection module and the dynamic scene maintenance module. Specifically, first, the robot uses the 3D scene graph (3DSG) containing assets and objects to perceive static scene information through the LLM planner. Next, the best observation point for each asset is obtained through the observation point selection module. Then, with the help of the best observation point, the dynamic scene maintenance module interacts with the asset-related objects to dynamically update all the object node information related to the asset node. This approach enables robots to maintain dynamic scene information, enhancing their adaptability in unpredictable environments and improving task reliability. We evaluated our method using the iTHOR and RoboTHOR datasets within the AI2-THOR simulator and in real-world scenarios. Experimental results demonstrate that our method effectively and accurately maintains robots' perception of dynamic scene information.

Abstract:
Accurate physical simulation is crucial for the development and validation of control algorithms in robotic systems. Recent works in Reinforcement Learning (RL) take notably advantage of extensive simulations to produce efficient robot control. State-of-the-art servo actuator models generally fail at capturing the complex friction dynamics of these systems. This limits the transferability of simulated behaviors to real-world applications. In this work, we present extended friction models that allow to more accurately simulate servo actuator dynamics. We propose a comprehensive analysis of various friction models, present a method for identifying model parameters using recorded trajectories from a pendulum test bench, and demonstrate how these models can be integrated into physics engines. The proposed friction models are validated on four distinct servo actuators and tested on 2R manipulators, showing significant improvements in accuracy over the standard Coulomb-Viscous model. Our results highlight the importance of considering advanced friction effects in the simulation of servo actuators to enhance the realism and reliability of robotic simulations.

Abstract:
This paper presents an optimization-based motion planning methodology for snake robots operating in constrained environments. By using a reduced-order model, the proposed approach simplifies the planning process, enabling the optimizer to autonomously generate gaits while constraining the robot's footprint within tight spaces. The method is validated through high-fidelity simulations that accurately model contact dynamics and the robot's motion. Key locomotion strategies are identified and further demonstrated through hardware experiments, including successful navigation through narrow corridors.

Abstract:
Online optimal control of quadruped robots would enable them to adapt to varying inputs and changing conditions in real time. A common way of achieving this is linear model predictive control (LMPC), where a quadratic programming (QP) problem is formulated over a finite horizon with a quadratic cost and linear constraints obtained by linearizing the equations of motion and solved on the fly. However, the model linearization may lead to model inaccuracies. In this paper, we use the Koopman operator to create a linear model of the quadrupedal system in high dimensional space which preserves the nonlinearity of the equations of motion. Then using LMPC, we demonstrate high fidelity tracking and disturbance rejection on a quadrupedal robot. This is the first work that uses the Koopman operator theory for LMPC of quadrupedal locomotion.

Abstract:
Accurate Relative Pose Estimation (RPE) is critical for effective collaboration of multi-robot systems. Traditional methods using cameras or LiDARs heavily rely on overlapping Fields of View (FoV) between robots, which is highly demanding in practical applications and may hinder collaboration efficiency. To accommodate this issue, we propose Anchorless UWB-Assisted Relative Pose Estimation (AURPE), a novel approach that leverages ultra-wideband (UWB) technology in an anchorless setup to achieve multi-robot RPE without requiring overlapping FoVs or external infrastructure. AURPE first estimates the initial relative poses between robots using inter-robot UWB ranging combined with a Bayesian framework and constrained optimization. During robot operation, AURPE continuously refines the relative poses by integrating UWB measurements with LiDAR-inertial odometry (LIO) and employs a consensus voting mechanism to identify the most reliable pose estimates. Additionally, a pose graph-based backend optimization is incorporated to enhance the accuracy of both initial and real-time relative pose. Extensive simulations and real-world experiments demonstrate that AURPE achieves accurate RPE even in non-overlapping scenarios where traditional methods fail. Compared to state-of-the-art point cloud registration methods, AURPE shows superior performance in both accuracy and robustness, highlighting its potential to significantly enhance cooperative tasks in multi-robot systems operating in complex environments.

Abstract:
Modular Aerial Robotic Systems (MARS) consist of multiple drone units assembled into a single, integrated rigid flying platform. With inherent redundancy, MARS can self-reconfigure into different configurations to mitigate rotor or unit failures and maintain stable flight. However, existing works on MARS self-reconfiguration often overlook the practical controllability of intermediate structures formed during the reassembly process, which limits their applicability. In this paper, we address this gap by considering the control-constrained dynamic model of MARS and proposing a robust and efficient self-reconstruction algorithm that maximizes the controllability margin at each intermediate stage. Specifically, we develop algorithms to compute optimal, controllable disassembly and assembly sequences, enabling robust self-reconfiguration. Finally, we validate our method in several challenging fault-tolerant self-reconfiguration scenarios, demonstrating significant improvements in both controllability and trajectory tracking while reducing the number of assembly steps. The videos and source code of this work are available at https://github.com/RuiHuangNUS/MARS-Reconfig/

Abstract:
C-Ray is an amphibious robot that is capable of swimming in water and crawling on land using its undulating fins, enabling operations in a wide range of environments. The robot can be modeled as a hybrid dynamical system whose dynamics and propulsion change when the robot transitions between water and land. Most importantly, the direction of wave travel in the robot's fins is reversed between its swimming and crawling locomotion styles. To operate autonomously, C-Ray requires both accurate identification of when transitions between water and land occur and robust state estimation in littoral environments where the transition dynamics are highly discontinuous and transient. This paper presents a hybrid observer for estimating continuous states and identifying state-driven mode switches for C-Ray, enabling autonomous water/land-transitions. The proposed observer is a combination of the multiplicative extended Kalman filter (MEKF) and the salted Kalman filter, a newly proposed Kalman filter for mapping state uncertainty during hybrid transitions. We also propose an altitude and sea floor geometry observer and incorporate this directly into the MEKF. The performance is evaluated in simulations.

Abstract:
LiDARs are widely used in autonomous robots due to their ability to provide accurate environment structural information. However, the large size of point clouds poses challenges in terms of data storage and transmission. In this paper, we propose a novel point cloud compression and transmission framework for resource-constrained robotic applications, called RCPCC. We iteratively fit the surface of point clouds with a similar range value and eliminate redundancy through their spatial relationships. Then, we use Shape-adaptive DCT (SA-DCT) to transform the unfit points and reduce the data volume by quantizing the transformed coefficients. We design an adaptive bitrate control strategy based on QoE as the optimization goal to control the quality of the transmitted point cloud. Experiments show that our framework achieves compression rates of 40×to 80× while maintaining high accuracy for downstream applications. our method significantly outperforms other baselines in terms of accuracy when the compression rate exceeds 70×. Furthermore, in situations of reduced communication bandwidth, our adaptive bitrate control strategy demonstrates significant QoE improvements. The code will be available at https://github.com/HITSZ-NRSL/RCPCC.git.

Abstract:
Large language models (LLMs) have shown significant potential in guiding embodied agents to execute language instructions across a range of tasks, including robotic manipulation and navigation. However, existing methods are primarily designed for static environments and do not leverage the agent's own experiences to refine its initial plans. Given that real-world environments are inherently stochastic, initial plans based solely on LLMs' general knowledge may fail to achieve their objectives, unlike in static scenarios. To address this limitation, this study introduces the Experience-and-Emotion Map (E2Map), which integrates not only LLM knowledge but also the agent's real-world experiences, drawing inspiration from human emotional responses. The proposed methodology enables one-shot behavior adjustments by updating the E2Map based on the agent's experiences. Our evaluation in stochastic navigation environments, including both simulations and real-world scenarios, demonstrates that the proposed method significantly enhances performance in stochastic environments compared to existing LLM-based approaches. The code and supplementary materials are available at https://e2map.github.io/.

Abstract:
In logistics applications, the vision-based technology for grasping target objects in the air is relatively mature. However, when operating across the air and water such as grasping marine products from the water, the visual information collected by the camera will be disturbed by ripples and bubbles on the water surface, resulting in low grasping efficiency. Therefore, we introduce a grasping strategy based on single-visual mapping for multi-step (SVMMS) strategy to achieve cross-medium operations involving stacked objects. Specifically, we design a multifunctional integrated Deep Q-learning-based network model to extract visual features from the scene to effectively detect stacked objects and outputs their hierarchical relationships. Moreover, we quantify the underlying relationship between motion logic during action execution and changes in RGB-D during action execution to help the robot achieve efficient and collision-free operations. Our approach also incorporates a time-series design with prioritized experience replay to globally optimize the action sequence. Additionally, we propose a novel sim2real method by combining domain randomization to address the difference in object sizes between the simulation and the real world. Extensive experiments in both simulation and physical environments show that SVMMS-Grasp significantly outperforms existing methods in terms of task success rate, stability, and operational efficiency.

Abstract:
We present a novel formulation of ergodic trajectory optimization that can be specified over general domains using kernel maximum mean discrepancy. Ergodic trajectory optimization is an effective approach that generates coverage paths for problems related to robotic inspection, information gathering problems, and search and rescue. These optimization schemes compel the robot to spend time in a region proportional to the expected utility of visiting that region. Current methods for ergodic trajectory optimization rely on domain-specific knowledge, e.g., a defined utility map, and well-defined spatial basis functions to produce ergodic trajectories. Here, we present a generalization of ergodic trajectory optimization based on maximum mean discrepancy that requires only samples from the search domain. We demonstrate the ability of our approach to produce coverage trajectories on a variety of problem domains including robotic inspection of objects with differential kinematics constraints and on Lie groups without having access to domain specific knowledge. Furthermore, we show favorable computational scaling compared to existing state-of-the-art methods for ergodic trajectory optimization with a trade-off between domain specific knowledge and computational scaling, thus extending the versatility of ergodic coverage on a wider application domain.

Abstract:
Imitation learning in robotics faces significant challenges in generalization due to the complexity of robotic environments and the high cost of data collection. We introduce RoCoDA, a novel method that unifies the concepts of invariance, equivariance, and causality within a single framework to enhance data augmentation for imitation learning. RoCoDA leverages causal invariance by modifying task-irrelevant subsets of the environment state without affecting the policy's output. Simultaneously, we exploit SE (3) equivariance by applying rigid body transformations to object poses and adjusting corresponding actions to generate synthetic demonstrations. We validate RoCoDA through extensive experiments on five robotic manipulation tasks, demonstrating improvements in policy performance, generalization, and sample efficiency compared to state-of-the-art data augmentation methods. Our policies exhibit robust generalization to unseen object poses, textures, and the presence of distractors. Furthermore, we observe emergent behavior such as re-grasping, indicating policies trained with RoCoDA possess a deeper understanding of task dynamics. By leveraging invariance, equivariance, and causality, RoCoDA provides a principled approach to data augmentation in imitation learning, bridging the gap between geometric symmetries and causal reasoning. Project Page: https://rocoda.github.io

Abstract:
Accurate and high-fidelity driving scene reconstruction relies on fully leveraging scene information as conditioning. However, existing approaches, which primarily use 3D bounding boxes and binary maps for foreground and background control, fall short in capturing the complexity of the scene and integrating multi-modal information. In this paper, we propose DualDiff, a dual-branch conditional diffusion model designed to enhance multi-view driving scene generation. We introduce Occupancy Ray Sampling (ORS), a semantic-rich 3D representation, alongside numerical driving scene representation, for comprehensive foreground and background control. To improve cross-modal information integration, we propose a Semantic Fusion Attention (SFA) mechanism that aligns and fuses features across modalities. Furthermore, we design a foreground-aware masked (FGM) loss to enhance the generation of tiny objects. DualDiff achieves state-of-the-art performance in FID score, as well as consistently better results in downstream BEV segmentation and 3D object detection tasks.

Abstract:
Significant progress has been made in openvocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current systems assume a static environment, which limits the system's applicability in realworld scenarios where environments frequently change due to human intervention or the robot's own actions. In this work, we present DynaMem, a new approach to open-world mobile manipulation that uses a dynamic spatio-semantic memory to represent a robot's environment. DynaMem constructs a 3D data structure to maintain a dynamic memory of point clouds, and answers open-vocabulary object localization queries using multimodal LLMs or open-vocabulary features generated by state-of-the-art vision-language models. Powered by DynaMem, our robots can explore novel environments, search for objects not found in memory, and continuously update the memory as objects move, appear, or disappear in the scene. We run extensive experiments on the Stretch SE3 robots in three real and nine offline scenes, and achieve an average pick-and-drop success rate of 70 % on non-stationary objects, which is more than a \mathbf2 × improvement over state-of-the-art static systems.

Abstract:
The ability of neural networks to perform robotic perception and control tasks such as depth and optical flow estimation, simultaneous localization and mapping (SLAM), and automatic control has led to their widespread adoption in recent years. Deep Reinforcement Learning (DeepRL) has been used extensively in these settings, as it does not have the unsustainable training costs associated with supervised learning. However, DeepRL suffers from poor sample efficiency, i.e., it requires a large number of environmental interactions to converge to an acceptable solution. Modern RL algorithms such as Deep Q Learning and Soft Actor-Critic attempt to remedy this shortcoming but can not provide the explainability required in applications such as autonomous robotics. Humans intuitively understand the long-time-horizon sequential tasks common in robotics. Properly using such intuition can make RL policies more explainable while enhancing their sample efficiency. In this work, we propose SHIRE, a novel framework for encoding human intuition using Probabilistic Graphical Models (PGMs) and using it in the Deep RL training pipeline to enhance sample efficiency. Our framework achieves 25-78% sample efficiency gains across the environments we evaluate at negligible overhead cost. Additionally, by teaching RL agents the encoded elementary behavior, SHIRE enhances policy explainability. A real-world demonstration further highlights the efficacy of policies trained using our framework.

Abstract:
Traditional federated learning frameworks, often reliant on deep neural networks, face challenges related to computational demands and privacy risks. In this paper, we present a novel Hyperdimensional (HD) Computing-based federated learning framework designed for resource-constrained mobile robots. Unlike other HD-based learning, our approach introduces dynamic encoding, which improves both model accuracy and privacy by continuously updating hypervector representations. To further address the issue of imbalanced data, especially prevalent in robotics tasks, we propose a hypervector oversampling technique, enhancing model robustness. Extensive evaluations on LiDAR-equipped mobile robots demonstrate that our oversampling method outperforms state-of-the-art HD computing frameworks, achieving up to a 22.9% increase in accuracy while maintaining computational efficiency.

Abstract:
Crack segmentation is pivotal for structural health monitoring, enabling the timely maintenance of critical infrastructure such as bridges and roads. However, existing deep learning models are often too computationally intensive for deployment on resource-constrained devices. To address this limitation, we introduce UltraFastCrackSeg, a lightweight model designed for real-time crack segmentation that effectively balances high accuracy with low computational demands. Featuring an efficient encoder-decoder architecture, our model significantly reduces parameter count and floating-point operations (FLOPs) compared to current methods, as illustrated in Figure 1. We further enhance performance through a self-supervised pretraining approach that employs a novel, task-oriented masking strategy, thereby improving feature extraction. Experiments across multiple datasets demonstrate that UltraFastCrackSeg achieves state-of-the-art Intersection over Union (IoU) and F1 scores while maintaining a compact model size and high inference speed. Evaluations on a low-power CPU device confirm its capability to achieve up to 80 frames per second (FPS) with ONNX runtime optimization, making it highly suitable for real-time, on-site applications. These findings establish UltraFastCrackSeg as a robust and efficient solution for practical crack detection tasks. Code is available at: https://github.com/weiqingq/UltraFastCrackSeg.

Abstract:
Many real-world robotic applications can be formulated as Multi-Agent Path-Finding (MAPF) problems and approximated using Multi-Agent Reinforcement Learning (MARL) algorithms. However, the opaque nature of the blackbox neural network models employed by MARL algorithms has impeded their widespread adoption due to concerns over interpretability, debugging, and user trust. To address these limitations, we propose an interpretable MAPF framework that emulates a group of n path-finding agents optimized through reinforcement learning (RL) using behavior trees (BTs), where n is the number of agents in path-finding scenarios. Expert behavior datasets consisting of state-action trajectories from MARL algorithms are generated, and a knowledge distillation approach is employed to reduce the size of the datasets and extract implicit rules. Additionally, a principled rules factorization technique based on Boolean algebra theory is utilized to prune the behavior rules and create more compact BTs representations. The proposed framework is evaluated on randomly generated MAPF scenarios and demonstrates superior performance compared to conventional BTs generation methods. This paper advances the field of interpretable AI by enabling the extraction of understandable decision-making processes from complex reinforcement learning models in multiagent systems.

Abstract:
Accurate and efficient segmentation of unknown objects in unstructured environments is essential for robotic manipulation. Unknown Object Instance Segmentation (UOIS), which aims to identify all objects in unknown categories and backgrounds, has become a key capability for various robotic tasks. However, existing methods struggle with over-segmentation and under-segmentation, leading to failures in manipulation tasks such as grasping. To address these challenges, we propose QuBER (Quadruple Boundary Error Refinement), a novel error-informed refinement approach for high-quality UOIS. QuBER first estimates quadruple boundary errors-true positive, true negative, false positive, and false negative pixels-at the instance boundaries of the initial segmentation. It then refines the segmentation using an error-guided fusion mechanism, effectively correcting both fine-grained and instance-level segmentation errors. Extensive evaluations on three public benchmarks demonstrate that QuBER outperforms state-of-the-art methods and consistently improves various UOIS methods while maintaining a fast inference time of less than 0.1 seconds. Furthermore, we show that QuBER improves the success rate of grasping target objects in cluttered environments. Code and supplementary materials are available at https://sites.google.com/view/uois-quber.

Abstract:
Behavior Trees (BTs) are increasingly becoming a popular control structure in robotics due to their modularity, reactivity, and robustness. In terms of BT generation methods, BT planning shows promise for generating reliable BTs. However, the scalability of BT planning is often constrained by prolonged planning times in complex scenarios, largely due to a lack of domain knowledge. In contrast, pre-trained Large Language Models (LLMs) have demonstrated task reasoning capabilities across various domains, though the correctness and safety of their planning remain uncertain. This paper proposes integrating BT planning with LLM reasoning, introducing Heuristic Behavior Tree Planning (HBTP)-a reliable and efficient framework for BT generation. The key idea in HBTP is to leverage LLMs for task-specific reasoning to generate a heuristic path, which BT planning can then follow to expand efficiently. We first introduce the heuristic BT expansion process, along with two heuristic variants designed for optimal planning and satisficing planning, respectively. Then, we propose methods to address the inaccuracies of LLM reasoning, including action space pruning and reflective feedback, to further enhance both reasoning accuracy and planning efficiency. Experiments demonstrate the theoretical bounds of HBTP, and results from four datasets confirm its practical effectiveness in everyday service robot applications.

Abstract:
When entering an unfamiliar environment, animals usually sweep off their surroundings to identify points of interest. In search and rescue robotics, autonomous exploration requires both coarse mapping of unknown areas and detailed target detection, which poses a significant challenge in balancing these tasks. To that end, we propose a target-aware robotic exploration framework that prioritizes both exploration efficiency and search completeness through three components: First, considering the computational limitations of robotic platforms, a lightweight 3D target detection method with post-fusion is introduced to detect target positions in real time. Secondly, we propose a target-aware viewpoint generation approach that integrates information gain and inspection gain to identify promising viewpoints for thorough target searches. Lastly, since a detailed examination of the environment demands numerous viewpoints, we propose a heuristic-based active exploration framework that employs a hierarchical structure to optimize exploration gain, traveling distance, and path smoothness to maximize the utility function of viewpoint sequences and ultimately find the optimal path. Extensive simulations and real-world experiments demonstrate our framework significantly enhances target search capabilities, achieving a 13 % average improvement in exploration efficiency over existing methods.

Abstract:
Identifying spatially complete planar primitives from visual data is a crucial task in computer vision. Prior methods are largely restricted to either 2D segment recovery or simplifying 3D structures, even with extensive plane annotations. We present PlanarNeRF, a novel framework capable of detecting dense 3D planes through online learning. Drawing upon the neural field representation, PlanarNeRF brings three major contributions. First, it enhances 3D plane detection with concurrent appearance and geometry knowledge. Second, a lightweight plane fitting module is used to estimate plane parameters. Third, a novel global memory bank structure with an update mechanism is introduced, ensuring consistent cross-frame correspondence. The flexible architecture of PlanarNeRF allows it to function in both 2D-supervised and self-supervised solutions, in each of which it can effectively learn from sparse training signals, significantly improving training efficiency. Through extensive experiments, we demonstrate the effectiveness of PlanarNeRF in various real-world scenarios and remarkable improvement in 3D plane detection over existing works.

Abstract:
Event cameras are bio-inspired sensors that output asynchronous and sparse event streams, instead of fixed frames. Benefiting from their distinct advantages, such as high dynamic range and high temporal resolution, event cameras have been applied to address 3D reconstruction, important for robotic mapping. Recently, neural rendering techniques, such as 3D Gaussian splatting (3DGS), have been shown successful in 3D reconstruction. However, it still remains under-explored how to develop an effective event-based 3DGS pipeline. In particular, as 3DGS typically depends on high-quality initialization and dense multiview constraints, a potential problem appears for the 3DGS optimization with events given its inherent sparse property. To this end, we propose a novel event-based 3DGS framework, named Elite-EvGS. Our key idea is to distill the prior knowledge from the off-the-shelf event-to-video (E2V) models to effectively reconstruct 3D scenes from events in a coarse-to-fine optimization manner. Specifically, to address the complexity of 3DGS initialization from events, we introduce a novel warm-up initialization strategy that optimizes a coarse 3DGS from the frames generated by E2V models and then incorporates events to refine the details. Then, we propose a progressive event supervision strategy that employs the window-slicing operation to progressively reduce the number of events used for supervision. This subtly relives the temporal randomness of the event frames, benefiting the optimization of local textural and global structural details. Experiments on the benchmark datasets demonstrate that Elite-EvGS can reconstruct 3D scenes with better textural and structural details. Meanwhile, our method yields plausible performance on the captured real-world data, including diverse challenging conditions, such as fast motion and low light scenes. For demo and more results, please check our project page.

Abstract:
Hysteresis in the tendons driving continuum robots is frequently regarded as a nuisance and a problem that is best avoided. Some prior work seeks to ameliorate the effects of hysteresis through the selection of materials. Others propose models of hysteresis to compensate for their effects. In this work, we present an empirically validated model of hysteresis in tendon-driven continuum robots. We demonstrate that hysteresis contributes to the stability of these robots by mitigating undesirable tensions in robot's backbone. As a result, a model-based approach to hysteresis can be used not just for compensation of a nuisance, but to enhance the utility of continuum robots in safety critical applications such as medical robots.

Abstract:
Dexterous in-hand manipulation (IHM) for arbitrary objects is challenging due to the rich and subtle contact process. Variable-friction manipulation is an alternative approach to dexterity, previously demonstrating robust and versatile 2D IHM capabilities with only two single-joint fingers. However, the hard-coded manipulation methods for variable friction hands are restricted to regular polygon objects and limited target poses, as well as requiring the policy to be tailored for each object. This paper proposes an end-to-end learning-based manipulation method to achieve arbitrary object manipulation for any target pose on real hardware, with minimal engineering efforts and data collection. The method features a diffusion policy-based imitation learning method with cotraining from simulation and a small amount of real-world data. With the proposed framework, arbitrary objects including polygons and non-polygons can be precisely manipulated to reach arbitrary goal poses within 2 hours of training on an A100 GPU and only 1 hour of real-world data collection. The precision is higher than previous customized object-specific policies, achieving an average success rate of 71.3 % with average pose error being 2.676 mm and 1.902°. Code and videos can be found at: https://sites.google.com/view/vf-ihm-il/home.

Abstract:
Quadratic Programs (QPs) are widely used in the control of walking robots, especially in Model Predictive Control (MPC) and Whole-Body Control (WBC). In both cases, the controller design requires the formulation of a QP and the selection of a suitable QP solver, both requiring considerable time and expertise. While computational performance benchmarks exist for QP solvers, studies comparing optimal combinations of computational hardware (HW), QP formulation, and solver performance are lacking. In this work, we compare dense and sparse QP formulations, and multiple solving methods on different HW architectures, focusing on their computational efficiency in dynamic walking of four-legged robots using MPC. We introduce the Solve Frequency per Watt (SFPW) as a performance measure to enable a cross-hardware comparison of the efficiency of QP solvers. We also benchmark different QP solvers for WBC that we use for trajectory stabilization in quadrupedal walking. As a result, this paper recommends a starting point for practitioners on the selection of QP formulations and solvers for different HW architectures in walking robots and indicates which problems should be devoted the greater technical effort.

Abstract:
This paper proposed M3DSS, a multi-platform, multi-sensor, and multi-scenario dataset for Simultaneous Localization and Mapping (SLAM) systems. Fifty-five sequences were collected from multiple platforms, including a handheld equipment, an unmanned ground vehicle, a quadruped robot, a car, and an unmanned aerial vehicle. Sensors used in M3DSS included two pairs of stereo event cameras with resolutions of 640× 480 and 346× 260, one infrared camera, four RGB cameras, two visual-inertial sensors, four mechanical and one solid-state LiDARs, three inertial measurement units, two global navigation satellite and inertial navigation systems with real-time kinematic signals. 21 various sensors were used on 5 different platforms under various challenging scenarios, including extreme illumination, aggressive motion, low-texture, high-speed driving scenarios, etc. To the best of our knowledge, M3DSS offered the richest event-based sensory information for SLAM up to date. We comprehensively evaluated state-of-the-art SLAM approaches and identified their limitations on M3DSS. Details could be found at https://neufs-ma.github.io/M3DSS.

Abstract:
Humanoid robots have great potential for real-world applications due to their ability to operate in environments built for humans, but their deployment is hindered by the challenge of controlling their underlying high-dimensional nonlinear hybrid dynamics. While reduced-order models like the Hybrid Linear Inverted Pendulum (HLIP) are simple and computationally efficient, they lose whole-body expressiveness. Meanwhile, recent advances in Contact-Implicit Model Predictive Control (CI-MPC) enable robots to plan through multiple hybrid contact modes, but remain vulnerable to local minima and require significant tuning. We propose a control framework that combines the strengths of HLIP and CI-MPC. The reduced-order model generates a nominal gait, while CI-MPC manages the whole-body dynamics and modifies the contact schedule as needed. We demonstrate the effectiveness of this approach in simulation with a novel 24 degree-of-freedom humanoid robot: Achilles. Our proposed framework achieves rough terrain walking, disturbance recovery, robustness under model and state uncertainty, and allows the robot to interact with obstacles in the environment, all while running online in real-time at 50 Hz.

Abstract:
Tactile sensing and the manipulation of delicate objects are critical challenges in robotics. This study presents a vision-based magnetic-actuated whisker array sensor that integrates these functions. The sensor features eight whiskers arranged circularly, supported by an elastomer membrane and actuated by electromagnets and permanent magnets. A camera tracks whisker movements, enabling high-resolution tactile feedback. The sensor's performance was evaluated through object classification and grasping experiments. In the classification experiment, the sensor approached objects from four directions and accurately identified five distinct objects with a classification accuracy of 99.17% using a Multi-Layer Perceptron model. In the grasping experiment, the sensor tested configurations of eight, four, and two whiskers, achieving the highest success rate of 87% with eight whiskers. These results highlight the sensor's potential for precise tactile sensing and reliable manipulation.

Abstract:
The application of vision-language models (VLMs) has achieved impressive success in various robotics tasks. However, there are few explorations for foundation models used in quadruped robot navigation through terrains in 3D environments. We introduce SARO (Space-Aware Robot System for Terrain Crossing), an innovative system composed of a high-level reasoning module, a closed-loop sub-task execution module, and a low-level control policy. It enables the robot to navigate across 3D terrains and reach the goal position. For high-level reasoning and execution, we propose a novel algorithmic system taking advantage of a VLM, with a design of task decomposition and a closed-loop sub-task execution mechanism. For low-level locomotion control, we utilize the Probability Annealing Selection (PAS) method to effectively train a control policy by reinforcement learning. Numerous experiments show that our whole system can accurately and robustly navigate across several 3D terrains, and its generalization ability ensures the applications in diverse indoor and outdoor scenarios and terrains. Appendix and Videos can be found in project page: https://saro-vlm.github.io/.

Abstract:
Ensuring constraint satisfaction is a key requirement for safety-critical systems, which include most robotic platforms. For example, constraints can be used for modeling joint position/velocity/torque limits and collision avoidance. Constrained systems are often controlled using Model Predictive Control, because of its ability to naturally handle constraints, relying on numerical optimization. However, ensuring constraint satisfaction is challenging for nonlinear systems/constraints. A well-known tool to make controllers safe is the so-called control-invariant set (a.k.a. safe set). In our previous work, we have shown that safety can be improved by letting the safe-set constraint recede along the MPC horizon. In this paper, we push that idea further by exploiting parallel computation to improve safety. We solve several MPC problems at the same time, where each problem instantiates the safe-set constraint at a different time step along the horizon. Finally, the controller can select the best solution according to some user-defined criteria. We validated this idea through extensive simulations with a 3 -joint robotic arm, showing that significant improvements can be achieved in terms of safety and performance, even using as little as 4 computational cores.

Abstract:
Table tennis robots have significantly advanced in performance owing to the rapid progress in deep learning and reinforcement learning technologies. However, these advancements often require a large number of training samples. Besides, research focused on the robot serve task remains relatively limited. In response to these problems, this paper proposes a sample-efficient adjustment-learning (SEAL) method for the serve task inspired by human experience in table tennis, which can inherently augment the available training samples without the need for additional sample collection. The adjustment learning does not require complex network structures but demonstrates superior performances. The models trained by adjustment learning have good generalization and robustness, that can adapt to different serve styles and reduce system transfer errors very efficiently. In addition, the random interpolation method during dataset generation stage is introduced, and the effectiveness of simultaneous learning in both joint space and Cartesian space is also demonstrated. For specific serve task, an accuracy of less than \mathbf3 0 ~ m m to any designated position at the first shot is achieved.

Abstract:
Coordinated payload transport via a fleet of modular wheeled mobile robots offers flexibility for handling larger loads in indoor and outdoor environments. Biped-wheeled robots have recently emerged as a viable architecture for an independent/stand-alone wheeled mobile robot. In this work, we explore the use of two biped-wheeled robots that can leverage their mobility and maneuvarability for enhanced spatial pose control and stabilization for various payload transport tasks. However, coordinated control of multiple articulated wheeled robots for path tracking of a payload presents significant (and potentially competing) challenges, including kinematic redundancy, stability concerns, relative motion between the payload and robots, and precise motion control to achieve effective coordination. To address these challenges, we propose a Deep Reinforcement Learning (DRL) framework to develop the motion-plans for the system. In particular, this approach generates the ego robot's body twist and the follower robot's relative twist with respect to the ego robot. By formulating the action space of the follower robot as a relative twist, our approach facilitates pairwise interactions between robots. Furthermore, we use only relative pose information and the errors as states for the DRL controller, thereby making it agnostic to initial conditions and avoiding explicit dependency on absolute pose. We validate our approach through simulations conducted in Isaac Sim and on hardware using Diablo biped-wheeled robots with zero-shot transfer, demonstrating effective payload path tracking across varying parameters.

Abstract:
Robot soccer, in its full complexity, poses an unsolved research challenge. Current solutions heavily rely on engineered heuristic strategies, which lack robustness and adaptability. Deep reinforcement learning has gained significant traction in various complex robotics tasks such as locomotion, manipulation, and competitive games (e.g., AlphaZero, OpenAI Five), making it a promising solution to the robot soccer problem. This paper introduces MARLadona. A decentralized multi-agent reinforcement learning (MARL) training pipeline capable of producing agents with sophisticated team play behavior, bridging the shortcomings of heuristic methods. Furthermore, we created an open-source multi-agent soccer environment. Utilizing our MARL framework and a modified global entity encoder (GEE) as our core architecture, our approach achieves a 66.8 % win rate against HELIOS agent, which employs a state-of-the-art heuristic strategy. In addition, we provided an in-depth analysis of the policy behavior and interpreted the agent's intention using the critic network.

Abstract:
Visual localization involves estimating a query image's 6-DoF (degrees of freedom) camera pose, which is a fundamental component in various computer vision and robotic tasks. This paper presents LoGS, a vision-based localization pipeline utilizing the 3D Gaussian Splatting (GS) technique as scene representation. This novel representation allows high-quality novel view synthesis. During the mapping phase, structure-from-motion (SfM) is applied first, followed by the generation of a GS map. During localization, the initial position is obtained through image retrieval, local feature matching coupled with a PnP solver, and then a high-precision pose is achieved through the analysis-bysynthesis manner on the GS map. Experimental results on four large-scale datasets demonstrate the proposed approach's SoTA accuracy in estimating camera poses and robustness under challenging few-shot conditions. Codes can be found at: https://github.com/RPL-CS-UCL/gs_localization.

Abstract:
Skeleton-based action recognition (SAR) models often centralize skeleton data, increasing significant privacy concerns. To address this, decentralized training models for SAR have been advanced, particularly using federated learning (FL), a research area of considerable value with wide-ranging applications, including human-robot interaction, camera-enabled devices, and security surveillance. However, FL-based SAR faces the challenge of substantial accuracy degradation due to data poisoning attacks; thus, it requires the identification of malicious clients. This paper introduces an innovative approach for detecting data poisoning attacks in federated SAR, called FedDet. The method involves creating prototypes of perspective transform and exchanging these matrices between the clients and server to identify the malicious client. Additionally, a prototype-guided attack detector is de-veloped, incorporating spatiotemporal matching to analyze the correlation between prototype skeleton data. Experimental results on FL frameworks and SAR models demonstrate that the proposed approach outperforms existing models. Our code is available at https://github.com/alsgur0720/federated-detection.

Abstract:
Environmental monitoring is used to characterize the health and relationship between organisms and their environments. In forest ecosystems, robots can serve as platforms to acquire such data, even in hard-to-reach places where wire-traversing platforms are particularly promising due to their efficient displacement. This paper presents the RaccoonBot, which is a novel autonomous wire-traversing robot for persistent environmental monitoring, featuring a fail-safe mechanical design with a self-locking mechanism in case of electrical shortage. The robot also features energy-aware mobility through a novel Solar tracking algorithm, that allows the robot to find a position on the wire to have direct contact with solar power to increase the energy harvested. Experimental results validate the electro-mechanical features of the RaccoonBot, showing that it is able to handle wire perturbations, different inclinations, and achieving energy autonomy.

Abstract:
Some applications, such as surgical interventions, require that potential soft robots have the capability to alter their shape and enhance their force output on demand. This paper presents an antagonistic stiffening mechanism combining pneumatic actuation with tendon locking to achieve configuration- and stiffness control. Elongation of a soft pneumatic section, resulting from air actuation, is opposed by constraining the length of integrated tendons. These tendons can be locked in length by pneumatically activated levers at the base of each segment. Hence, tendon locking will not affect the configuration of other segments of a multi-segment manipulator. Our concept achieves a stiffness increase of up to 201.7% and a larger, more uniform radial workspace compared to the widely used pneumatic actuation concept while maintaining the low technical effort required for actuation. We also demonstrate how our actuation concept enables independent control of stiffness levels for individual segments of a multi-segment manipulator and their MR compatibility.

Abstract:
Spatial programming of magnetic soft materials holds immense potential for wide ranging applications in soft robotics, minimally invasive medicine, and haptic interfaces. Despite tremendous and rapid progress in encoding spatially resolved magnetization directions over soft structures, the currently available approaches employ sequential encoding, resulting in slow and tedious processes with limited throughput. In this paper, we present a rapid and parallel magnetic programming strategy based on digitally processed laser heating. Heating above the Curie temperature of the magnetic microparticles embedded within the soft material allows their facile magnetization in desired directions via small external magnetic fields. To achieve parallel and rapid magnetic programming, we developed an integrated digital laser processing and magnetic field generation system, facilitating generation of desired shapes and patterns at high-resolution. Performance of the pattern generation and magnetic soft material are experimentally evaluated. Employing the described magnetic programming framework, shape-morphing of magnetic soft structures with varying magnetic profiles are shown. The proposed approach establishes a rapid and facile encoding procedure with high-throughput magnetic programming potential.

Abstract:
This paper presents a framework designed to tackle a range of planning problems arise in manipulation, which typically involve complex geometric-physical reasoning related to contact and dynamic constraints. We introduce the Contact Factor Graph (CFG) to graphically model these diverse factors, enabling us to perform inference on the graphs to approximate the distribution and sample appropriate solutions. We propose a novel approach that can incorporate various phenomena of contact manipulation as differentiable factors, and develop an efficient inference algorithm for CFG that leverages this differentiability along with the conditional probabilities arising from the structured nature of contact. Our results demonstrate the capability of our framework in generating viable samples and approximating posterior distributions for various manipulation scenarios.

Abstract:
Ultrasound (US) probe localization relative to the examined subject is essential for freehand 3D US imaging, which offers significant clinical value due to its affordability and unrestricted field of view. However, existing methods often rely on expensive tracking systems or bulky probes, while recent US image-based deep learning methods suffer from accumulated errors during probe maneuvering. To address these challenges, this study proposes a versatile, cost-effective probe pose localization method for freehand 3D US imaging, utilizing two lightweight cameras. To eliminate accumulated errors during US scans, we introduce PoseNet, which directly predicts the probe's 6 D pose relative to a preset world coordinate system based on camera observations. We first jointly train pose and camera image encoders based on pairs of 6 D pose and camera observations densely sampled in simulation. This will encourage each pair of probe pose and its corresponding camera observation to share the same representation in latent space. To ensure the two encoders handle unseen images and poses effectively, we incorporate a triplet loss that enforces smaller differences in latent features between nearby poses compared to distant ones. Then, the pose decoder uses the latent representation of the camera images to predict the probe's 6 D pose. To bridge the sim-to-real gap, in the real world, we use the trained image encoder and pose decoder for initial predictions, followed by an additional MLP layer to refine the estimated pose, improving accuracy. The results obtained from an arm phantom demonstrate the effectiveness of the proposed method, which notably surpasses state-of-the-art techniques, achieving average positional and rotational errors of 2.03 mm and 0.37°, respectively. Code:https://github.com/dianyeHuang/FreehandUS_Pose_Estimation

Abstract:
We present an autonomous aerial system for safe and efficient through-the-canopy fruit counting. Aerial robot applications in large-scale orchards face significant challenges due to the complexity of fine-tuning flight paths based on orchard layouts, canopy density, and plant variability. Through-the-canopy navigation is crucial for minimizing occlusion by leaves and branches but is more challenging due to the complex and dense environment compared to traditional over-the-canopy flights. Our system addresses these challenges by integrating: i) a high-fidelity simulation framework for global path planning, ii) a low-cost autonomy stack for canopy-level navigation and data collection, and iii) a robust workflow for fruit detection and counting using RGB images. We validate our approach through fruit counting with canopy-level aerial images and by demonstrating the autonomous navigation capabilities of our experimental vehicle.

Abstract:
This work addresses the challenge of personalizing trajectories generated in automated decision-making systems by introducing a resource-efficient approach that enables rapid adaptation to individual users' preferences. Our method leverages a pretrained conditional diffusion model with Preference Latent Embeddings (PLE), trained on a large, reward-free offline dataset. The PLE serves as a compact representation for capturing specific user preferences. By adapting the pretrained model using our proposed preference inversion method, which directly optimizes the learnable PLE, we achieve superior alignment with human preferences compared to existing solutions like Reinforcement Learning from Human Feedback (RLHF) and Low-Rank Adaptation (LoRA). To better reflect practical applications, we create a benchmark experiment using real human preferences on diverse, high-reward trajectories.

Abstract:
We propose C-LAIfO, a computationally efficient algorithm designed for imitation learning from videos in the presence of visual mismatch between agent and expert domains. We analyze the problem of imitation from expert videos with visual discrepancies, and introduce a solution for robust latent space estimation using contrastive learning and data augmentation. Provided a visually robust latent space, our algorithm performs imitation entirely within this space using off-policy adversarial imitation learning. We conduct a thorough ablation study to justify our design and test C-LAIfO on high-dimensional continuous robotic tasks. Additionally, we demonstrate how C LAIfO can be combined with other reward signals to facilitate learning on a set of challenging hand manipulation tasks with sparse rewards. Our experiments show improved performance compared to baseline methods, highlighting the effectiveness of C-LAIfO. To ensure reproducibility, we open source our code.

Abstract:
As scientific research in chemistry, materials science, and applied sciences becomes increasingly complex and data-driven, there is a growing need for efficient, scalable, and flexible automation to accelerate discoveries and reduce human burden and error in laboratories. We introduce the Experiment Orchestration System (EOS), an open-source software framework and runtime offering a comprehensive foundation for laboratory automation. EOS offers an extensible framework allowing users to define labs, devices, tasks, experiments, and optimization criteria using YAML and Python plugins, and also offers a distributed runtime for managing and executing automation. EOS has a central orchestrator that communicates with and controls laboratory equipment to execute tasks. EOS implements autonomous experiment campaigns, parameter optimization, task scheduling, result aggregation, and more. By providing a common infrastructure for laboratory automation, EOS aims to reduce automation implementation barriers and accelerate discoveries in science laboratories.

Abstract:
Animals exhibit remarkable adaptability in sensing their environments, employing strategies that optimize information gathering. For instance, silk moths adjust their wingflapping frequency to detect pheromones, while dogs modify their sniffing behavior by altering sniff height and frequency based on proximity to an odor source. Despite the potential to enhance odor detection for olfactory navigation by drawing inspiration from these natural mechanisms, many existing approaches focus on computationally intensive methods like multi-sensory integration or rely on multiple robots for odor localization, rather than leveraging embodied sensing. In this study, we propose an embodied adaptive sensing strategy that enhances odor detection by implementing an active odor sensor on a legged robot and applying a bio-inspired adaptive robot height control system for dynamically adapting the robot's height based on real-time gas concentration feedback. The control system employs a simple artificial hormone mechanism to regulate the robot height by processing gas concentration derivatives, mimicking biological adaptability. By utilizing the interaction between the active odor sensor, adaptive control system, and the legged body, this approach allows the robot to optimize its height online to capture the maximum gas concentration, thereby reducing the need for complex algorithms and high computational resources. As a result, it offers a more efficient solution for odor-driven tasks, with potential applications in real-world environments.

Abstract:
Precise identification of dynamic models in robotics is essential to support dynamic simulations, control design, friction compensation, output torque estimation, etc. A longstanding challenge remains in the development and identification of friction models for robotic joints, given the numerous physical phenomena affecting the underlying friction dynamics which result into nonlinear characteristics and hysteresis behaviour in particular. These phenomena proof difficult to be modelled and captured accurately using physical analogies alone. This has motivated researchers to shift from physics-based to data-driven models. Currently, these methods are still limited in their ability to generalize effectively to typical industrial robot deployement, characterized by high- and low-velocity operations and frequent direction reversals. Empirical observations motivate the use of dynamic friction models but these remain particulary challenging to establish. To address the current limitations, we propose to account for unidentified dynamics in the robot joints using latent dynamic states. The friction model may then utilize both the dynamic robot state and additional information encoded in the latent state to evaluate the friction torque. We cast this stochastic and partially unsupervised identification problem as a standard probabilistic representation learning problem. In this work both the friction model and latent state dynamics are parametrized as neural networks and are integrated in the conventional lumped parameter dynamic robot model. The complete dynamics model is directly learned from the noisy encoder measurements in the robot joints. We use the Expectation-Maximisation (EM) algorithm to find a Maximum Likelihood Estimate (MLE) of the model parameters. The effectiveness of the proposed method is validated in terms of open-loop prediction accuracy in comparison with baseline methods, using the Kuka KR6 R700 as a test platform.

Abstract:
Embodied AI agents responsible for executing interconnected, long-sequence household tasks often face difficulties with in-context memory, leading to inefficiencies and errors in task execution. To address this issue, we introduce KARMA, an innovative memory system that integrates longterm and short-term memory modules, enhancing large language models (LLMs) for planning in embodied agents through memory-augmented prompting. Karma distinguishes between long-term and short-term memory, with long-term memory capturing comprehensive 3D scene graphs as representations of the environment, while short-term memory dynamically records changes in objects' positions and states. This dualmemory structure allows agents to retrieve relevant past scene experiences, thereby improving the accuracy and efficiency of task planning. Short-term memory employs strategies for effective and adaptive memory replacement, ensuring the retention of critical information while discarding less pertinent data. Compared to state-of-the-art embodied agents enhanced with memory, our memory-augmented embodied AI agent improves success rates by 1.3 × and 2.3 × in Composite Tasks and Complex Tasks within the AI2-THOR simulator, respectively, and enhances task execution efficiency by 3.4 × and 62.7 ×. Furthermore, we demonstrate that KARMA's plug-and-play capability allows for seamless deployment on real-world robotic systems, such as mobile manipulation platforms. Through this plug-and-play memory system, KARMA significantly enhances the ability of embodied agents to generate coherent and contextually appropriate plans, making the execution of complex household tasks more efficient. Our code is available at https://github.com/WZX0Swarm0Robotics/KARMA/tree/master.

Abstract:
A team of multiple robots seamlessly and safely working in human-filled public environments requires adaptive task allocation and socially-aware navigation that account for dynamic human behavior. Current approaches struggle with highly dynamic pedestrian movement and the need for flexible task allocation. We propose Hyper-SAMARL, a hypergraph-based system for multi-robot task allocation and socially-aware navigation, leveraging multi-agent reinforcement learning (MARL). Hyper-SAMARL models the environmental dynamics between robots, humans, and points of interest (POIs) using a hypergraph, enabling adaptive task assignment and socially-compliant navigation through a hypergraph diffusion mechanism. Our framework, trained with MARL, effectively captures interactions between robots and humans, adapting tasks based on real-time changes in human activity. Experimental results demonstrate that Hyper-SAMARL outperforms baseline models in terms of social navigation, task completion efficiency, and adaptability in various simulated scenarios11The experimental videos and additional information about this work can be found at: https://sites.google.com/view/hyper-samarl..

Abstract:
Recent research has seen the advancement of drone depot models as a promising way to allocate drones for large-scale task completion. Applications of these drone depot models include data collection, environmental monitoring, package delivery, and more. This paper focuses on sharing agents between static depots for task allocation based on expected demand. We model the problem as a Binary Nonlinear Program, then derive an iterative neighborhood search based on solving a series of Binary Linear Programs to drive towards the optimal configuration of agents for each depot. We show that our method is more tractable than a Branch and Bound approach for this model as problem complexity grows. We also show through simulations that with near optimal allocation between local depots, the overall system performance will outperform greedy and non-sharing approaches.

Abstract:
Learning by demonstration techniques are gaining popularity within the human-robot collaboration (HRC) scenarios. This is because they allow to deeply exploit the versatility of collaborative robots. In this context, dynamic motion primitives (DMPs) have become a standard method for enabling human operators to easily teach tasks to robots. However, DMPs have two main limitations. First, they may encounter difficulties in generalizing some tasks, which can lead to non-intuitive behavior. Second, it is not guaranteed that the output of DMPs is compliant with ISO/TS 15066, which provides guidelines for assessing safety in collaborative scenarios. This work aims to address these two issues by introducing a novel control pipeline. This pipeline leverages a new variant of DMPs, called Swap DMPs (SDMPs), introduced in this work. The SDMPs enable a more intuitive behavior when the robot reproduces the learned task. Subsequently, SDMPs are encoded into a new optimization problem that ensures the robot complies with the Speed and Separation Monitoring (SSM) collaborative mode. The proposed approach has been experimentally validated and compared with traditional DMPs in both simulation and a real scenario, where a UR5e and a human operator collaborate on a polishing task.

Abstract:
This paper presents a robust control algorithm for precise orientation control of robot manipulators using a disturbance observer (DOB) specifically designed for orientation dynamics. Our approach addresses the challenges of 3D orientation control by incorporating various orientation representations, such as Euler angles, quaternions, and exponential coordinates, and analyzing their impact on DOB performance. Through theoretical analysis and experimental validation, we demonstrate the effectiveness of our method in achieving high-precision orientation control under uncertainties and disturbances. This work offers a comprehensive framework for robust orientation control, advancing the application of DOB in complex robotic tasks.

Abstract:
This paper presents a novel gripper capable of actively changing both shape and firmness. The gripper increases its grasp ability by changing its finger posture and firmness suitable for given target objects. In the proposed gripper, each finger is constructed by serially connecting multiple Juzu units. By controlling the angles between neighboring Juzu units individually using two actuators used for sending and bending, arbitrary finger shapes can be generated. In addition, by controlling the tension of the wire that penetrated all Juzu units in each finger, the friction between Juzu units is adjusted and the firmness of finger can be varied. A prototype gripper was designed and developed, and experiments to evaluate the capabilities of changing shape and firmness were conducted. Furthermore, through experiments of preshaping and grasping for various objects with different shape and size, the validity of the proposed method was demonstrated.

Abstract:
Animal-robot interaction (ARI) remains an unexplored challenge in robotics, as robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations. Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidirectional communication. To bridge this gap, we present the MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a novel multimodal dataset that captures detailed interactions between a legged robot and cows. The dataset includes synchronized RGB-D streams from multiple viewpoints, annotated with body pose and activity labels across interaction phases, offering an unprecedented level of detail for ARI research. Additionally, we introduce a full-body pose estimation model tailored for quadruped animals, capable of tracking 39 keypoints with a mean average precision (mAP) of 92.7%, outperforming existing benchmarks in animal pose estimation. The MBE-ARI dataset and our pose estimation framework lay a robust foundation for advancing research in animal-robot interaction, providing essential tools for developing perception, reasoning, and interaction frameworks needed for effective collaboration between robots and animals. The dataset and resources are publicly available at https://github.com/RISELabPurdue/MBE-ARI/, inviting further exploration and development in this critical area.

Abstract:
Traditional approaches to motion modeling for skid-steer robots struggle to capture nonlinear tire-terrain dynamics, especially during high-speed maneuvers. In this paper, we tackle such nonlinearities by enhancing a dynamic unicycle model with Gaussian Process (GP) regression outputs. This enables us to develop an adaptive, uncertainty-informed navigation formulation. We solve the resultant stochastic optimal control problem using a chance-constrained Model Predictive Path Integral (MPPI) control method. This approach formulates obstacle avoidance and path-following as chance constraints, accounting for residual uncertainties from the GP to ensure safety and reliability in control. Leveraging GPU acceleration, we efficiently manage the non-convex nature of the problem, ensuring real-time performance. Our approach unifies path-following and obstacle avoidance across different terrains, unlike prior works which typically focus on one or the other. We compare our GP-MPPI method against unicycle and data-driven kinematic models within the MPPI framework. In simulations, our approach shows superior tracking accuracy and obstacle avoidance. We further validate our approach through hardware experiments on a skid-steer robot platform, demonstrating its effectiveness in high-speed navigation. The GPU implementation of the proposed method and supplementary video footage are available at https://stochasticmppi.github.io.

Abstract:
Movement Primitives (MPs) are a well-established method for representing and generating modular robot trajectories. This work presents FA-ProDMP, a novel approach that introduces force awareness to Probabilistic Dynamic Movement Primitives (ProDMP). FA-ProDMP adapts trajectories during runtime to account for measured and desired forces, offering smooth trajectories and capturing position and force correlations across multiple demonstrations. FA-ProDMPs support multiple axes of force, making them agnostic to Cartesian or joint space control. This versatility makes FA-ProDMP a valuable tool for learning contact rich manipulation tasks, such as power plug insertion. To reliably evaluate FA-ProDMP, this work additionally introduces a modular, 3D printed task suite called POEMPEL, inspired by the popular Lego Technic pins. POEMPEL mimics industrial peg-in-hole assembly tasks with force requirements and offers multiple parameters of adjustment, such as position, orientation and plug stiffness level, thereby varying the direction and amount of required forces. Our experiments demonstrate that FA-ProDMP outperforms other MP formulations on the POEMPEL setup and a electrical power plug insertion task, thanks to its replanning capabilities based on measured forces. These findings highlight how FA-ProDMP enhances the performance of robotic systems in contact-rich manipulation tasks.

Abstract:
Recent advances in Vision Language Models (VLMs) have enhanced their application in robotics, encompassing both high-level task planning and low-level action control. Despite their strong performance across various robotic tasks, even for zero-shot scenarios, most VLM applications remain open-loop, adhering to a plan-and-execute paradigm without mechanisms to assess task completion. To address this limitation, we propose GenCo, a Generate-Correct framework designed to automate a peg-in-hole task using a UR5e robot. This framework integrates an VLM-based motion generator and motion expert, working collaboratively to refine and correct actions during robotic task execution. Both VLM agents are fine-tuned using the pre-trained LLaVA, enhancing adaptability and scaling efficiently to diverse tasks. Our experiments demonstrate the adaptiveness of the framework, improving the success rate for the peg-in-hole task by 12.75% compared to a single VLM open-loop method. Notably, in unseen scenarios, the success rate for a triangular peg was increased by 15%, and for a random-shaped peg by 17%, underscoring the system's effectiveness in handling novel tasks. Adaptive testing under varied camera positions demonstrated robust performance, affirming reliability despite shifts in the visual input. The framework is also designed to be lightweight and efficient, facilitating broader adoption and practical deployment. Access to our code and model is provided here: https://github.com/Zhengxuez/generate_correct

Abstract:
Embodied navigation in unknown environments presents the significant challenge of integrating tasks with multimodal goals into a unified framework. In this paper, we propose the Multimodal Semantic Navigation on Relative Metric Intention Graph (Ms. NAMI), a framework that integrates various navigation tasks with multimodal goals based on a relative topo-metric intention graph. A reinforcement learning based policy with a concise action space, consisting of frontier nodes and intention nodes, is designed to guide the agent to select reasonable sub-goals. A sparse reward design is introduced to reduce bias during training. Additionally, several engineering optimizations are implemented to enhance overall performance. The experimental results indicate that our method can achieve robust navigation performance in a variety of unknown environments.

Abstract:
Interactive sensors are an important component of robotic systems but often require manual replacement due to wear and tear. Automating this process can enhance system autonomy and facilitate long-term deployment. We developed an autonomous sensor exchange and calibration system for an agriculture crop monitoring robot that inserts a nitrate sensor into cornstalks. A novel gripper and replacement mechanism, featuring a reliable funneling design, were developed to enable efficient and reliable sensor exchanges. To maintain consistent nitrate sensor measurement, an on-board sensor calibration station was integrated to provide in-field sensor cleaning and calibration. The system was deployed at the Ames Curtis Farm in June 2024, where it successfully inserted nitrate sensors with high accuracy into \mathbf3 0 cornstalks with a \mathbf7 7 % success rate.

Abstract:
Agriculture, fundamental for human sustenance, faces unprecedented challenges. The need for efficient, human-cooperative, and sustainable farming methods has never been greater. The core contributions of this work involve leveraging Active Vision (AV) techniques and ZeroShot Learning (ZSL) to improve the robot's ability to perceive and interact with agricultural environment in the context of fruit harvesting. The AV Pipeline implemented within ROS 2 integrates the Next-Best View (NBV) Planning for 3D environment reconstruction through a dynamic 3D Occupancy Map. Our system allows the robotics arm to dynamically plan and move to the most informative viewpoints and explore the environment, updating the 3D reconstruction using semantic information produced through ZSL models. Simulation and real-world experimental results demonstrate our system's effectiveness in complex visibility conditions, outperforming traditional and static predefined planning methods. ZSL segmentation models employed, such as YOLO World + EfficientViT SAM, exhibit high-speed performance and accurate segmentation, allowing flexibility when dealing with semantic information in unknown agricultural contexts without requiring any fine-tuning process.

Abstract:
Vision-Language Models (VLM) can generate plausible high-level plans when prompted with a goal, the context, an image of the scene, and any planning constraints. However, there is no guarantee that the predicted actions are geometrically and kinematically feasible for a particular robot embodiment. As a result, many prerequisite steps such as opening drawers to access objects are often omitted in their plans. Robot task and motion planners can generate motion trajectories that respect the geometric feasibility of actions and insert physically necessary actions, but do not scale to everyday problems that require common-sense knowledge and involve large state spaces comprised of many variables. We propose VLM-TAMP, a hierarchical planning algorithm that leverages a VLM to generate both semantically-meaningful and horizon-reducing intermediate subgoals that guide a task and motion planner. When a subgoal or action cannot be refined, the VLM is queried again for replanning. We evaluate VLMTAMP on kitchen tasks where a robot must accomplish cooking goals that require performing 30-50 actions in sequence and interacting with up to 21 objects. VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences, both in terms of success rates (50 to 100 % versus 0 %) and average task completion percentage (72 to 100 % versus 15 to 45 %). See project site https://zt-yang.github.io/vlm-tamp-robot/ for more information.

Abstract:
D Gaussian Splatting (3DGS) has gained significant attention for 3D scene reconstruction, but still suffers from complex outdoor environments, especially under adverse weather. This is because 3DGS treats the artifacts caused by adverse weather as part of the scene and will directly reconstruct them, largely reducing the clarity of the reconstructed scene. To address this challenge, we propose WeatherGS, a 3DGSbased framework for reconstructing clear scenes from multiview images under different weather conditions. Specifically, we explicitly categorize the multi-weather artifacts into the dense particles and lens occlusions that have very different characters, in which the former are caused by snowflakes and raindrops in the air, and the latter are raised by the precipitation on the camera lens. In light of this, we propose a dense-to-sparse preprocess strategy, which sequentially removes the dense particles by an Atmospheric Effect Filter (AEF) and then extracts the relatively sparse occlusion masks with a Lens Effect Detector (LED). Finally, we train a set of 3D Gaussians by the processed images and generated masks for excluding occluded areas, and accurately recover the underlying clear scene by Gaussian splatting. We conduct a diverse and challenging benchmark to facilitate the evaluation of 3D reconstruction under complex weather scenarios. Extensive experiments on this benchmark demonstrate that our WeatherGS consistently produces high-quality, clean scenes across various weather scenarios, outperforming existing state-of-the-art methods. See project: https://jumponthemoon.github.io/weather-gs.

Abstract:
Sim-to-Real refers to the process of transferring policies learned in simulation to the real world, which is crucial for achieving practical robotics applications. However, recent Sim2real methods either rely on a large amount of augmented data or large learning models, which is inefficient for specific tasks. In recent years, with the emergence of radiance field reconstruction methods, especially 3D Gaussian splatting, it has become possible to construct realistic real-world scenes. To this end, we propose RL-GSBridge, a novel real-to-sim-to-real framework which incorporates 3D Gaussian Splatting into the conventional RL simulation pipeline, enabling zero-shot sim-to-real transfer for vision-based deep reinforcement learning. We introduce a mesh-based 3D GS method with soft binding constraints, enhancing the rendering quality of mesh models. Then utilizing a GS editing approach to synchronize the rendering with the physics simulator, RL-GSBridge could reflect the visual interactions of the physical robot accurately. Through a series of sim-to-real experiments, including grasping and pick-and-place tasks, we demonstrate that RL-GSBridge maintains a satisfactory success rate in real-world task completion during sim-to-real transfer. Furthermore, a series of rendering metrics and visualization results indicate that our proposed mesh-based 3D GS reduces artifacts in unstructured objects, demonstrating more realistic rendering performance.

Abstract:
We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.

Abstract:
This paper presents a Model Predictive Control (MPC) formulation for bipedal footstep planning based on the Linear Inverted Pendulum (LIP) model, ensuring recursive feasibility when navigating restricted regions. The proposed approach incorporates capturability and introduces a new constraint that forces the Divergent Component of Motion (DCM) into a finite-step capture region, adjusted between consecutive MPC calls. This constraint enables the MPC to anticipate beyond its prediction horizon, preventing collisions with the walking surface boundaries. We validate the approach through high-fidelity simulations with the bipedal robot Digit, demonstrating recursively feasible MPC footstep planning in restricted regions. Future efforts will extend the approach to general polytopic constraints, thereby facilitating footstep planning in cluttered environments while preserving the MPC's recursive feasibility.

Abstract:
Open-loop stable limit cycles are foundational to legged robotics, providing inherent self-stabilization that minimizes the need for computationally intensive feedback-based gait correction. While previous methods have primarily targeted specific robotic models, this paper introduces a general framework for rapidly generating limit cycles across various dynamical systems, with the flexibility to impose arbitrarily tight stability bounds. We formulate the problem as a single-stage constrained optimization problem and use Direct Collocation to transcribe it into a nonlinear program with closed-form expressions for constraints, objectives, and their gradients. Our method supports multiple stability formulations. In particular, we tested two popular formulations for limit cycle stability in robotics: (1) based on the spectral radius of a discrete return map, and (2) based on the spectral radius of the monodromy matrix, and tested five different constraintsatisfaction formulations of the eigenvalue problem to bound the spectral radius. We compare the performance and solution quality of the various formulations on a robotic swing-leg model, highlighting the Schur decomposition of the monodromy matrix as a method with broader applicability due to weaker assumptions and stronger numerical convergence properties. As a case study, we apply our method on a hopping robot model, generating open-loop stable gaits in under 2 seconds on an Intel®Core i7-6700K, while simultaneously minimizing energy consumption even under tight stability constraints.

Abstract:
Percutaneous coronary intervention (PCI) involves the insertion of a catheter or guidewire into a blood vessel of a patient, which poses a problem as a doctor is exposed to radiation during the procedure. The use of assistive robots has been proposed to address this issue. Furthermore, recent research is progressing toward complete autonomous navigation using deep reinforcement learning (DRL). Nevertheless, existing algorithms face limitations when operating in numerous unseen environments close to real PCI. This study proposes a robust DRL framework for image-based guidewire navigation to overcome the limitation. We introduce a subtasks strategy and domain randomization to improve robustness in various environments. The subtasks strategy consistently addresses complex global tasks by breaking them into subtasks designed using local maps, allowing them to be robustly solved by a single agent. Domain randomization is applied to handle real PCI issues, including variations in vessel geometry, guidewire deformation, and camera settings. By integrating the two novel methods, our DRL algorithm demonstrates superior performance compared to existing methods across various challenging simulation and phantom environments, validating its effectiveness in real-world scenarios. A video of our experiment is available at https://youtu.be/93Q88gESzOY.

Abstract:
Electromyography (EMG) signals are widely used as control inputs for myoelectric exoskeletons. However, muscle fatigue, which can result from prolonged use or heavy loads, significantly affects muscle activation patterns, leading to reduced estimation accuracy. To address this challenge, we propose an adversarial learning framework to enhance grip force estimation under fatigue conditions. The framework consists of three key components: a domain-invariant feature extractor to mitigate domain shifts between non-fatigue and fatigue states, a force estimator to predict grip forces from these domain-invariant features, and a domain discriminator to distinguish between the two domains. The proposed method was evaluated on a dataset collected from eight participants performing gripping tasks under both non-fatigue and fatigue conditions, during which high-density EMG signals and grip forces were recorded simultaneously. Experimental results demonstrated that our method significantly reduced the root mean square error (RMSE) from 0.264 to 0.127, outperforming a baseline model consisting of only the feature extractor and force estimator (p < 0.01). Additionally, the proposed approach exhibited consistent performance across all participants, highlighting its robustness and generalizability. These findings suggest that the proposed adversarial learning framework effectively enhances grip force estimation accuracy under muscle fatigue, offering a promising solution for improving the reliability and usability of myoelectric exoskeletons.

Abstract:
Tethers are an underutilized tool in multi-robot systems: tethers can provide power, facilitate retrieval and sensing, and be used to manipulate and gather objects. Starting with the simplest possible configuration, our work explores how agents linked in series by flexible, passive, fixed-length tethers, can use those tethers as sensors to achieve distributed formation control. In this study, we extend upon previous work to show the applicability of strain-coordinated formation control for encapsulation and migration along a global gradient as well as the trade-offs between formation control and taxis in an obstacle-laden environment. Our results indicate significant potential for tethered robot collectives: versatile behaviors that can work on simple, resource-constrained robots or serve as a fallback mechanism in case more sophisticated means of coordination fail.

Abstract:
The online adaptation of exoskeleton control based on muscle activity sensing offers a promising approach to personalizing exoskeleton behavior based on the user's biosignals. While electromyography (EMG)-based methods have demonstrated improvements in joint torque estimation, EMG sensors require direct skin contact and extensive post-processing. In contrast, force myography (FMG) measures normal forces resulting from changes in muscle volume due to muscle activity. We propose an FMG-based method to estimate knee and ankle joint torques by integrating joint angles and velocities with muscle activity data. We learn a model for joint torque estimation using Gaussian process regression (GPR). The effectiveness of the proposed FMG-based method is validated on isokinetic motions performed by ten participants. The model is compared to a baseline model that uses only joint angle and velocity as well as a model augmented by EMG data. The results indicate that incorporating FMG into exoskeleton control can improve the estimation of joint torque for the ankle and knee joints in novel task characteristics within a single participant. Although the findings suggest that this approach may not improve the generalizability of estimates between multiple participants, they highlight the need for further research into its potential applications in exoskeleton control.

Abstract:
Traffic sign detection plays an essential role in advanced driver assistance system (ADAS) or self-driving vehicles. Typically, deep neural networks are employed to analyze road scene images captured by an onboard camera. However, due to the significant variation in appearance of different traffic signs, the classification of high similarity patterns is still a challenging task. To address these issues, this paper presents an end-to-end traffic sign detection framework based on DETR. The proposed network incorporates data augmentation and negative sample learning to mitigate the problem of data imbalance and enhance the model recognition capability effectively. An UASPP module (Upsample Atrous Pyramid Pooling) is introduced to integrate multi-scale features and global information. In the experiments, the performance evaluation has demonstrated the improvement of mAP by 3.9% on TT100K and 36.3% on GTSDB compared to state-of-the-art methods. The code and datasets are available at https://github.com/chinglun/TS-DETR.

Abstract:
In most contact-rich manipulation tasks, humans apply time-varying forces to the target object, compensating for inaccuracies in the vision-guided hand trajectory. However, current robot learning algorithms primarily focus on trajectory-based policy, with limited attention given to learning force-related skills. To address this limitation, we introduce ForceMimic, a force-centric robot learning system, providing a natural, force-aware and robot-free robotic demonstration collection system, along with a hybrid force-motion imitation learning algorithm for robust contact-rich manipulation. Using the proposed ForceCapture system, an operator can peel a zucchini in 5 minutes, while force-feedback teleoperation takes over 13 minutes and struggles with task completion. With the collected data, we propose HybridIL to train a force-centric imitation learning model, equipped with hybrid force-position control primitive to fit the predicted wrench-position parameters during robot execution. Experiments demonstrate that our approach enables the model to learn a more robust policy under the contact-rich task of vegetable peeling, increasing the success rates by 54.5% relatively compared to state-of-the-art pure-vision-based imitation learning. Hardware, code, data and more results can be found on the project website at https://forcemimic.github.io.

Abstract:
This paper presents a novel approach to addressing the control challenges of underactuated systems, focusing on the swing-up and stabilisation tasks on the double pendulum system. We propose the Average-Reward Entropy Advantage Policy Optimisation (AR-EAPO), a model-free reinforcement learning (RL) algorithm that integrates the strengths of the average-reward RL and the maximum entropy RL (MaxEnt RL). The average reward criterion allows the use of a simple reward function by naturally promoting the longterm goals, at the same time MaxEnt RL encourages the robustness of the policy. We validate our approach through simulations, consistently outperforming standard RL baselines and traditional control methods. Also, we provide preliminary test results on real double pendulum hardware. Additional experiments on MuJoCo environments further demonstrate AR-EAPO's efficacy on general continuous control tasks. This work underscores the potential of the average-reward criterion in simplifying control design while achieving superior results.

Abstract:
We present Points2Plans, a framework for composable planning with a relational dynamics model that enables robots to solve long-horizon manipulation tasks from partial-view point clouds. Given a language instruction and a point cloud of the scene, our framework initiates a hierarchical planning procedure, whereby a language model generates a high-level plan and a sampling-based planner produces constraint-satisfying continuous parameters for manipulation primitives sequenced according to the high-level plan. Key to our approach is the use of a relational dynamics model as a unifying interface between the continuous and symbolic representations of states and actions, thus facilitating language-driven planning from high-dimensional perceptual input such as point clouds. Whereas previous relational dynamics models require training on datasets of multi-step manipulation scenarios that align with the intended test scenarios, Points2Plans uses only single-step simulated training data while generalizing zero-shot to a variable number of steps during real-world evaluations. We evaluate our approach on tasks involving geometric reasoning, multi-object interactions, and occluded object reasoning in both simulated and real-world settings. Results demonstrate that Points2Plans offers strong generalization to unseen long-horizon tasks in the real world, where it solves over 85% of evaluated tasks while the next best baseline solves only 50%.

Abstract:
Robotic assembly tasks remain an open challenge due to their long horizon nature and complex part relations. Behavior trees (BTs) are increasingly used in robot task planning for their modularity and flexibility, but creating them manually can be effort-intensive. Large language models (LLMs) have recently been applied to robotic task planning for generating action sequences, yet their ability to generate BTs has not been fully investigated. To this end, we propose LLM-as-BT-Planner, a novel framework that leverages LLMs for BT generation in robotic assembly task planning. Four in-context learning methods are introduced to utilize the natural language processing and inference capabilities of LLMs for producing task plans in BT format, reducing manual effort while ensuring robustness and comprehensibility. Additionally, we evaluate the performance of fine-tuned smaller LLMs on the same tasks. Experiments in both simulated and real-world settings demonstrate that our framework enhances LLMs' ability to generate BTs, improving success rate through in-context learning and supervised fine-tuning.

Abstract:
The autoregressive world model exhibits robust generalization capabilities in vectorized scene understanding but encounters difficulties in deriving actions due to insufficient uncertainty modeling and self-delusion. In this paper, we explore the feasibility of deriving decisions from an autoregres-sive world model by addressing these challenges through the formulation of multiple probabilistic hypotheses. We propose LatentDriver, a framework models the environment's next states and the ego vehicle's possible actions as a mixture distribution, from which a deterministic control signal is then derived. By incorporating mixture modeling, the stochastic nature of decision-making is captured. Additionally, the self-delusion problem is mitigated by providing intermediate actions sampled from a distribution to the world model. Experimen-tal results on the recently released closed-loop benchmark Waymax demonstrate that LatentDriver surpasses state-of-the-art reinforcement learning and imitation learning methods, achieving expert-level performance. The code and models will be made available at https://github.com/Sephirex-X/LatentDriver.

Abstract:
Imitation learning based planning tasks on the nuPlan dataset have gained great interest due to their potential to generate human-like driving behaviors. However, open-loop training on the nuPlan dataset tends to cause causal confusion during closed-loop testing, and the dataset also presents a longtail distribution of scenarios. These issues introduce challenges for imitation learning. To tackle these problems, we introduce CAFE-AD, a Cross-Scenario Adaptive Feature Enhancement for Trajectory Planning in Autonomous Driving method, designed to enhance feature representation across various scenario types. We develop an adaptive feature pruning module that ranks feature importance to capture the most relevant information while reducing the interference of noisy information during training. Moreover, we propose a cross-scenario feature interpolation module that enhances scenario information to introduce diversity, enabling the network to alleviate overfitting in dominant scenarios. We evaluate our method CAFEAD, on the challenging public nuPlan Test14-Hard closed-loop simulation benchmark. The results demonstrate that CAFEAD outperforms state-of-the-art methods including rule-based and hybrid planners, and exhibits the potential in mitigating the impact of long-tail distribution within the dataset. Additionally, we further validate its effectiveness in real-world environments. The code and models will be made available at https://github.com/AlniyatRui/CAFE-AD.

Abstract:
Risk assessment of a robot in controlled environments, such as laboratories and proving grounds, is a common means to assess, certify, validate, verify, and characterize the robots' safety performance before, during, and even after their commercialization in the real-world. A standard testing program that acquires the risk estimate is expected to be (i) repeatable, such that it obtains similar risk assessments of the same testing subject among multiple trials or attempts with the similar testing effort by different stakeholders, and (ii) reliable against a variety of testing subjects produced by different vendors and manufacturers. Both repeatability and reliability are fundamental and crucial for a testing algorithm's validity, fairness, and practical feasibility, especially for standardization. However, these properties are rarely satisfied or ensured, especially as the subject robots become more complex, uncertain, and varied. This issue was present in traditional risk assessments through Monte-Carlo sampling, and remains a bottleneck for the recent accelerated risk assessment methods, primarily those using importance sampling. This study aims to enhance existing accelerated testing frameworks by proposing a new algorithm that provably integrates repeatability and reliability with the already established formality and efficiency. It also features demonstrations assessing the risk of instability from frontal impacts, initiated by push-over disturbances on a controlled inverted pendulum and a 7-DoF planar bipedal robot Rabbit managed by various control algorithms.

Abstract:
Solving real-world complex tasks using reinforcement learning (RL) without high-fidelity simulation environments or large amounts of offline data can be quite challenging. Online RL agents trained in imperfect simulation environments can suffer from severe sim-to-real issues. Offline RL approaches although bypass the need for simulators, often pose demanding requirements on the size and quality of the offline datasets. The recently emerged hybrid offline-and-online RL provides an attractive framework that enables joint use of limited offline data and imperfect simulator for transferable policy learning. In this paper, we develop a new algorithm, called \mathrmH 2 \mathrmO+, which offers great flexibility to bridge various choices of offline and online learning methods, while also accounting for dynamics gaps between the real and simulation environments. Through extensive simulation and real-world robotics experiments, we demonstrate superior performance and flexibility of \mathbfH 2 O+ over advanced cross-domain online and offline RL algorithms.

Abstract:
Forestry machines operated in forest production environments face challenges when performing manipulation tasks, especially regarding the complicated dynamics of underactuated crane systems and the heavy weight of logs to be grasped. This study investigates the feasibility of using reinforcement learning for forestry crane manipulators in grasping and lifting heavy wood logs autonomously. We first build a simulator using Mujoco physics engine to create realistic scenarios, including modeling a forestry crane with 8 degrees of freedom from CAD data and wood logs of different sizes. We further implement a velocity controller for autonomous log grasping with deep reinforcement learning using a curriculum strategy. Utilizing our new simulator, the proposed control strategy exhibits a success rate of 96% when grasping logs of different diameters and under random initial configurations of the forestry crane. In addition, reward functions and reinforcement learning baselines are implemented to provide an open-source benchmark for the community in large-scale manipulation tasks. A video with several demonstrations can be seen at https://www.acin.tuwien.ac.at/en/d18a/.

Abstract:
Forestry machines are not easily accessible for experimentation or demonstration of research results. These mobile robots are massive, very expensive, and require a large outdoor space and permits to operate. These factors hinder conducting experiments on real forestry robots. Thus, it is essential to design experimental setups utilizing easily accessible robots in indoor labs that can effectively replicate the behavior of interest of a forestry machine. We design a setup to resemble a log-loader crane and grapple motions using a Kinova Jaco2 arm by manufacturing a specialized end-effector to attach passively to the Jaco2 arm. Passively attached grapple causes undesirable sway, which is problematic and dangerous in forestry. To address the sway problem, we employ dynamic programming to develop an anti-sway motion planner, and validate its performance for different point-to-point maneuvers in our experimental setup. We also repeat each experiment at least 6 times to ensure the repeatability and reliability of the experiments. The experimental results showcase the excellent sway-damping performance of our planner and also the very good repeatability of our experiments.

Abstract:
Accurate modelling of object deformations is crucial for a wide range of robotic manipulation tasks, where interacting with soft or deformable objects is essential. Current methods struggle to generalise to unseen forces or adapt to new objects, limiting their utility in real-world applications. We propose Shape-Space Deformer, a unified representation for encoding a diverse range of object deformations using template augmentation to achieve robust, fine-grained reconstructions that are resilient to outliers and unwanted artefacts. Our method improves generalization to unseen forces and can rapidly adapt to novel objects, significantly outperforming existing approaches. We perform extensive experiments to test a range of force generalisation settings and evaluate our method's ability to reconstruct unseen deformations. Our results demonstrate significant improvements in reconstruction accuracy and robustness. Our approach is suitable for real-time performance, making it ready for downstream manipulation applications.

Abstract:
In robotics applications, few-shot segmentation is crucial because it allows robots to perform complex tasks with minimal training data, facilitating their adaptation to diverse, real-world environments. However, pixel-level annotations of even small amount of images is highly time-consuming and costly. In this paper, we present a novel few-shot binary segmentation method based on bounding-box annotations instead of pixel-level labels. We introduce, ProMi, an efficient prototype-mixture-based method that treats the background class as a mixture of distributions. Our approach is simple, training-free, and effective, accommodating coarse annotations with ease. Compared to existing baselines, ProMi achieves the best results across different datasets with significant gains, demonstrating its effectiveness. Furthermore, we present qualitative experiments tailored to real-world mobile robot tasks, demonstrating the applicability of our approach in such scenarios. Our code: https://github.com/ThalesGroup/promi.

Abstract:
Informative path planning (IPP) applied to bathy-metric mapping allows AUVs to focus on feature-rich areas to quickly reduce uncertainty and increase mapping efficiency. Existing methods based on Bayesian optimization (BO) over Gaussian Process (GP) maps work well on small scenarios but they are short-sighted and computationally heavy when mapping larger areas, hindering deployment in real applications. To overcome this, we present a 2-layered BO IPP method that performs non-myopic, online planning in a tree search fashion over large Stochastic Variational GP maps, while respecting the AUV dynamical constraints and accounting for localization uncertainty. Our framework outperforms the standard industrial lawn-mowing pattern and a myopic baseline in a set of hardware in the loop (HIL) experiments in an embedded platform over real bathymetry areas.

Abstract:
The deployment of robotic systems in real world environments requires the ability to quickly produce paths through cluttered, non-convex spaces. These planned trajectories must be both kinematically feasible (i.e., collision free) and dynamically feasible (i.e., satisfy the underlying system dynamics), necessitating a consideration of both the free space and the dynamics of the robot in the path planning phase. In this work, we explore the application of reachable Bézier polytopes as an efficient tool for generating trajectories satisfying both kinematic and dynamic requirements. Furthermore, we demonstrate that by offloading specific computation tasks to the GPU, such an algorithm can meet tight real time requirements. We propose a layered control architecture that efficiently produces collision free and dynamically feasible paths for nonlinear control systems, and demonstrate the framework on the tasks of 3D hopping in a cluttered environment.

Abstract:
Incorporating language comprehension into robotic operations unlocks significant advancements in robotics, but also presents distinct challenges, particularly in executing spatially oriented tasks like pattern formation. This paper introduces ZeroCAP, a novel system that integrates large language models with multi-robot systems for zero-shot context aware pattern formation. Grounded in the principles of language-conditioned robotics, ZeroCAP leverages the interpretative power of language models to translate natural language instructions into actionable robotic configurations. This approach combines the synergy of vision-language models, cutting-edge segmentation techniques and shape descriptors, enabling the realization of complex, context-driven pattern formations in the realm of multi robot coordination. Through extensive experiments, we demonstrate the systems proficiency in executing complex context aware pattern formations across a spectrum of tasks, from surrounding and caging objects to infilling regions. This not only validates the system's capability to interpret and implement intricate context-driven tasks but also underscores its adaptability and effectiveness across varied environments and scenarios. The experimental videos and additional information about this work can be found at https://sites.google.com/view/zerocap/home.

Abstract:
Camera-to-robot calibration is crucial for visionbased robot control and requires effort to make it accurate. Recent advancements in markerless pose estimation methods have eliminated the need for time-consuming physical setups for camera-to-robot calibration. While the existing markerless pose estimation methods have demonstrated impressive accuracy without the need for cumbersome setups, they rely on the assumption that all the robot joints are visible within the camera's field of view. However, in practice, robots usually move in and out of view, and some portion of the robot may stay out-of-frame during the whole manipulation task due to real-world constraints, leading to a lack of sufficient visual features and subsequent failure of these approaches. To address this challenge and enhance the applicability to visionbased robot control, we propose a novel framework capable of estimating the robot pose with partially visible robot manipulators. Our approach leverages the Vision-Language Models for fine-grained robot components detection, and integrates it into a keypoint-based pose estimation network, which enables more robust performance in varied operational conditions. The framework is evaluated on both public robot datasets and self-collected partial-view datasets to demonstrate our robustness and generalizability. As a result, this method is effective for robot pose estimation in a wider range of realworld manipulation scenarios.

Abstract:
Accurate transformation estimation between camera space and robot space is essential. Traditional methods using markers for hand-eye calibration require offline image collection, limiting their suitability for online self-calibration. Recent learning-based robot pose estimation methods, while advancing online calibration, struggle with cross-robot generalization and require the robot to be fully visible. This work proposes a Foundation feature-driven online End-Effector Pose Estimation (FEEPE) algorithm, characterized by its training-free and cross end-effector generalization capabilities. Inspired by the zero-shot generalization capabilities of foundation models, FEEPE leverages pre-trained visual features to estimate 2D-3D correspondences derived from the CAD model and target image, enabling 6D pose estimation via the PnP algorithm. To resolve ambiguities from partial observations and symmetry, a multi-historical key frame enhanced pose optimization algorithm is introduced, utilizing temporal information for improved accuracy. Compared to traditional hand-eye calibration, FEEPE enables marker-free online calibration. Unlike robot pose estimation, it generalizes across robots and end-effectors in a training-free manner. Extensive experiments demonstrate its superior flexibility, generalization, and performance. Additional demon-strations are available at https://feepose.github.io/

Abstract:
Over the past decades, the development of tactile sensors has gained increasing attention and has gradually become a fundamental device for robots. Especially in today's context where human-robot interaction demands are growing and the requirements for tactile perception are becoming stricter, how to enable robots to better perceive their environment has become a topic worth discussing. Tactile sensors, after years of development, have emerged in two main types: taxel-based and vision-based sensors, where the latter can provide relatively low resolution (LR) tactile patterns compared with the former. Both of them have seen significant enhancements in their tactile perception capabilities on flat and regular surfaces. However, as application scenarios expand, current flat tactile perception can no longer meet the robots' needs for multi-dimensional and complex perception capabilities. Therefore, we investigate the high-resolution (HR) reconstruction of non-planar tactile patterns captured by LR taxel-based sensors in this paper. We first develop a new dataset, where the ground truth of non-planar tactile patterns are obtained with a vision-based GelSight Mini tactile sensor, and the LR data are collected via a commercial taxel-based Xela sensor. In addition, we propose to adapt the state-of-the-art CNN- and GAN-based tactile super-resolution model of flat/planar surfaces to the non-planar scenario, and also develop a diffusion-based model for the nonplanar HR reconstruction. Experimental results confirm the efficiency of the proposed models.

Abstract:
We present a novel autonomous robot navigation algorithm for outdoor environments that is capable of handling diverse terrain traversability conditions. Our approach, VLM-GroNav, uses vision-language models (VLMs) and integrates them with physical grounding that is used to assess intrinsic terrain properties such as deformability and slipperiness. We use proprioceptive-based sensing, which provides direct measurements of these physical properties, and enhances the overall semantic understanding of the terrains. Our formulation uses in-context learning to ground the VLM's semantic understanding with proprioceptive data to allow dynamic updates of traversability estimates based on the robot's real-time physical interactions with the environment. We use the updated traversability estimations to inform both the local and global planners for real-time trajectory replanning. We validate our method on a legged robot (Ghost Vision 60) and a wheeled robot (Clearpath Husky), in diverse real-world outdoor environments with different deformable and slippery terrains. In practice, we observe significant improvements over state-of-the-art methods by up to 50% increase in navigation success rate.

Abstract:
Adept traffic models are critical to both real-time prediction/planning and closed-loop simulation for autonomous vehicles (AV). Key design objectives include accuracy, diverse multimodal behaviors, interpretability, and compatibility with other modules in the autonomy stack, e.g., the downstream planner. We present Categorical Traffic Transformer (CTT), a traffic model that outputs both continuous trajectory predictions and categorical predictions with clear semantic meanings (lane modes, homotopies, etc.). The most outstanding feature of CTT is its fully interpretable latent space, which enables direct supervision of the latent variables from the ground truth during training and avoids mode collapse completely. As a result, CTT can generate diverse behaviors conditioned on different semantic modes while significantly beating SOTA on prediction accuracy. In addition, CTT's ability to input and output tokens enables direct integration with semantic-heavy modules such as behavior planners and language models, bridging the tokenized representation and the continuous trajectory space.

Abstract:
A significant portion of roads, particularly in densely populated developing countries, lacks explicitly defined right-of-way rules. These understructured roads pose substantial challenges for autonomous vehicle motion planning, where efficient and safe navigation relies on understanding decentralized human coordination for collision avoidance. This coordination, often termed “social driving etiquette,” remains underexplored due to limited open-source empirical data and suitable modeling frameworks. In this paper, we present a novel dataset and modeling framework designed to study motion planning in these understructured environments. The dataset includes 20 aerial videos of representative scenarios, an image dataset for training vehicle detection models, and a development kit for vehicle trajectory estimation. We demonstrate that a consensus-based modeling approach can effectively explain the emergence of priority orders observed in our dataset, and is therefore a viable framework for decentralized collision avoidance planning.

Abstract:
With the rapid development of rehabilitation robotics, there is a pressing need for efficient and accurate gait prediction methods. However, due to the complexity and variability of individual gait characteristics and external disturbances, accurately predicting gait in real time remains a significant challenge. This paper proposes an innovative Bayesian-inference-based method for real-time gait prediction while a subject walks with a lower-limb exoskeleton. Periodic gait information is represented using von Mises basis functions, and the weight parameters serve as real-time updated state variables. The error-subspace transform Kalman filter (ESTKF) is applied for gait trajectory prediction. A fully connected neural network (FCNN) is used to estimate the walking speeds in real time based on predicted trajectories. Comparative experiments based on an open-source database prove the advantages of ESKTF compared with other Bayesian filters. Walking experiments are conducted to estimate phase and speed in real time, and to predict the joint angle, total joint torque, and lower-limb muscle surface electromyography (sEMG) values. Experimental results validate the method's prediction performance across different speeds and demonstrate its resilience to external interference.

Abstract:
This paper presents DENSER, a framework leveraging 3D Gaussian splatting (3DGS) for the reconstruction of dynamic urban environments. While several methods for photorealistic scene representations, both implicitly using neural radiance fields (NeRF) and explicitly using 3DGS have shown promising results in scene reconstruction of relatively complex dynamic scenes, modeling the dynamic appearance of foreground objects tends to be challenging, limiting the applicability of these methods to capture subtleties and details of the scenes, especially for dynamic objects. To this end, we propose DENSER, a framework that significantly enhances the representation of dynamic objects and accurately models the appearance of dynamic objects in the driving scene. Instead of directly using Spherical Harmonics (SH) to model the appearance of dynamic objects, we introduce and integrate a new method aiming at dynamically estimating SH bases using wavelets, resulting in better representation of dynamic objects appearance in both space and time. Besides object appearance, DENSER enhances object shape representation through densification of its point cloud across multiple scene frames, resulting in faster convergence of model training. Extensive evaluations on the KITTI dataset show that the proposed approach outperforms state-of-the-art methods by a wide margin. Source codes and models will be uploaded to this repository https://github.com/sntubix/denser

Abstract:
Perceiving pedestrians in highly crowded urban environments is a difficult long-tail problem for learning-based autonomous perception. Speeding up 3D ground truth generation for such challenging scenes is performance-critical yet very challenging. The difficulties include the sparsity of the captured pedestrian point cloud and a lack of suitable benchmarks for a specific system design study. To tackle the challenges, we first collect a new multi-view LiDAR-camera 3D multiple-object-tracking benchmark of highly crowded pedestrians for in-depth analysis. We then build an offboard auto-labeling system that reconstructs pedestrian trajectories from LiDAR point cloud and multi-view images. To improve the generalization power for crowded scenes and the performance for small objects, we propose to learn high-resolution representations that are density-aware and relationship-aware. Extensive experiments validate that our approach significantly improves the 3D pedestrian tracking performance towards higher auto-labeling efficiency. The code will be publicly available at this HTTP URL11https://github.com/Nicholasli1995/PCP-MV.

Abstract:
Urban winds are a serious hazard for low-altitude autonomous aerial operations in urban airspaces. Previous methods for motion planning in urban winds require global knowledge of the obstacles and flow field and do not lend themselves to real-time application. In this paper, a planning and control framework is proposed for safe and energy-efficient navigation through urban flow fields that strictly relies on onboard sensing. The algorithm incorporates predictions of local wind flow fields into a receding horizon optimal controller, balancing energy consumption with obstacle avoidance on the fly to reach a goal destination. Simulation studies on a procedurally generated urban map with diverse wind conditions demonstrate that the energy-aware motion planner reduces energy consumption by as much as 30% and results in 32% fewer crashes on average compared to the wind-agnostic baseline. Comparisons to a global wind-aware planner indicate only minor trade-offs associated with planning on a local horizon.

Abstract:
This paper presents the use of a simulation environment as an accurate, ethical, and sustainable alternative to testing robotic prototypes in animal models and simplified phantom models, specifically developed for robotic colonoscopy devices inside the human colon. A virtual simulation of the locomotion mechanism of a prototype robotic colonoscope and the colon was created in Ansys, and robot/colon experiments were conducted on different colon surfaces to validate simulation results. The successfully simulated propulsion force generated by the prototype produced an RMSE of 7% when compared at the optimal operating condition of the device, and 25-30% when compared to a full range of device velocities. The larger RMSE is due to physical phenomena that were not present in the simulation due to the constraints applied. The simulation, however, allowed evaluation of difficult quantities to measure in a real world setting such as the normal interaction force between the device and tissue wall, and stress distribution across the locomotion mechanism, as well as a phenomenon of oscillating propulsion force resulting from the device design. This work demonstrates feasibility of using finite element simulation to shape the design and optimization of a robotic colonoscope and understands its interaction with complex human anatomy.

Abstract:
A common robotics sensing problem is to place sensors to robustly monitor a set of assets, where robustness is assured by requiring asset p to be monitored by at least \kappa(p) sen-sors. Given n assets that must be observed by m sensors, each with a disk-shaped sensing region, where should the sensors be placed to minimize the total area observed? We provide and analyze a fast heuristic for this problem. We then use the heuristic to initialize an exact Integer Program-ming solution. Subsequently, we enforce separation constraints between the sensors by modifying the integer program formulation and by changing the disk candidate set.

Abstract:
This study introduces a novel approach to swarm leader identification (SLI) in multi-agent robot systems by employing a physical adversary interacting with the swarm in the same environment. We develop a new simulation environment to study the SLI problem and train an adversary, which we term the prober, to solve the SLI problem using forceful interactions with the swarm as its guiding information source. The prober's policy is modeled using the simplified structure state space sequence (S5) model and trained with the Proximal Policy Optimization (PPO) algorithm. The prober only has access to the information on the relative positions of the other agents. We evaluate our approach through extensive simulations using two performance metrics and validate the sim-to-real transfer through robot experiments. Results on evaluating the performance in 10,000 different testing scenarios demonstrate that our method finds the leader's identity in the vast majority (95.7%) of the cases, regardless of the initial leader selection during training. The proposed system represents the first instance of learning-based automatic identification of leader agents in a swarm. This capability is crucial for enabling efficient and robust human-swarm interaction, understanding artificial swarm behaviors, and analyzing latent behaviors in biological swarms in nature.

Affiliations: College of Intelligence Science and Technology, National University of Defense Technology, China; State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science, Center for Intelligent Autonomous Systems, College of Electronics & Information Engineering, Shanghai Institute of Intelligent Science and Technology, Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, P. R. China

Abstract:
While millimeter-wave radars are widely used in robotics and autonomous driving, extrinsic calibration with other sensors remains challenging due to the sparsity and uncertainty of radar point clouds. In this paper, we propose a novel deep feature-matching-based online extrinsic calibration approach for a 4D millimeter-wave radar and 3D LiDAR system. We formulate the calibration problem as a crossmodal point cloud registration task, initiating with keypointlevel matching followed by dense matching refinement. Efficient yet powerful neural networks are employed to extract prior keypoint matches, which are then expanded to surrounding regions, establishing dense point correspondences. Our approach effectively leverages the majority of the information from millimeter-wave radar, mitigating the impact of radar point cloud sparsity. We evaluate our approach on two datasets, and experimental results demonstrate that it outperforms state-of-the-art baseline methods and achieves an average improvement of 66.96% in calibration success rate, while reducing translational error and rotational error by 23.84% and 30.31%, respectively. Our implementation will be made open-source at https://github.com/nubot-nudt/RLCNet.

Abstract:
We propose a novel motion-based tracker specifically designed for tracking multiple people in low frame rate scenarios. While previous studies have predominantly focused on scenarios with high frame rates (exceeding 10 frames per second), tracking in low frame rate conditions is significant for robotic platforms with limited computational resources. Our tracker optimizes the cost function, cascade structure and Kalman filter correction to better adapt to the characteristics of low frame rate environments. First, we enhance the cost function by incorporating stable variables through the introduction of height-based and displacement-based cost terms. Second, we prioritize handling occlusion among individuals during association, which reduces ambiguity in subsequent tracking processes. Third, we utilize the error-compensated detection to correct the Kalman filter, thereby improving tracking accuracy. Experimental results demonstrate that our proposed tracker, LoFSORT, outperforms other motion model-based trackers across various frame rate scenarios. Ablation studies further confirm that each component of our tracker enhances tracking performance in low frame rate scenarios.

Abstract:
3D multi-object tracking (3D MOT) is a key area in the field of autonomous driving. In systems that track by detection, the detection results of deep learning models will inevitably have FP(False Positives) and FN(False Nagatives), and detector always cannot continuously and accurately detect targets when facing obstacle occlusion and sensor blind spots. The task of 3D-MOT is to combine the discrete and disordered target detection results in time sequence into continuous and reliable tracks for use by downstream planning modules. At present, multi-target tracking algorithms in the field of autonomous driving are all based on single-hypothesis. In crowded scenarios, both false negatives (FN) and false positives (FP) significantly increase, making it difficult for single-hypothesis-based tracking algorithms to accurately output tracks. Towards this end, we propose LMH-MOT, a light multiple hypothesis framework for 3D MOT. Specifically, LMH-MOT effectively handles complex data association problems in autonomous driving scenarios by generating and maintaining multiple sets of hypotheses. Recognizing the possibility of switching between different motion states of the object, we use multiple motion models to more accurately estimate the motion state of the same object at the same time, and select the best estimation result for output. Additionally, we introduce a data association method based on decision trees, making full use of various features of the track and greatly reducing false matches and missing matches. In order to ensure the real-time performance of the entire algorithm framework, we also use gibbs sampling to significantly reduce the calculation time. On the NuScenes dataset, our proposed method achieves state-of-the-art performance with 76.2% AMOTA.

Abstract:
Safety is a critical concern in learning-enabled autonomous systems especially when deploying these systems in real-world scenarios. An important challenge is accurately quantifying the uncertainty of unknown models to generate provably safe control policies that facilitate the gathering of informative data, thereby achieving both safe and optimal policies. Additionally, the selection of the data-driven model can significantly impact both the real-time implementation and the uncertainty quantification process. In this paper, we propose a provably sample efficient episodic safe learning framework that remains robust across various model choices with quantified uncertainty for online control tasks. Specifically, we first employ Quadrature Fourier Features (QFF) for kernel function approximation of Gaussian Processes (GPs) to enable efficient approximation of unknown dynamics. Then the Adaptive Conformal Prediction (ACP) is used to quantify the uncertainty from online observations and combined with the Control Barrier Functions (CBF) to characterize the uncertainty-aware safe control constraints under learned dynamics. Finally, an optimism-based exploration strategy is integrated with ACP-based CBFs for safe exploration and near-optimal safe nonlinear control. Theoretical proofs and simulation results are provided to demonstrate the effectiveness and efficiency of the proposed framework.

Abstract:
With the rapid development of wearable technology, devices like smartphones, smartwatches, and headphones equipped with IMUs have become essential for applications such as pedestrian positioning. However, traditional pedestrian dead reckoning (PDR) methods struggle with diverse motion patterns, while recent data-driven approaches, though improving accuracy, often lack robustness due to reliance on a single device. In our work, we attempt to enhance the positioning performance using the low-cost commodity IMUs embedded in the wearable devices. We propose a multi-device deep learning framework named Suite-IN, aggregating motion data from Apple Suite for inertial navigation. Motion data captured by sensors on different body parts contains both local and global motion information, making it essential to reduce the negative effects of localized movements and extract global motion representations from multiple devices. Our model innovatively introduces a contrastive learning module to disentangle motionshared and motion-private latent representations, enhancing positioning accuracy. We validate our method on a self-collected dataset consisting of Apple Suite: iPhone, Apple Watch and Airpods, which supports a variety of movement patterns and flexible device configurations. Experimental results demonstrate that our approach outperforms state-of-the-art models while maintaining robustness across diverse sensor configurations.

Abstract:
Radar ensures robust sensing capabilities in adverse weather conditions, yet challenges remain due to its high inherent noise level. Existing radar odometry has overcome these challenges with strategies such as filtering spurious points, exploiting Doppler velocity, or integrating with inertial measurements. This paper presents two novel improvements beyond the existing radar-inertial odometry: ground-optimized noise filtering and continuous velocity preintegration. Despite the widespread use of ground planes in LiDAR odometry, imprecise ground point distributions of radar measurements cause naive plane fitting to fail. Unlike plane fitting in LiDAR, we introduce a zone-based uncertainty-aware ground modeling specifically designed for radar. Secondly, we note that radar velocity measurements can be better combined with IMU for a more accurate preintegration in radar-inertial odometry. Existing methods often ignore temporal discrepancies between radar and IMU by simplifying the complexities of asynchronous data streams with discretized propagation models. Tackling this issue, we leverage GP and formulate a continuous preintegration method for tightly integrating 3-DOF linear velocity with IMU, facilitating full 6-DOF motion directly from the raw measurements. Our approach demonstrates remarkable performance (less than 1 % vertical drift) in public datasets with meticulous conditions, illustrating substantial improvement in elevation accuracy. The code will be released as open source for the community: https://github.com/wooseongY/Go-RIO.

Abstract:
Environmental factors like weather and road conditions significantly impact object recognition in autonomous vehicles. While cameras provide rich semantic information, their reliance on electromagnetic waves makes them vulnerable to performance degradation in adverse conditions such as low light and rain. In contrast, ultrasonic sensors offer reliable short-range detection, unaffected by such conditions. We introduce Bat-VUFN, a bio-inspired multi-sensory system that merges camera and ultrasonic data using an Input Quality Score (IQS)-based fusion technique to enhance near-field perception in challenging environments. Bat-VUFN dynamically adjusts sensor contributions based on prevailing conditions, achieving impressive results on the K-Bat dataset (average precision: 0.95, MAE: 0.52m, RMSE: 0.55m), demonstrating its robustness in adverse scenarios.

Abstract:
Motion planning with simple objectives, such as collision-avoidance and goal-reaching, can be solved efficiently using modern planners. However, the complexity of the allowed tasks for these planners is limited. On the other hand, signal temporal logic (STL) can specify complex requirements, but STL-based motion planning and control algorithms often face scalability issues, especially in large multi-robot systems with complex dynamics. In this paper, we propose an algorithm that leverages the best of the two worlds. We first use a single-robot motion planner to efficiently generate a set of alternative reference paths for each robot. Then coordination requirements are specified using STL, which is defined over the assignment of paths and robots' progress along those paths. We use a Mixed Integer Linear Program (MILP) to compute task assignments and robot progress targets over time such that the STL specification is satisfied. Finally, a local controller is used to track the target progress. Simulations demonstrate that our method can handle tasks with complex constraints and scales to large multi-robot teams and intricate task allocation scenarios.

Abstract:
This work studies the planning problem for robotic systems under both quantifiable and unquantifiable uncertainty. The objective is to enable the robotic systems to optimally fulfill high-level tasks specified by Linear Temporal Logic (LTL) formulas. To capture both types of uncertainty in a unified modelling framework, we utilise Markov Decision Processes with Set-valued Transitions (MDPSTs). We introduce a novel solution technique for optimal robust strategy synthesis of MDPSTs with LTL specifications. To improve efficiency, our work leverages limit-deterministic Büchi automata (LDBAs) as the automaton representation for LTL to take advantage of their efficient constructions. To tackle the inherent nondeterminism in MDPSTs, which presents a significant challenge for reducing the LTL planning problem to a reachability problem, we introduce the concept of a Winning Region (WR) for MDPSTs. Additionally, we propose an algorithm for computing the WR over the product of the MDPST and the LDBA. Finally, a robust value iteration algorithm is invoked to solve the reachability problem. We validate the effectiveness of our approach through a case study involving a mobile robot operating in the hexagonal world, demonstrating promising efficiency gains.

Abstract:
Telemedicine is promising in digital healthcare management, such as supporting the coronavirus disease 2019 (COVID-19) pandemic. Three-dimensional (3D) ultrasound reconstruction and new view image synthesis, which can assist in diagnosis and reexamine, have significant potential in teleultrasound, especially integrating robotic ultrasound systems (RUSS). Neural Radiance Field (NeRF), an impressive reconstruction method, requires long training times, limiting its practicality in ultrasound. Despite NeRF variants achieving faster optimization, their performance remains confined to natural scene reconstructions. To address this limitation, we propose HFUS-NeRF, a hybrid representation method designed for fast and accurate ultrasound reconstruction. HFUS-NeRF integrates multi-resolution hash-grid and tri-plane representations to represent each sampling point of the ultrasonic wave. A unified model for sampling points from different ultrasonic probes is presented to simulate the wave's propagation through tissues, and the final ultrasound image is rendered using volume rendering. Compared with NeRF-based ultrasound reconstruction, both the hash grid and triplane resolutions can be scaled up more efficiently, improving reconstruction speed. Experimental results demonstrate that HFUS-NeRF enhances reconstruction quality while significantly reducing reconstruction time to mere minutes. Furthermore, we validated HFUS-NeRF's adaptability by reconstruction using images from different types of ultrasound probes, and real-world experiments confirmed its feasibility and transferability, enabling fast ultrasound reconstruction on human subjects.

Abstract:
This paper presents a learned model to predict the robot-centric velocity of an underwater robot through dynamics-aware proprioception. The method exploits a recurrent neural network using as inputs inertial cues, motor commands, and battery voltage readings alongside the hidden state of the previous time-step to output robust velocity estimates and their associated uncertainty. An ensemble of networks is utilized to enhance the velocity and uncertainty predictions. Fusing the network's outputs into an Extended Kalman Filter, alongside inertial predictions and barometer updates, the method enables long-term underwater odometry without further exteroception. Furthermore, when integrated into visual-inertial odometry, the method assists in enhanced estimation resilience when dealing with an order of magnitude fewer total features tracked (as few as 1) as compared to conventional visual-inertial systems. Tested onboard an underwater robot deployed both in a laboratory pool and the Trondheim Fjord, the method takes less than 5 ms for inference either on the CPU or the GPU of an NVIDIA Orin AGX and demonstrates less than 4% relative position error in novel trajectories during complete visual blackout, and approximately 2% relative error when a maximum of 2 visual features from a monocular camera are available.

Abstract:
Human-in-the-loop reinforcement learning integrates human expertise to accelerate agent learning and provide critical guidance and feedback in complex fields. However, many existing approaches focus on single-agent tasks and require continuous human involvement during the training process, significantly increasing the human workload and limiting scalability. In this paper, we propose HARP (HumanAssisted Regrouping with Permutation Invariant Critic), a multi-agent reinforcement learning framework designed for group-oriented tasks. HARP integrates automatic agent regrouping with strategic human assistance during deployment, enabling and allowing non-experts to offer effective guidance with minimal intervention. During training, agents dynamically adjust their groupings to optimize collaborative task completion. When deployed, they actively seek human assistance and utilize the Permutation Invariant Group Critic to evaluate and refine human-proposed groupings, allowing non-expert users to contribute valuable suggestions. In multiple collaboration scenarios, our approach is able to leverage limited guidance from non-experts and enhance performance. The project can be found at https://github.com/huawen-hu/HARP.

Abstract:
After-action reviews (AARs) are professional discussions that help operators and teams enhance their task performance by analyzing completed missions with peers and professionals. Previous studies comparing different formats of AARs have focused mainly on human teams. However, the inclusion of robotic teammates brings along new challenges in understanding teammate intent and communication. Traditional AAR between human teammates may not be satisfactory for human-robot teams. To address this limitation, we propose a new training review (TR) tool, called the Virtual Spectator Interface (VSI), to enhance human-robot team performance and situational awareness (SA) in a simulated search mission. The proposed VSI primarily utilizes visual feedback to review subjects' behavior. To examine the effectiveness of VSI, we took elements from AAR to conduct our own TR, and designed a 1 × 3 between-subjects experiment with experimental conditions: TR with (1) VSI, (2) screen recording, and (3) non-technology (only verbal descriptions). The results of our experiments demonstrated that the VSI did not result in significantly better team performance than other conditions. However, the TR with VSI led to more improvement in the subjects' SA over the other conditions.

Abstract:
Explainability is vital for establishing user trust, also in robotics. Recently, foundation models (e.g. vision-language models, VLMs) fostered a wave of embodied agents that answer arbitrary queries about their environment and their interactions with it. However, naively prompting VLMs to answer queries based on camera images does not take into account existing robot architectures which represent the robot's tasks, skills, and beliefs about the state of the world. To overcome this limitation, we propose RACCOON, a framework that combines foundation models' responses with a robot's internal knowledge. Inspired by Retrieval-Augmented Generation (RAG), RACCOON selects relevant context, retrieves information from the robot's state, and utilizes it to refine prompts for an LLM to answer questions accurately. This bridges the gap between the model's adaptability and the robot's domain expertise.

Abstract:
When operating in human environments, robots need to handle complex tasks while both adhering to social norms and accommodating individual preferences. For instance, based on common sense knowledge, a household robot can pre-dict that it should avoid vacuuming during a social gathering, but it may still be uncertain whether it should vacuum before or after having guests. In such cases, integrating common-sense knowledge with human preferences, often conveyed through human explanations, is fundamental yet a challenge for existing systems. In this paper, we introduce GRACE, a novel approach addressing this while generating socially appropriate robot actions. GRACE leverages common sense knowledge from LLMs, and it integrates this knowledge with human explanations through a generative network. The bidirectional structure of GRACE enables robots to refine and enhance LLM predictions by utilizing human explanations and makes robots capable of generating such explanations for human-specified actions. Our evaluations show that integrating human explanations boosts GRACE's performance, where it outperforms several baselines and provides sensible explanations.

Affiliations: Innovative Center for Flexible Devices (iFLEX), Max Planck-NTU Joint Laboratory for Artificial Senses, School of Materials Science, Nanyang Technological University, Singapore; College of Computing and Data Science, Nanyang Technological University, Singapore; Robot Navigation and Perception Lab, AASS Research Centre, Örebro University, Sweden; Chair: Perception for Intelligent Systems, MIRMI, Technical University of Munich, Germany; Agency for Science, Technology and Research, Institute of Materials Research and Engineering, Singapore

Abstract:
Gas source localization in complex environments is critical for applications such as environmental monitoring, industrial safety, and disaster response. Traditional methods often struggle with the challenges posed by a lack of environmental topography integration, especially when interactions between wind and obstacles distort gas dispersion patterns. In this paper, we propose a deep learning-based approach, which leverages spatial context and environmental mapping to enhance gas source localization. By integrating Simultaneous Localization and Mapping (SLAM) with a U-Net-based model, our method predicts the likelihood of gas source locations by analyzing gas sensor data, wind flow, and topography of the environment represented by a 2D occupancy map. We demonstrate the efficacy of our approach using a wheeled robot equipped with a photoionization detector, a LIDAR, and an anemometer, in various scenarios with dynamic wind fields and multiple obstacles. The results show that our approach can robustly locate gas sources, even in challenging environments with fluctuating wind directions, outperforming conventional methods by utilizing topography contextual information. This study underscores the importance of topographical context in gas source localization and offers a flexible and robust solution for real-world applications. Data and code are publicly available.

Abstract:
Robot picking and packing tasks require dexterous manipulation skills, such as rearranging objects to establish a good grasping pose, or placing and pushing items to achieve tight packing. These tasks are challenging for robots due to the complexity and variability of the required actions. To tackle the difficulty of learning and executing long-horizon tasks, we propose a novel framework called the Multi-Head Skill Transformer (MuST). This model is designed to learn and sequentially chain together multiple motion primitives (skills), enabling robots to perform complex sequences of actions effectively. MuST introduces a “progress value” for each skill, guiding the robot on which skill to execute next and ensuring smooth transitions between skills. Additionally, our model is capable of expanding its skill set and managing various sequences of sub-tasks efficiently. Extensive experiments in both simulated and real-world environments demonstrate that MuST significantly enhances the robot's ability to perform long-horizon dexterous manipulation tasks.

Affiliations: The Canadian Surgical Technologies and Advanced Robotics (CSTAR), London Health Sciences Centre (LHSC), School of Electrical and Computer Engineering, Western University, London, Ontario, Canada; The Canadian Surgical Technologies and Advanced Robotics (CSTAR), London Health Sciences Centre (LHSC), The School of Biomedical Engineering, Western University, London, Ontario, Canada; Department of Urology, Massachusetts General Hospital, and Harvard Medical School, Boston, MA, USA; Department of Electrical and Computer Engineering, Department of Surgery, Department of Clinical Neurological Sciences, CSTAR, School of Biomedical Engineering, Western University; The Department of Radiology at Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA

Abstract:
In this paper, we validate the effectiveness of the optimal planning algorithms we have developed for devising surgical plans for Percutaneous Nephrolithotomy (PCNL) using patient-specific Concentric-Tube Robots (CTRs). To do so, we built a life-sized phantom model of the right hemithorax, replicating the anatomy of a patient who suffered from kidney stone and underwent conventional PCNL. Two-dimensional CT scans of the phantom model and its 3D reconstruction enabled the creation of a surgical plan using our planning algorithms based on a puncture into the mid-pole of the kidney. This was compared with two other percutaneous tracts involving punctures into the lower and upper calyces for comparison. The optimal mid-pole plan achieved 84% stone coverage, significantly outperforming the lower pole (58 %) and upper pole (45 %) plans. These results validate the effectiveness of the algorithms and align with simulation-based findings from previous studies, which reported an average volume coverage of 81.6±19.6 % in clinical cases.

Abstract:
Laser-based surgical ablation relies heavily on surgeon involvement, restricting precision to the limits of human error and perception. The interaction between laser and tissue is governed by various laser parameters that control the laser irradiance on the tissue, including the power, distance, spot size, orientation, and exposure time. This complex interaction lends itself to robotic automation, allowing the surgeon to focus on high-level tasks, such as choosing the region and method of ablation, while the lower-level ablation plan can be handled autonomously. This paper describes a sampling-based model predictive control (MPC) scheme to plan ablation sequences for arbitrary tissue volumes. Using a steady-state point ablation model to simulate a single laser-tissue interaction, a random search technique explores the reachable state space while preserving sensitive tissue regions. The sampled MPC strategy provides an ablation sequence that accounts for parameter uncertainty without violating constraints, such as avoiding nerve bundles.

Abstract:
This paper introduces a novel framework for improving human-to-robot manipulability transfer and tracking in Learning by Demonstration. Our approach addresses key challenges, including manipulability ellipsoid (ME) domain adaptation between different kinematic structures, ME-IK feasibility checks and optimization across trajectories accounting for the robot's redundancy, and introducing a manipulability-aware control strategy. Leveraging a unified quadratic programming control with vector-field inequalities, our method enables robust tracking and optimization of manipulability, accommodating multiple demonstrations and the inherent variability in task execution. Experimental results demonstrate superior performance in precise tracking and force generation compared to traditional methods, highlighting the advantages of incorporating human implicit information for more effective robot control.

Abstract:
Preference-based reinforcement learning (PbRL) is a suitable approach for style adaptation of pre-trained robotic behavior: adapting the robot's policy to follow human user preferences while still being able to perform the original task. However, collecting preferences for the adaptation process in robotics is often challenging and time-consuming. In this work we explore the adaptation of pre-trained robots in the low-preference-data regime. We show that, in this regime, recent adaptation approaches suffer from catastrophic reward forgetting (CRF), where the updated reward model overfits to the new preferences, leading the agent to become unable to perform the original task. To mitigate CRF, we propose to enhance the original reward model with a small number of parameters (low-rank matrices) responsible for modeling the preference adaptation. Our evaluation shows that our method can efficiently and effectively adjust robotic behavior to human preferences across simulation benchmark tasks and multiple real-world robotic tasks. We provide videos of our results and source code at https://sites.google.com/view/preflora/.

Abstract:
Robotic picking and placing played an essential role in Industrial 4.0 and have long been recognized as significant contributions to industrial processes. Various scenarios involve picking and placing parts for assembly in industrial production, such as assembling different electronic components in the manufacturing process. Those tasks require high precision to complete. However, achieving high precision in the assembly of CPUs poses a significant challenge, particularly when dealing with reflective surfaces. This paper presents a strategic system design tailored to address these challenges effectively. We focus on system device choice and optimizing the key parameters of the sensor system to strike a balance between device cost and the required precision. We use methods to construct the whole robot manipulation system, such as geometric segmentation, binocular vision with structure light projection and, based on 3D information, 6D pose estimation to construct the system. The results of our study demonstrate the practical applicability and benefits of this strategic system design in industrial settings. By meeting strict system accuracy requirements, our approach contributes to advancing industry practices and growing its impact on society.

Abstract:
Terrestrial and aerial bimodal vehicles have gained significant interest due to their energy efficiency and versatile maneuverability across different domains. However, most existing passive-wheeled bimodal vehicles rely on attitude regulation to generate forward thrust, which inevitably results in energy waste on producing lifting force. In this work, we propose a novel passive-wheeled bimodal vehicle called TrofyBot that can rapidly change the thrust direction with a single servo motor and a transformable parallelogram linkage mechanism (TPLM). Cooperating with a bidirectional force generation module (BFGM) for motors to produce bidirectional thrust, the robot achieves flexible mobility as a differential driven rover on the ground. This design achieves 95.37% energy saving efficiency in terrestrial locomotion, allowing the robot continuously move on the ground for more than two hours in current setup. Furthermore, the design obviates the need for attitude regulation and therefore provides a stable sensor field of view (FoV). We model the bimodal dynamics for the system, analyze its differential flatness property, and design a controller based on hybrid model predictive control for trajectory tracking. A prototype is built and extensive experiments are conducted to verify the design and the proposed controller, which achieves high energy efficiency and seamless transition between modes.

Abstract:
This paper presents a novel actuator system combining a twisted string actuator (TSA) with a winch mechanism. Relative to traditional hydraulic and pneumatic systems in robotics, TSAs are compact and lightweight but face limitations in stroke length and force-transmission ratios. Our integrated TSA-winch system overcomes these constraints by providing variable transmission ratios through dynamic adjustment. It increases actuator stroke by winching instead of overtwisting, and it improves force output by twisting. The design features a rotating turret that houses a winch, which is mounted on a bevel gear assembly driven by a through-hole drive shaft. Mathematical models are developed for the combined displacement and velocity control of this system. Experimental validation demonstrates the actuator's ability to achieve a wide range of transmission ratios and precise movement control. We present performance data on movement precision and generated forces, discussing the results in the context of existing literature. This research contributes to the development of more versatile and efficient actuation systems for advanced robotic applications and improved automation solutions.

Abstract:
This study examines a novel setup of a micropositioning trajectory manipulator in X \theta, energized by a reluctance actuator (RA) and two accompanying moving magnet actuators (MMA). The design is characterized by a C -core RA, which features asymmetrical air gaps between the mover and the stator elements when under angular \theta rotation. When the stator coil is energized, a magnetic flux induces a force in the mover. Two MMAs can add force and torque dynamics to the system via solenoid and permanent magnet (PM) pairs to offer additional corrective actions. Facilitating control of a translational (x) and rotational (\theta) two-degree-of-freedom (2DOF) actuation system. Flexure hinges aid in the retraction force of the mover element and provide needed stiffness to the system without frictional effects. This was modeled analytically and optimized to achieve outlined performance objectives. The system was validated experimentally through triangle, and sinusoidal trajectories in open loop control. The most relevant application is scanning mirror systems where specific targeted rotational and translational trajectories can benefit light beam positioning. This system allows both translation and rotation specifications of a selected trajectory to be realized in one actuation unit, opening up more design possibilities for controlling precision positioning systems.

Abstract:
In reinforcement learning for legged robot locomotion, crafting effective reward strategies is crucial. Predefined gait patterns and complex reward systems are widely used to stabilize policy training. Drawing from the natural locomotion behaviors of humans and animals, which adapt their gaits to minimize energy consumption, we investigate the impact of incorporating an energy-efficient reward term that prioritizes distance-averaged energy consumption into the reinforcement learning framework. Our findings demonstrate that this simple addition enables quadruped robots to autonomously select appropriate gaits-such as four-beat walking at lower speeds and trotting at higher speeds-without the need for explicit gait regularizations. Furthermore, we provide a guideline for tuning the weight of this energy-efficient reward, facilitating its application in real-world scenarios. The effectiveness of our approach is validated through simulations and on a real Unitree Gol robot. This research highlights the potential of energy-centric reward functions to simplify and enhance the learning of adaptive and efficient locomotion in quadruped robots. Videos and more details are at https://sites.google.com/berkeley.edu/efficient-locomotion

Abstract:
Variable stiffness of a remote robot is crucial for a teleoperation system to deal with challenging tasks. External stiffness command interfaces have emerged as a promising solution to regulating the remote robot stiffness because of the benefits of their accuracy, ergonomics, and avoidance of the “coupling effect” that usually exists in muscle activity-based stiffness interfaces. However, the use of an external stiffness command interface requires good coordination between two limbs of an operator, which take care of the teleoperation task and the stiffness regulation task, respectively, at the same time, which is demanding for novice operators in dynamic situations necessitating agile and timely stiffness adjustments. In this paper, a new concept of Stiffness Regulation Co-pilot was proposed to facilitate the use of these interfaces. A co-pilot is a virtual agent that consists of a Stiffness Regulation Policy, which infers a reasonable stiffness regulation action from the task performance, and a feedback modality, which conveys the suggested stiffness regulation action to the operator. A preliminary user study was conducted to evaluate the efficacy of the co-pilot and the effect of different modalities of the co-pilot. The results showed that the cutaneous feedback or combined with another modality can potentially improve the task performance of the system and reduce the cognitive load of the operator compared to a teleoperation system without using the co-pilot.

Abstract:
While UWB-based methods can achieve high localization accuracy in small-scale areas, their accuracy and reliability are significantly challenged in large-scale environments. In this paper, we propose a learning-based framework named ULOC for Ultra-Wideband (UWB) based localization in such complex, large-scale environments. First, anchors are deployed in the environment without knowledge of their actual position. Then, UWB observations are collected when the vehicle travels in the environment. At the same time, map-consistent pose estimates are developed from registering onboard self-localization data (from VIO, LIO, and other SLAM methods) with the prior map to provide the training labels. We then propose a network based on MAMBA that learns the ranging patterns of UWBs over a complex, large-scale environment. The experiment demonstrates that our solution can ensure high localization accuracy on a large scale compared to the state-of-the-art. We release our source code to benefit the community at https://github.com/brytsknguyen/uloc.

Abstract:
Human brain possesses the ability to effectively focus on important environmental components, which enhances perception, learning, reasoning, and decision-making. Inspired by this cognitive mechanism, we introduced a novel concept termed relevance for Human-Robot Collaboration (HRC). Relevance is a dimensionality reduction process that incorporates a continuously operating perception module, evaluates cue sufficiency within the scene, and applies a flexible formulation and computation framework. In this paper, we present an enhanced two-loop framework that integrates real-time and asynchronous processing to quantify relevance and leverage it for safer and more efficient human-robot collaboration (HRC). The two-loop framework integrates an asynchronous loop, which leverages an LLM's world knowledge to quantify relevance, and a real-time loop, which performs scene understanding, human intent prediction, and decision-making based on relevance. HRC decision-making is enhanced by a relevancebased task allocation method, as well as a motion generation and collision avoidance approach that incorporates human trajectory prediction. Simulations and experiments show that our methodology for relevance quantification can accurately and robustly predict the human objective and relevance, with an average accuracy of up to 0.90 for objective prediction and up to 0.96 for relevance prediction. Moreover, our motion generation methodology reduces collision cases by 63.76% and collision frames by 44.74% when compared with a state-of-theart (SOTA) collision avoidance method. Our framework and methodologies, with relevance, guide the robot on how to best assist humans and generate safer and more efficient actions for HRC.

Abstract:
Noncontact manipulation of soft micro-fibers has great potential in advanced manufacturing, materials science, and biomedical engineering. However, current noncontact manipulation techniques primarily focus on objects with regular shapes, e.g., solid particles, cells, or droplets, with fewer solutions available for manipulating flexible and elongated structures. In this paper, an automated ultrasonic manipulation system is introduced for in-plane soft micro-fiber manipulation, which mainly consists of an ultrasonic transducer array and a microscope. A real-time trap generation algorithm is designed to manipulate the micro-fibers by the visual feedback from microscope. An adequate theoretical analysis is also provided for explanation of the deformation behavior of micro-fiber under external forces. The system is capable of precise in-plane positioning and motion trajectory planning to micro-fiber end, and in-plane morphological reshaping to the micro-fiber. Experiments validated the effectiveness of the proposed system for the in-plane manipulation of soft micro-fibers. Finally, the system was showcased by the practical application of material property characterization.

Abstract:
This paper presents range-based 6-DoF Monte Carlo SLAM with a gradient-guided particle update strategy. While non-parametric state estimation methods, such as particle filters, are robust in situations with high ambiguity, they are known to be unsuitable for high-dimensional problems due to the curse of dimensionality. To address this issue, we propose a particle update strategy that improves the sampling efficiency by using the gradient information of the likelihood function to guide particles toward its mode. Additionally, we introduce a keyframe-based map representation that represents the global map as a set of past frames (i.e., keyframes) to mitigate memory consumption. The keyframe poses for each particle are corrected using a simple loop closure method to maintain trajectory consistency. The combination of gradient information and keyframe-based map representation significantly enhances sampling efficiency and reduces memory usage compared to traditional RBPF approaches. To process a large number of particles (e.g., 100,000 particles) in real-time, the proposed framework is designed to fully exploit GPU parallel processing. Experimental results demonstrate that the proposed method exhibits extreme robustness to state ambiguity and can even deal with kidnapping situations, such as when the sensor moves to different floors via an elevator, with minimal heuristics.

Abstract:
Sim2Real transfer, particularly for manipulation policies relying on RGB images, remains a critical challenge in robotics due to the significant domain shift between syn-thetic and real-world visual data. In this paper, we propose SplatSim, a novel framework that leverages Gaussian Splatting as the primary rendering primitive to reduce the Sim2Real gap for RGB-based manipulation policies. By replacing traditional mesh representations with Gaussian Splats in simulators, SplatSim produces highly photorealistic synthetic data while maintaining the scalability and cost-efficiency of simulation. We demonstrate the effectiveness of our framework by training manipulation policies within SplatSim and deploying them in the real world in a zero-shot manner, achieving an average success rate of 86.25%, compared to 97.5% for policies trained on real-world data. Videos can be found on our project page: https://splatsim.github.io

Abstract:
Vision-and-Language Navigation (VLN) tasks require an agent to follow textual instructions to navigate through 3D environments. Traditional approaches use supervised learning methods, relying heavily on domain-specific datasets to train VLN models. Recent methods try to utilize closedsource large language models (LLMs) like GPT-4 to solve VLN tasks in zero-shot manners, but face challenges related to expensive token costs and potential data breaches in realworld applications. In this work, we introduce Open-Nav, a novel study that explores open-source LLMs for zero-shot VLN in the continuous environment. Open-Nav employs a spatial-temporal chain-of-thought (CoT) reasoning approach to break down tasks into instruction comprehension, progress estimation, and decision-making. It enhances scene perceptions with fine-grained object and spatial knowledge to improve LLM's reasoning in navigation. Our extensive experiments in both simulated and real-world environments demonstrate that Open-Nav achieves competitive performance compared to using closed-source LLMs.

Abstract:
Robotic dexterous grasping is important for interacting with the environment. To unleash the potential of data-driven models for dexterous grasping, a large-scale, highquality dataset is essential. While gradient-based optimization offers a promising way for constructing such datasets, previous works suffer from limitations, such as inefficiency, strong assumptions in the grasp quality energy, or limited object sets for experiments. Moreover, the lack of a standard benchmark for comparing different methods and datasets hinders progress in this field. To address these challenges, we develop a highly efficient synthesis system and a comprehensive benchmark with MuJoCo for dexterous grasping. We formulate grasp synthesis as a bilevel optimization problem, combining a novel lowerlevel quadratic programming (QP) with an upper-level gradient descent process. By leveraging recent advances in CUDAaccelerated robotic libraries and GPU-based QP solvers, our system can parallelize thousands of grasps and synthesize over 49 grasps per second on a single 3090 GPU. Our synthesized grasps for Shadow, Allegro, and Leap hands all achieve a success rate above 75 % in simulation, with a penetration depth under 1 mm, outperforming existing baselines on nearly all metrics. Compared to the previous large-scale dataset, DexGraspNet, our dataset significantly improves the performance of learning models, with a success rate from around 40 % to 80 % in simulation. Real-world testing of the trained model on the Shadow Hand achieves an 81 % success rate across 20 diverse objects. The codes and datasets are released on our project page: https://pku-epic.github.io/BODex.

Abstract:
Recent advancements in visual object tracking have markedly improved the capabilities of unmanned aerial vehicle (UAV) tracking, which is a critical component in real-world robotics applications. While the integration of hierarchical lightweight networks has become a prevalent strategy for enhancing efficiency in UAV tracking, it often results in a significant drop in network capacity, which further exacerbates challenges in UAV scenarios, such as frequent occlusions and extreme changes in viewing angles. To address these issues, we introduce a novel family of UAV trackers, termed CGTrack, which combines explicit and implicit techniques to expand network capacity within a coarse-to-fine framework. Specifically, we first introduce a Hierarchical Feature Cascade (HFC) module that leverages the spirit of feature reuse to increase network capacity by integrating the deep semantic cues with the rich spatial information, incurring minimal computational costs while enhancing feature representation. Based on this, we design a novel Lightweight Gated Center Head (LGCH) that utilizes gating mechanisms to decouple target-oriented coordinates from previously expanded features, which contain dense local discriminative information. Extensive experiments on three challenging UAV tracking benchmarks demonstrate that CGTrack achieves state-of-the-art performance while running fast. Code will be available at https://github.com/Nightwatch-Fox11/CGTrack.

Abstract:
Understanding and anticipating events and actions is critical for intraoperative assistance and decision-making during minimally invasive surgery. We propose a predictive neural network that is capable of understanding and predicting critical interaction aspects of surgical workflow based on endoscopic, intracorporeal video data, while flexibly leveraging surgical knowledge graphs. The approach incorporates a hypergraph-transformer (HGT) structure that encodes expert knowledge into the network design and predicts the hidden embedding of the graph. We verify our approach on established surgical datasets and applications, including the prediction of action-triplets, and the achievement of the Critical View of Safety (CVS), which is a critical safety measure. Moreover, we address specific, safety-related forecasts of surgical processes, such as predicting the clipping of the cystic duct or artery without prior achievement of the CVS. Our results demonstrate improvement in prediction of interactive event when incorporating with our approach compared to unstructured alternatives.

Abstract:
Surgical procedures are inherently complex and dynamic, with intricate dependencies and various execution paths. Accurate identification of the intentions behind critical actions, referred to as Primary Intentions (PIs), is crucial to understanding and planning the procedure. This paper presents a novel framework that advances PI recognition in instructional videos by combining top-down grammatical structure with bottom-up visual cues. The grammatical structure is based on a rich corpus of surgical procedures, offering a hierarchical perspective on surgical activities. A grammar parser, utilizing the surgical activity grammar, processes visual data obtained from laparoscopic images through surgical action detectors, ensuring a more precise interpretation of the visual information. Experimental results on the benchmark dataset demonstrate that our method outperforms existing surgical activity detectors that rely solely on visual features. Our research provides a promising foundation for developing advanced robotic surgical systems with enhanced planning and automation capabilities.

Abstract:
Automation in the apparel and textile industry has long been a pursuit. However, accurately locating the surface of a fabric remains a challenge, limiting the automation in sorting, packaging, and other processes. When humans locate clothing, they rely on contact feedback for the exact position of the clothing surface. As existing contact detection solutions are significantly affected by environmental factors, it is essential to develop a sensor with robust contact detection capabilities. In this work, we introduce a contact sensor with high robustness and high force resolution. This contact sensor detects contact by measuring the deformation of an elastomer using a distancemeasuring module. Based on the deformation characteristics of the elastomer, we designed a detection algorithm that not only reduces the noise of data but also extracts features such as trends and elastomer states, enabling reliable contact detection. Through experiments, we validated that this contact sensor can detect contact forces as low as 0.017 N and is robust to external interference or sensor movement. We also verified that the sensor can process data within 7.5 ms and return contact detection with 95% accuracy. Additionally, we assessed its effectiveness in real fabric contact scenarios.

Abstract:
State estimation in the context of dynamical systems is crucial for various applications, including control and monitoring. Moving Horizon Estimation (MHE) is an optimization-based state estimation algorithm that leverages a known dynamical model integrated over a moving horizon. The MHE optimization criterion corresponds to identify the initial state that best aligns the integrated trajectory with the system observation. In MHE setting, the state estimation performance increases with the considered length of the moving horizon but it can become computationally intensive which is a limiting factor for its applicability to fast-varying dynamical systems or on hardware with restricted computational power. Deep Learning (DL) methods can learn solutions to complex optimization problems without incurring any additional online computational cost beyond the inference of the considered architecture. In the context of state estimation we propose to study different type of DL architecture in order to provide full state estimation from partial and noisy system observations. The novel proposed method is based on an end-to-end differentiable formulation of the MHE optimization problem, enabling the offline training of a DL model to provide a state estimation that minimizes the MHE optimization criterion. Once training is completed, state estimations are generated through an explicit relationship learned by the DL model. The proposed method is compared to the online MHE formulation in various case studies, including scenarios with partially observed state and model discrepancies in the context of lateral vehicle dynamics. The results highlight improved state estimation performance both in terms of reduced computational time and accuracy with respect to the online MHE algorithm.

Abstract:
Automatic Valet Parking (AVP) has garnered significant attention from industry and academia due to its potential to enhance traffic efficiency, parking safety, and user experience. While AVP technologies have been successfully applied in standard parking scenarios with clear markings, real-world parking environments are far more diverse and complex, posing challenges for current systems. To address these limitations, we present Parking-SG, an open-vocabulary hierarchical 3D scene graph representation, facilitating the application of AVP in open and complex environments. Our approach builds an object-based, open-vocabulary map that integrates both ground-level and ground-above objects for comprehensive environmental understanding. Leveraging common sense reasoning and object behavior relationships, various standard or non-standard parking spaces are inferred in open environments. Additionally, we extract and analyze path topology to construct a hierarchical map representation, supporting complex AVP tasks. Parking-SG is validated in both simulated and real-world environments, demonstrating its ability to generate rich environmental representations, accurately and flexibly infer parking spaces, and effectively perform complex AVP tasks.

Abstract:
Due to the critical issues of privacy and partial occlusion, license plate information is not always available in vehicle recognition systems. Consequently, researchers have increasingly turned towards vehicle re-identification (reID) techniques to bridge the gap between cross-view camera systems. Despite the growing interest, one major challenge persists: the scarcity of authentic, large-scale training datasets. To address this challenge, this paper introduces a coarse-to-fine generation pipeline designed to synthesize high-fidelity vehicle data, thereby facilitating subsequent vehicle representation learning. Specifically, the proposed approach consists of three stages: Prompt Processing, Diffusion Fine-tuning, and Semantic Filtering. First, we collect detailed prompts from vehicle websites and companies with fine-grained vehicle prototype attributes. Next, we leverage the prior knowledge of these automotive prototypes to fine-tune diffusion models. Finally, to ensure the quality of the synthesized data, we employ pretrained vision-language models to filter out substandard images. Building upon the high-quality data generated by this pipeline, we validate the effectiveness using vanilla models. Extensive experimental evaluations demonstrate that our approach achieves competitive accuracy on public benchmarks such as VeRi-776, VehicleID and CityFlowV2, and is compatible with various model architectures.

Abstract:
In this paper, we address multi-robot formation planning where nonholonomic robots collaboratively transport objects using a deformable sheet in unstructured, cluttered environments. The formation can expand or contract to adjust the height of the object on the sheet. However, interactions between the robots and sheet introduce complex constraints for formation planning. Complexity increases further when the only feasible solution requires crossing an obstacle, i.e. where robots navigate in different homotopy classes around an obstacle such that the object hovers above it. Most existing nonholonomic formation planners do not admit obstacle crossing, limiting performance. In this paper, we present a two-stage iterative trajectory optimization framework which explicitly considers obstacle crossing. First, we capture the set of all feasible homotopy classes for each robot using a topological probabilistic roadmap. We then iteratively apply numerical optimization techniques to find a safe and feasible solution for the formation. We demonstrate the efficacy of our framework in simulation and on real robot hardware.

Abstract:
Transfer Learning (TL) is a powerful tool that enables robots to transfer learned policies across different environments, tasks, or embodiments. To further facilitate this process, efforts have been made to combine it with Learning from Demonstrations (LfD) for more flexible and efficient policy transfer. However, these approaches are almost exclusively limited to offline demonstrations collected before policy transfer starts, which may suffer from the intrinsic issue of covariance shift brought by LfD and harm the performance of policy transfer. Meanwhile, extensive work in the learning-from-scratch setting has shown that online demonstrations can effectively alleviate covariance shift and lead to better policy performance with improved sample efficiency. This work combines these insights to introduce online demonstrations into a policy transfer setting. We present Policy Transfer with Online Demonstrations, an active LfD algorithm for policy transfer that can optimize the timing and content of queries for online episodic expert demonstrations under a limited demonstration budget. We evaluate our method in eight robotic scenarios, involving policy transfer across diverse environment characteristics, task objectives, and robotic embodiments, with the aim to transfer a trained policy from a source task to a related but different target task. The results show that our method significantly outperforms all baselines in terms of average success rate and sample efficiency, compared to two canonical LfD methods with offline demonstrations and one active LfD method with online demonstrations. Additionally, we conduct preliminary sim-to-real tests of the transferred policy on three transfer scenarios in the real-world environment, demonstrating the policy effectiveness on a real robot manipulator.

Abstract:
In logistics automation, precise segmentation of unseen objects is crucial for efficient robotic manipulation in cluttered environments. Tasks such as bin-picking and shelfpicking require robust perception to handle occlusions, varying object shapes, and complex spatial arrangements. Traditional RGB-based methods tend to over-segment objects due to their reliance on texture, while depth-based methods often under-segment by focusing primarily on geometric features. To address these limitations, we propose DA-Fusion, a deformable attention-based RGB-D fusion Transformer designed for unseen object instance segmentation. DA-Fusion effectively combines the strengths of both RGB and depth data, enhancing segmentation accuracy in cluttered and multi-layered object environments. We also introduce the Object Clutter Bin Dataset (OCBD), a benchmark dataset specifically tailored for evaluating bin-picking scenarios in top-down views. Extensive evaluations demonstrate that DA-Fusion outperforms state-of-the-art methods across diverse environments, making it particularly suited for real-world logistics tasks.

Abstract:
Underwater navigation is an area of increasing research interest due to its fundamental complexity and industrial applications. However, due to convenience and current theoretical understanding, the vast majority of underwater platforms utilize thrusters, while other forms of propulsion, such as undulatory locomotion, have been given limited exposure. This paper provides the first real-time motion planning framework that produces energy and time efficient paths with empirical local optimality for articulated swimming robots in 3D, called SIMP. SIMP utilizes learned associations between parameterized dynamically feasible undulatory gaits with their expected energy cost, velocity, and swept-out volume of the robot during execution, to formulate a simplified optimization problem that decides the path to be followed with the corresponding consecutive gaits, and navigates the robot safely in complex 3D environments. The proposed pipeline is tested in numerical experiments with realistic dynamics for a 10 link underwater snake robot (USR) with anguilliform gaits, in simulated cluttered environments of significant challenge, displaying real-time replanning performance of more than 1 Hz.

Abstract:
Legged robots possess inherent advantages in traversing complex 3D terrains. However, previous work on lowcost quadruped robots with egocentric vision systems has been limited by a narrow front-facing view and exteroceptive noise, restricting omnidirectional mobility in such environments. While building a voxel map through a hierarchical structure can refine exteroception processing, it introduces significant computational overhead, noise, and delays. In this paper, we present MOVE, a one-stage end-to-end learning framework capable of multi-skill omnidirectional legged locomotion with limited view in 3D environments, just like what a real animal can do. When movement aligns with the robot's line of sight, exteroceptive perception enhances locomotion, enabling extreme climbing and leaping. When vision is obstructed or the direction of movement lies outside the robot's field of view, the robot relies on proprioception for tasks like crawling and climbing stairs. We integrate all these skills into a single neural network by introducing a pseudo-siamese network structure combining supervised and contrastive learning which helps the robot infer its surroundings beyond its field of view. Experiments in both simulations and real-world scenarios demonstrate the robustness of our method, broadening the operational environments for robotics with egocentric vision.

Abstract:
Staircases are a challenging terrain frequently encountered in urban environments. While leg-wheel robots take advantage of having both legged and wheeled modes, their ability to negotiate stairs still requires careful planning. This paper presents a novel approach to developing a stair-climbing behavior for leg-wheel transformable robots. A comprehensive stair-climbing strategy is constructed by analyzing the workspace of the leg-wheel mechanism, considering the position of the robot's center of mass, and accounting for foothold displacement owing to the possible leg-wheel forward rolling motion. This strategy enables the robot to safely navigate stairs using its leg-wheel's appropriate parts. Stability during transitions between steps is ensured, and a well-designed swing trajectory is proposed to minimize slippage and impact. The approach is validated through simulations and further tested experimentally on staircases with treads of 27 cm and risers of 12 cm, as well as staircases with treads of 24 cm and risers of 14 cm. The experimental results demonstrate the effectiveness and robustness of the proposed method.

Abstract:
We address the motion planning problem for multiple robotic manipulators in packed environments where shared workspace can result in goal positions occupied or blocked by other robots unless those other robots move away to make the goal positions free. While planning in a coupled configuration space (C-space) is straightforward, it struggles to scale with the number of robots and often fails to find solutions. Decoupled planning is faster but frequently leads to conflicts between trajectories. We propose a conflict resolution approach that inserts pauses into individually planned trajectories using an A^ search strategy to minimize the makespan–the total time until all robots complete their tasks. This method allows some robots to stop, enabling others to move without collisions, and maintains short distances in the C-space. It also effectively handles cases where goal positions are initially blocked by other robots. Experimental results show that our method successfully solves challenging instances where baseline methods fail to find feasible solutions.

Abstract:
Varying dynamics pose a fundamental difficulty when deploying safe control laws in the real world. Safety Index Synthesis (SIS) deeply relies on the system dynamics and once the dynamics change, the previously synthesized safety index becomes invalid. In this work, we show the real-time efficacy of Safety Index Adaptation (SIA) in varying dynamics. SIA enables real-time adaptation to the changing dynamics so that the adapted safe control law can still guarantee 1) forward invariance within a safe region and 2) finite time convergence to that safe region. This work employs SIA on a packagecarrying quadruped robot, where the payload weight changes in real-time. SIA updates the safety index when the dynamics change, e.g., a change in payload weight, so that the quadruped can avoid obstacles while achieving its performance objectives. Numerical study provides theoretical guarantees for SIA and a series of hardware experiments demonstrate the effectiveness of SIA in real-world deployment in avoiding obstacles under varying dynamics.

Abstract:
This paper addresses a distributed leader-follower formation control problem for a group of agents, each using a body-fixed camera with a limited field of view (FOV) for state estimation. The main challenge arises from the need to coordinate the agents' movements with their cameras' FOV to maintain visibility of the leader for accurate and reliable state estimation. To address this challenge, we propose a novel perception-aware distributed leader-follower safe control scheme that incorporates FOV limits as state constraints. A Control Barrier Function (CBF) based quadratic program is employed to ensure the forward invariance of a safety set defined by these constraints. Furthermore, new neural network based and double bounding boxes based estimators, combined with temporal filters, are developed to estimate system states directly from real-time image data, providing consistent performance across various environments. Comparison results in the Gazebo simulator demonstrate the effectiveness and robustness of the proposed framework in two distinct environments.

Abstract:
The integration of humanoid and animal-shaped robots into specialized domains, such as healthcare, multiterrain operations, and psychotherapy, necessitates a deep understanding of proxemics-the study of spatial behavior that governs effective human-robot interactions. Unlike traditional robots in manufacturing or logistics, these robots must navigate complex human environments where maintaining appropriate physical and psychological distances is crucial for seamless interaction. This study explores the application of proxemics in human-robot interactions, focusing specifically on quadruped robots, which present unique challenges and opportunities due to their lifelike movement and form. Utilizing a motion capture system, we examine how different interaction postures of a canine robot influence human participants' proxemic behavior in dynamic scenarios. By capturing and analyzing position and orientation data, this research aims to identify key factors that affect proxemic distances and inform the design of socially acceptable robots. The findings underscore the importance of adhering to human psychological and physical distancing norms in robot design, ensuring that autonomous systems can coexist harmoniously with humans.

Abstract:
Soft grippers have shown promising performance in safe and adaptive grasping tasks. However, they often suffer from limitations in grasping force. To address this challenge, this paper presents a novel pneumatic three-finger soft gripper to achieve robust adaptive grasping. The gripper consists of three identical fingers, each containing a pneumatic bending soft actuator and a pneumatic lateral soft actuator. The bending actuator features a tilted pneumatic network structure, which provides superior bending performance compared to traditional vertical pneumatic network structure. The lateral actuator is equipped with three deflection chambers at the finger root to mimic the lateral motions of a human finger. Kinematic and static models are established to predict the bending angle and grasping force of the soft finger under pressurized air. The performance of the proposed soft finger is analyzed through finite element simulations, and the effect of the chamber tilt angle is also examined. The theoretical and simulation results are compared to verify the validity of the analytical models. Finally, the proposed soft gripper is fabricated by 3D printing and molding. Experimental results show that the gripper is capable of grasping various objects of different sizes, shapes, materials, and weights, and can perform dexterous manipulation tasks, such as cap unscrewing. The proposed soft gripper exhibits significant potential for applications in robotic robust grasping tasks.

Abstract:
Ground texture localization using a downward-facing camera offers a low-cost, high-precision localization solution that is robust to dynamic environments and requires no environmental modification. We present a significantly improved bag-of-words (BoW) image retrieval system for ground texture localization, achieving substantially higher accuracy for global localization and higher precision and recall for loop closure detection in SLAM. Our approach leverages an approximate k-means (AKM) vocabulary with soft assignment, and exploits the consistent orientation and constant scale constraints inherent to ground texture localization. Identifying the different needs of global localization vs. loop closure detection for SLAM, we present both high-accuracy and high-speed versions of our algorithm. We test the effect of each of our proposed improvements through an ablation study and demonstrate our method's effectiveness for both global localization and loop closure detection. With numerous ground texture localization systems already using BoW, our method can readily replace other generic BoW systems in their pipeline and immediately improve their results.

Abstract:
This paper proposes a versatile graph-based lifelong localization framework using LiDAR, LiLoc, which enhances its timeliness by maintaining a single central session while improves the accuracy through multi-modal factors between the central and subsidiary sessions. First, an adaptive submap joining strategy is employed to generate prior submaps (keyframes and poses) for the central session, and to provide priors for subsidiaries when constraints are needed for robust localization. Next, a coarse-to-fine pose initialization for subsidiary sessions is performed using vertical recognition and ICP refinement in the global coordinate frame. To elevate the accuracy of subsequent localization, we propose an egocentric factor graph (EFG) module that integrates the IMU preintegration, LiDAR odometry and scan match factors in a joint optimization manner. Specifically, the scan match factors are constructed by a novel propagation model that efficiently distributes the prior constrains as edges to the relevant prior pose nodes, weighted by noises based on keyframe registration errors. Additionally, the framework supports flexible switching between two modes: relocalization (RLM) and incremental localization (ILM) based on the proposed overlap-based mechanism to select or update the prior submaps from central session. The proposed LiLoc is tested on public and custom datasets, demonstrating accurate localization performance against state-of-the-art methods. Our codes will be publicly available on https://github.com/Yixin-F/LiLoc.

Abstract:
Robot navigation in dense human crowds poses a significant challenge due to the complexity of human behavior in dynamic and obstacle-rich environments. In this work, we propose a dynamic weight adjustment scheme using a neural network to predict the optimal weights of objectives in an optimization-based motion planner. We adopt a spatial-temporal trajectory planner and incorporate diverse objectives to achieve a balance among safety, efficiency, and goal achievement in complex and dynamic environments. We design the network structure, observation encoding, and reward function to effectively train the policy network using reinforcement learning, allowing the robot to adapt its behavior in real time based on environmental and pedestrian information. Simulation results show improved safety compared to the fixed-weight planner and the state-of-the-art learning-based methods, and verify the ability of the learned policy to adaptively adjust the weights based on the observed situations. The feasibility of the approach is demonstrated in a navigation task using an autonomous delivery robot across a crowded corridor over a 300 m distance. Video: https://youtu.be/nSCbNaaF_VM

Abstract:
Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors, and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed into the real world-forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping. Project Page: https://iker-robot.github.io/

Abstract:
Robot models, particularly those trained with large amounts of data, have recently shown a plethora of real-world manipulation and navigation capabilities. Several independent efforts have shown that given sufficient training data in an environment, robot policies can generalize to demonstrated variations in that environment. However, needing to finetune robot models to every new environment stands in stark contrast to models in language or vision that can be deployed zero-shot for open-world problems. In this work, we present Robot Utility Models (RUMs), a framework for training and deploying zero-shot robot policies that can directly generalize to new environments without any finetuning. To create RUMs efficiently, we develop new tools to quickly collect data for mobile manipulation tasks, integrate such data into a policy with multi-modal imitation learning, and deploy policies ondevice on the Hello Robot Stretch, a cheap commodity robot, with an external mLLM verifier for retrying. We train five such utility models for opening cabinet doors, opening drawers, picking up napkins, picking up paper bags, and reorienting fallen objects. Our system, on average, achieves 90% success rate in unseen, novel environments interacting with unseen objects. Primary among our lessons are the importance of training data over training algorithm and policy class, guidance about data scaling, necessity for diverse yet high-quality demonstrations, and a recipe for robot introspection and retrying to improve performance on individual environments.

Abstract:
We present a novel method for scene change detection that leverages the robust feature extraction capabilities of a visual foundational model, DINOv2, and integrates full-image cross-attention to address key challenges such as varying lighting, seasonal variations, and viewpoint differences. In order to effectively learn correspondences and mis-correspondences between an image pair for the change detection task, we propose to a) “freeze” the backbone in order to retain the generality of dense foundation features, and b) employ “full-image” cross-attention to better tackle the viewpoint variations between the image pair. We evaluate our approach on two benchmark datasets, VL-CMU-CD and PSCD, along with their viewpoint-varied versions. Our experiments demonstrate significant improvements in F1-score, particularly in scenarios involving geometric changes between image pairs. The results indicate our method's superior generalization capabilities over existing state-of-the-art approaches, showing robustness against photometric and geometric variations as well as better overall generalization when fine-tuned to adapt to new environments. Detailed ablation studies further validate the contributions of each component in our architecture. Our source code is available at: https://github.com/ChadLin9596/Robust-Scene-Change-Detection.

Abstract:
Parking slot detection is an essential technology in autonomous parking systems. In general, the classification problem of parking slot detection consists of two tasks, a task determining whether localized candidates are junctions of parking slots or not, and the other that identifies a shape of detected junctions. Both classification tasks can easily face biased learning toward the majority class, degrading classification performances. Yet, the data imbalance issue has been overlooked in parking slot detection. We propose the first supervised contrastive learning framework for parking slot detection, Localized and Balanced Contrastive Learning for improving parking slot detection (LaB-CL). The proposed LaBCL framework uses two main approaches. First, we propose to include class prototypes to consider representations from all classes in every mini batch, from the local perspective. Second, we propose a new hard negative sampling scheme that selects local representations with high prediction error. Experiments with the benchmark dataset demonstrate that the proposed LaB-CL framework can outperform existing parking slot detection methods.

Abstract:
Although extensive research has been conducted on modeling the stable longitudinal controller of automated vehicles (AVs) to dampen traffic oscillations, the real-world performance of these controllers in actual vehicles remains uncertain. In the operation of real-world AVs, the delay between actual dynamics and the commands prevents the controller's command from being effectively implemented to dampen traffic oscillations. Thus, this study adapts the designed controllers within an AV test platform to compare the theoretically stable conditions with the actual oscillation dampening performance. Initially, we compute the stable conditions for both the traditional car-following controller, which assumes no delay, and the longitudinal controller that accounts for the dynamic response of the vehicle. Through empirical experiments, we demonstrate that the longitudinal controller predicts vehicle stability more accurately than conventional car-following controller, showing an improvement from an average prediction accuracy rate of 0.59 to 0.91. Also, the experiments uncover specific delays inherent in dynamics systems, with a response delay of 0.34 seconds. Our work makes two principal contributions to the field of AV control systems. First, it empirically validates that the longitudinal model, which accounts for the vehicle's dynamic responses, offers a more precise representation of vehicular behavior. Second, the relatively brief response delay identified expands the stability region, thereby enhancing vehicle control and safety. The longitudinal controller is critical for enhancing AV performance and reliability in dampening traffic oscillations.

Abstract:
Online planning and execution of minimum-time maneuvers on three-dimensional (3D) circuits is an open challenge in autonomous vehicle racing. In this paper, we present an artificial race driver (ARD) to learn the vehicle dynamics, plan and execute minimum-time maneuvers on a 3D track. ARD integrates a novel kineto-dynamical (KD) vehicle model for trajectory planning with economic nonlinear model predictive control (E-NMPC). We use a high-fidelity vehicle simulator (VS) to compare the closed-loop ARD results with a minimum-lap-time optimal control problem (MLT-VS), solved offline with the same VS. Our ARD sets lap times close to the MLT-VS, and the new KD model outperforms a literature benchmark. Finally, we study the vehicle trajectories, to assess the re-planning capabilities of ARD under execution errors. A video with the main results is available as supplementary material.

Abstract:
Accurate localization is essential for the safe and effective navigation of autonomous vehicles, and Simultaneous Localization and Mapping (SLAM) is a cornerstone technology in this context. However, The performance of the SLAM system can deteriorate under challenging conditions such as low light, adverse weather, or obstructions due to sensor degradation. We present A2DO, a novel end-to-end multi-sensor fusion odometry system that enhances robustness in these scenarios through deep neural networks. A2DO integrates LiDAR and visual data, employing a multilayer, multi-scale feature encoding module augmented by an attention mechanism to mitigate sensor degradation dynamically. The system is pretrained extensively on simulated datasets covering a broad range of degradation scenarios and fine-tuned on a curated set of real-world data, ensuring robust adaptation to complex scenarios. Our experiments demonstrate that A2DO maintains superior localization accuracy and robustness across various degradation conditions, showcasing its potential for practical implementation in autonomous vehicle systems.

Abstract:
We present a new method to combine several rigidly connected but physically separated IMUs through a weighted average into a single virtual IMU (VIMU). This has the benefits of (i) reducing process noise through averaging, and (ii) allowing for tuning the location of the VIMU. The VIMU can be placed to be coincident with, for example, a camera frame or GNSS frame, thereby offering a quality-of-life improvement for users. Specifically, our VIMU removes the need to consider any lever-arm terms in the propagation model. We also present a quadratic programming method for selecting the weights to minimize the noise of the VIMU while still selecting the placement of its reference frame. We tested our method in simulation and validated it on a real dataset. The results show that our averaging technique works for IMUs with large separation and performance gain is observed in both the simulation and the real experiment compared to using only a single IMU.

Abstract:
We report a new full-optical pretouch dual-modal and dual-mechanism (PDM2) sensor based on an air-coupled fiber-tip surface micromachined optical ultrasound transducer (SMOUT). Compared to ring-shaped piezoelectric acoustic receivers in previous PDM2 sensors, the acoustic signal received by the new fiber-tip SMOUT is readout optically, which is naturally resistant to surrounding electromagnetic interference (EMI) and makes the complex grounding and shielding unnecessary. In addition, the new fiber-tip SMOUT receiver has a much smaller size, which makes it possible to further miniaturize the sensor package into a more compact structure. For verification, a prototype of the full-optical PDM2 sensor has been designed, fabricated, and characterized. The experimental results show that even with the much smaller acoustic receiver, the new sensor can still achieve ranging and material/structure sensing performances comparable with the previous ones. Therefore, the new full optical PDM2 sensor design is promising in providing a practical and miniaturized solution for ranging and material/structure sensing to assist robotic grasping of unknown objects.

Abstract:
End-to-end autonomous driving offers a stream-lined alternative to the traditional modular pipeline, integrating perception, prediction, and planning within a single framework. While Deep Reinforcement Learning (DRL) has recently gained traction in this domain, existing approaches often overlook the critical connection between feature extraction of DRL and perception. In this paper, we bridge this gap by mapping the DRL feature extraction network directly to the perception phase, en-abling clearer interpretation through semantic segmentation. By leveraging Bird's-Eye- View (BEV) representations, we propose a novel DRL-based end-to-end driving framework that utilizes multi-sensor inputs to construct a unified three-dimensional understanding of the environment. This BEV-based system extracts and translates critical environmental features into high-level abstract states for DRL, facilitating more informed control. Extensive experimental evaluations demonstrate that our approach not only enhances interpretability but also significantly outperforms state-of-the-art methods in autonomous driving control tasks, reducing the collision rate by 20 %.

Abstract:
Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks. We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach 3D- LOTUS leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs and the object grounding accuracy of VLMs. 3D-LOTUS++ achieves state-of-the-art per-formance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation. Code, dataset, real robot videos and trained models are available at https://www.di.ens.fr/willow/research/gembench/.

Abstract:
There has been a lot of interest in grounding natural language to physical entities through visual context. While Vision Language Models (VLMs) can ground linguistic instructions to visual sensory information, they struggle with grounding non-visual attributes, like the weight of an object. Our key insight is that non-visual attribute detection can be effectively achieved by active perception guided by visual reasoning. To this end, we present a perception-action API that consists of VLMs and Large Language Models (LLMs) as backbones, together with a set of robot control functions. When prompted with this API and a natural language query, an LLM generates a program to actively identify attributes given an input image. Offline testing on the Odd-One-Out (\mathbfO^\mathbf3) dataset demonstrates that our framework outperforms vanilla VLMs in detecting attributes like relative object location, size, and weight. Online testing in realistic household scenes on AI2THOR and a real robot demonstration on a DJI RoboMaster EP robot highlight the efficacy of our approach.

Abstract:
Designing trajectories for manipulation through contact is challenging as it requires reasoning of object & robot trajectories as well as complex contact sequences simultaneously. In this paper, we present a novel framework for simultaneously designing trajectories of robots, objects, and contacts efficiently for contact-rich manipulation. We propose a hierarchical optimization framework where Mixed-Integer Linear Program (MILP) selects optimal contacts between robot & object using approximate dynamical constraints, and then a NonLinear Program (NLP) optimizes trajectory of the robot(s) and object considering full nonlinear constraints. We present a convex relaxation of bilinear constraints using binary encoding technique such that MILP can provide tighter solutions with better computational complexity. The proposed framework is evaluated on various manipulation tasks where it can reason about complex multi-contact interactions while providing computational advantages. We also demonstrate our framework in hardware experiments using a bimanual robot system.

Abstract:
We present a novel statistical approach to incorporate uncertainty awareness in model-free distributional deep reinforcement learning for mission and safety-critical robotics. Deep learning predictions are influenced by uncertainties in the data, termed as aleatoric uncertainties, as well as uncertainties in the learning process and model structure, known as epistemic uncertainties. The proposed algorithm, called as Calibrated Evidential Quantile Regression in Deep-Q Networks (CEQR-DQN), addresses key challenges associated with separately estimating aleatoric and epistemic uncertainty in stochastic robotic environments. It combines deep evidential learning with quantile calibration based on the principles of conformal inference to provide explicit, sample-free computations of global uncertainty as opposed to local estimates based on simple variance. Thereby, the proposed approach overcomes limitations of traditional methods in computational and statistical efficiency and handling of out-of-distribution (OOD) observations. Tested on a suite of representative miniaturized Atari games (i.e., MinAtar), CEQR-DQN is shown to surpass similar existing frameworks in scores and learning speed. Its ability to rigorously evaluate uncertainties improves exploration strategies and can serve as a blueprint for other uncertainty-aware robotic algorithms.

Abstract:
Predicting potential collision events is beneficial to ensure the driving safety of autonomous vehicles. Existing graph-based collision prediction methods rely heavily on domain knowledge and predefined semantic relations, limiting their flexibility and adaptability in complex driving scenarios. To overcome these challenges, this paper introduces a novel collision prediction framework named HGAT-CP, which integrates a Heterogeneous Graph Attention Network (HGAT) with a Long Short-Term Memory network (LSTM) to model the spatial-temporal interactions in scenes. First, the proposed method employs a data-driven scene graph embedding module to autonomously learn relationships between vehicles and lanes and construct flexible scene graphs. Then, the HGAT module utilizes a dual-level attention mechanism, operating at both the node level and type level, to capture spatial interactions without relying on predefined semantic rules. The LSTM module models temporal dependencies of the scene graph embeddings to improve the prediction of collision events over time. Experimental evaluations on public datasets demonstrate that our proposed method achieves state-of-the-art performance, outperforming existing methods across all metrics.

Abstract:
Simultaneous localization and mapping (SLAM) is a crucial component of unmanned systems, playing a key role in autonomous navigation. Currently, most LiDAR SLAM methods are focused on structured environments. However, highly irregular off-road terrain poses more challenges for LiDAR SLAM tasks, but these environments are not fully represented in existing datasets. To address this issue, we introduce the first dedicated LiDAR SLAM benchmark dataset for off-road environments, named Jlurobot Off-Road Dadaset (JORD). This dataset is collected using a custom avenger data collection platform in large-scale forest off-road scenes, consisting of 8 LiDAR sequences with a total length of approximately 6.07 kilometers, containing 49,144 point cloud frames along with accurate 6DoF ground truth. The dataset includes multiple revisit information within the sequences, making it suitable for LiDAR place recognition and SLAM tasks. Furthermore, we employe several state-of-the-art methods for benchmarking to validate the dataset's challenges. The release of JORD aims to provide researchers with valuable resources to develop new approaches and explore novel directions for unmanned systems in off-road environments. The complete dataset and code is available at https://github.com/jiurobots/JORD.

Abstract:
Autonomous ground robots navigating unstructured off-road environments face perceptual challenges, such as sensor obscuration or failure, which can lead to inaccurate perception or navigation failures. While robot adaptation has recently gained increasing attention, self-reflective robot adaptation, where robots understand and adjust to their own sensor limitations, remains under-explored. This paper proposes a novel approach for self-reflective perceptual adaptation in order to enhance robust off-road navigation. Our approach enables a robot to identify its own perceptual difficulties and dynamically adapt in challenging environments. The key novelty is learning a modality-invariant perceptual representation that encodes shared sensor data into a compact feature space. Within this representation space, the robot's dynamics model is also learned, which enables accurate prediction of future navigation paths. Extensive experiments in off-road environments with sensor obstructions and failures demonstrate that our method significantly improves adaptive capabilities and outperforms baseline and state-of-the-art approaches. More details of this work are provided on the project website: https://hcrlab.gitlab.io/project/srpa.

Abstract:
Rapid autonomous traversal of unstructured terrain is essential for scenarios such as disaster response, search and rescue, and planetary exploration. As a vehicle navigates at the limit of its capabilities over extreme terrain, its dynamics can change suddenly and dramatically. For example, varying terrain can affect parameters such as traction, tire slip, and rolling resistance. To achieve effective planning in such environments, it is crucial to have a dynamics model that can accurately anticipate these conditions and respond before an issue can occur. In this work, we present a hybrid model that predicts the changing dynamics induced by the terrain as a function of visual inputs. We leverage a pre-trained visual foundation model (VFM) DINOv2, which provides rich features that encodes fine-grained semantic information. To use this dynamics model for planning, we propose an end-to-end training architecture for a projection distance independent feature encoder that compresses the information from the VFM, enabling the creation of a lightweight map of the environment at runtime. We validate our architecture on an extensive dataset (hundreds of kilometers of aggressive off-road driving) collected across multiple locations as part of the DARPA Robotic Autonomy in Complex Environments with Resiliency (RACER) program. https://youtu.be/aydHxLGmnx8

Abstract:
Contrary to on-road autonomous navigation, off-road autonomy is complicated by various factors ranging from sensing challenges to terrain variability. In such a milieu, data-driven approaches have been commonly employed to capture intricate vehicle-environment interactions effectively. However, the success of data-driven methods depends crucially on the quality and quantity of data, which can be compromised by large variability in off-road environments. To address these concerns, we present a novel methodology to recreate the exact vehicle and its target operating conditions digitally for domain-specific data generation. This enables us to effectively model off-road vehicle dynamics from simulation data using the Koopman operator theory, and employ the obtained models for local motion planning and optimal vehicle control. The capabilities of the proposed methodology are demonstrated through an autonomous navigation problem of a 1:5 scale vehicle, where a terrain-informed planner is employed for global mission planning. Results indicate a substantial improvement in off-road navigation performance with the proposed algorithm (↑ 5.84×) and underscore the efficacy of digital twinning in terms of improving the sample efficiency (↑ 3.2×) and reducing the sim2real gap (↓ 5.2%).

Abstract:
This work presents an overview of the techni-cal details behind a high-performance reinforcement learning policy deployment with the Spot RL Researcher Development Kit for low-level motor access on Boston Dynamic's Spot. This represents the first public demonstration of an end-to-end reinforcement learning policy deployed on Spot hardware with training code publicly available through Nvidia IsaacLab and deployment code available through Boston Dynamics. We utilize Wasserstein Distance and Maximum Mean Discrepancy to quantify the distributional dissimilarity of data collected on hardware and in simulation to measure our sim-to-real gap. We use these measures as a scoring function for the Covariance Matrix Adaptation Evolution Strategy to optimize simulated parameters that are unknown or difficult to measure from Spot. Our procedure for modeling and training produces high-quality reinforcement learning policies capable of multiple gaits, including a flight phase. We deploy policies capable of over 5.2m/s locomotion, more than triple Spot's default controller maximum speed, robustness to slippery surfaces, disturbance rejection, and overall agility previously unseen on Spot. We detail our method and release our code to support future work on Spot with the low-level API.

Abstract:
We present a Morphology-Informed Heterogeneous Graph Neural Network (MI-HGNN) for learning-based contact perception. The architecture and connectivity of the MI-HGNN are constructed from the robot morphology, in which nodes and edges are robot joints and links, respectively. By incorporating the morphology-informed constraints into a neural network, we improve a learning-based approach using model-based knowledge. We apply the proposed MI-HGNN to two contact perception problems, and conduct extensive experiments using both real-world and simulated data collected using two quadruped robots. Our experiments demonstrate the superiority of our method in terms of effectiveness, generalization ability, model efficiency, and sample efficiency. Our MI-HGNN improved the performance of a state-of-the-art model that leverages robot morphological symmetry by 8.4 % with only 0.21 % of its parameters. Although MI-HGNN is applied to contact perception problems for legged robots in this work, it can be seamlessly applied to other types of multi-body dynamical systems and has the potential to improve other robot learning frameworks. Our code is made publicly available at https://github.com/lunarlab-gatech/Morphology-Informed-HGNN.

Abstract:
This paper introduces a motion planning approach for navigating in a dynamic environment. The path is represented using a Non-Uniform Rational B-Spline (NURBS) to ensure smoothness, curvature continuity, and proper orientation by adjusting its parameters. A Differential Evolution algorithm optimizes the curve parameters and traversal speed at each replanning interval, taking into account speed limits, maximum curvature, and obstacles in the environment. A constraintbased on Velocity Obstacle (VO) ensures collision-free motion, considering bounds provided by lower-level controllers. The feasibility of the approach is validated through simulations and real-world experiments with the Crazyflie 2.1 micro quadcopter.

Abstract:
Autonomous navigation in ice-covered waters poses significant challenges due to the frequent lack of viable collision-free trajectories. When complete obstacle avoidance is infeasible, it becomes imperative for the navigation strategy to minimize collisions. Additionally, the dynamic nature of ice, which moves in response to ship maneuvers, complicates the path planning process. To address these challenges, we propose a novel deep learning model to estimate the coarse dynamics of ice movements triggered by ship actions through occupancy estimation. To ensure real-time applicability, we propose a novel approach that caches intermediate prediction results and seamlessly integrates the predictive model into a graph search planner. We evaluate the proposed planner both in simulation and in a physical testbed against existing approaches and show that our planner significantly reduces collisions with ice when compared to the state-of-the-art. Codes and demos of this work are available at https://github.com/IvanIZ/predictive-asv-planner.

Abstract:
We present a novel mission-planning strategy for heterogeneous multi-robot teams, taking into account the specific constraints and capabilities of each robot. Our approach employs hierarchical trees to systematically break down complex missions into manageable sub-tasks. We develop specialized APIs and tools, which are utilized by Large Language Models (LLMs) to efficiently construct these hierarchical trees. Once the hierarchical tree is generated, it is further decomposed to create optimized schedules for each robot, ensuring adherence to their individual constraints and capabilities. We demonstrate the effectiveness of our framework through detailed examples covering a wide range of missions, showcasing its flexibility and scalability.

Abstract:
Achieving precise target jumping with legged robots poses a significant challenge due to the long flight phase and the uncertainties inherent in contact dynamics and hardware. Forcefully attempting these agile motions on hardware could result in severe failures and potential damage. Motivated by this challenge, we propose an Iterative Learning Control (ILC) approach to learn and refine jumping skills from easy to difficult, instead of directly learning these challenging tasks. We verify that learning from simplicity can enhance safety and target jumping accuracy over trials. Compared to other ILC approaches for legged locomotion, our method can tackle the problem of a long flight phase where control input is not available. In addition, our approach allows the robot to apply what it learns from a simple jumping task to accomplish more challenging tasks within a few trials directly in hardware, instead of learning from scratch. We validate the method through extensive experiments on the Al model and hardware for various tasks. Starting from a small jump (e.g., a forward jump 40cm), our learning approach empowers the robot to accomplish a variety of challenging targets, including jumping onto a 20cm high box, leaping to a greater distance of up to 60cm, as well as performing jumps while carrying an unknown payload of 2kg. Our framework allows the robot to reach the desired position and orientation targets with approximate errors of 1 cm and 10 within a few trials.

Abstract:
Cloth state estimation is an important problem in robotics. It is essential for the robot to know the accurate state to manipulate cloth and execute tasks such as robotic dressing, stitching, and covering/uncovering human beings. However, accurately estimating the cloth state remains challenging due to the high flexibility and self-occlusion of cloth. This paper proposes a diffusion model-based pipeline that formulates the cloth state estimation as an image generation problem by representing the cloth state as an RGB image that describes the point-wise translation (translation map) between a predefined flattened mesh and the deformed mesh in a canonical space. Then we train a conditional diffusion-based image generation model to predict the translation map based on an observation. Experiments are conducted in both simulation and the real world to validate the performance of our method. Results indicate that our method outperforms two recent methods in both accuracy and speed. More results and code are available on our project website: https://chirikjianlab.github.io/RaggeDi/

Abstract:
Mobile robots are increasingly deployed in shared environments where they must learn to navigate alongside humans. Deep Reinforcement Learning (DRL) techniques have shown promise in developing navigation policies that account for interactions within crowds, fostering socially acceptable movement. However, these techniques often depend heavily on collision avoidance rewards to ensure safe navigation. In this study, we introduce a novel reward component based on relative velocity for collision avoidance, which integrates both the robot's and humans' kinematics within personal distance constraints. We conducted a thorough evaluation comparing this new reward model against a conventional one in simulated environments using advanced DRL methods. Our findings indicate that the proposed reward model improves the robots' ability to avoid collisions and navigate towards their goals while being socially acceptable.

Abstract:
Between 1999 and 2020, gastrointestinal cancers were responsible for over three million deaths, emphasizing the critical role of minimally invasive surgical techniques like Endoscopic Submucosal Dissection (ESD) in managing such life-threatening conditions. ESD, which dissects the connective tissue between the mucosal and muscular layers using an electrosurgical knife connected to an endoscope, requires a constant traction force to stabilize tissues and expose underlying anatomical structures. This paper introduces a miniature magnetic flexible robot, actuated by a permanent magnet on a robotic manipulator, designed to enhance ESD by providing traction forces consistently on lesions. The robot was fabricated by casting magnetic silicone composites, and its safe deployment through the endoscope instrument channel was successfully demonstrated, avoiding tissue contact. Experiments in a rubber intestine model validated the feasibility of providing constant traction and 2 DOF orientation control via the robot, allowing real-time fine-tuning of the force direction. This reduces the difficulty and improves the precision and safety of ESD. This research presents a practical method for achieving stable force output in medical miniature robots, particularly in gastrointestinal procedures.

Abstract:
Robotic platforms provide consistent and precise tool positioning that significantly enhances retinal microsurgery. Integrating such systems with intraoperative optical coherence tomography (iOCT) enables image-guided robotic interventions, allowing autonomous performance of advanced treatments, such as injecting therapeutic agents into the subretinal space. However, tissue deformations due to tool-tissue interactions constitute a significant challenge in autonomous iOCT-guided robotic subretinal injections. Such interactions impact correct needle positioning and procedure outcomes. This paper presents a novel method for autonomous subretinal injection under iOCT guidance that considers tissue deformations during the insertion procedure. The technique is achieved through real-time segmentation and 3D reconstruction of the surgical scene from densely sampled iOCT B-scans, which we refer to as B5_ scans. Using B5-scans we monitor the position of the instrument relative to a virtual target layer between the ILM and RPE. Our experiments on ex-vivo porcine eyes demonstrate dynamic adjustment of the insertion depth and overall improved accuracy in needle positioning compared to prior autonomous insertion approaches. Compared to a 35% success rate in subretinal bleb generation with previous approaches, our method reliably created subretinal blebs in 90% our experiments. The source code and data used in this study are publicly available on GitHub11https://github.com/demirarikan/virtual-Iayer-retinal-surgery.

Abstract:
Efficient and high-fidelity reconstruction of deformable surgical scenes is a critical yet challenging task. Building on recent advancements in 3D Gaussian splatting, current methods have seen significant improvements in both reconstruction quality and rendering speed. However, two major limitations remain: (1) difficulty in handling irreversible dynamic changes, such as tissue shearing, which are common in surgical scenes; and (2) the lack of hierarchical modeling for surgical scene deformation, which reduces rendering speed. To address these challenges, we introduce EH-SurGS, an efficient and high-fidelity reconstruction algorithm for deformable surgical scenes. We propose a deformation modeling approach that incorporates the life cycle of 3D Gaussians, effectively capturing both regular and irreversible deformations, thus enhancing reconstruction quality. Additionally, we present an adaptive motion hierarchy strategy that distinguishes between static and deformable regions within the surgical scene. This strategy reduces the number of 3D Gaussians passing through the deformation field, thereby improving rendering speed. Extensive experiments on public datasets captured with static endoscopes demonstrate that our method surpasses existing state-of-the-art approaches in both reconstruction quality and rendering speed. Ablation studies further validate the effectiveness and necessity of our proposed components. We will open-source our code upon acceptance of the paper.

Abstract:
Reinforcement learning (RL) has long struggled with exploration in vast state-action spaces, particularly for intricate tasks that necessitate a series of well-coordinated actions. Meanwhile, large language models (LLMs) equipped with fundamental knowledge have been utilized for task planning across various domains. However, using them to plan for long-term objectives can be demanding, as they function independently from task environments where their knowledge might not be perfectly aligned, hence often overlooking possible physical limitations. To this end, we propose a goal-based RL framework that leverages prior knowledge of LLMs to benefit the training process. We introduce a hierarchical module that features a goal generator to segment a long-horizon task into reachable subgoals and a policy planner to generate action sequences based on the current goal. Subsequently, the policies derived from LLMs guide the RL to achieve each subgoal sequentially. We validate the effectiveness of the proposed framework across different simulation environments and long-horizon tasks with complex state and action spaces. The LLM prompts we use and more details can be found at https://chirikjianlab.github.io/G2RL-LM/.

Abstract:
In robotics, the design of robot behavior trees generally requires roboticists to comprehensively and customizable consider all the relevant factors including the robot hardware capabilities, task descriptions, etc, posing great challenges for design quality and efficiency. The mainstream practice of BT design paradigm has been utilizing the BT component framework to develop task-specific BT structures manually. In contrast, the latest advances in Generative Pretrained Transformers (GPTs) have also opened up the possibility of BT design automation. However, these approaches generally show low efficiency or are less trustworthy for complex robot task goals due to time-consuming manual design and unreliable GPT reasoning. To solve the above limitations, this paper proposes a novel knowledge-driven approach that develops a specialized knowledge graph from multi-sourced and heterogeneous highquality robot knowledge to reason out a trustworthy robot plan for achieving complex task goals. Then we present the plan transformation and BT merging algorithms to automatically generate the plan-level BT structure. The comparative experiment results have shown that our approach can generate highquality and trustworthy BT structure regarding the task plan accuracy and consistency, as well as the BT generation time, compared with the manual design and GPT-based approaches.

Abstract:
As robots increasingly coexist with humans, they must navigate complex, dynamic environments rich in visual information and implicit social dynamics, like when to yield or move through crowds. Addressing these challenges requires significant advances in vision-based sensing and a deeper understanding of socio-dynamic factors, particularly in tasks like navigation. To facilitate this, robotics researchers need advanced simulation platforms offering dynamic, photorealistic environments with realistic actors. Unfortunately, most existing simulators fall short, prioritizing geometric accuracy over visual fidelity, and employing unrealistic agents with fixed trajectories and low-quality visuals. To overcome these limitations, we developed a simulator that incorporates three essential elements: (1) photorealistic neural rendering of environments, (2) neurally animated human entities with behaviour management, and (3) an ego-centric robotic agent providing multi-sensor output. By utilizing advanced neural rendering techniques in a dual-NeRF simulator, our system produces high-fidelity, photorealistic renderings of both environments and human entities. Additionally, it integrates a state-of-the-art Social Force Model (SoFM) to model dynamic human-human and human-robot interactions, creating the first photorealistic and accessible human-robot simulation system powered by neural rendering. The code for the simulator is available at https://gitlab.surrey.ac.uk/gn00217/radiance-of-neural-fields-simulator/.

Abstract:
Recent advancements in Large Language Models (LLMs) have sparked a revolution across many research fields. In robotics, the integration of common-sense knowledge from LLMs into task and motion planning has drastically advanced the field by unlocking unprecedented levels of context awareness. Despite their vast collection of knowledge, large language models may generate infeasible plans due to hal-lucinations or missing domain information. To address these challenges and improve plan feasibility and computational efficiency, we introduce DELTA, a novel LLM-informed task planning approach. By using scene graphs as environment representations within LLMs, DELTA achieves rapid generation of precise planning problem descriptions. To enhance planning performance, DELTA decomposes long-term task goals with LLMs into an autoregressive sequence of sub-goals, enabling automated task planners to efficiently solve complex problems. In our extensive evaluation, we show that DELTA enables an efficient and fully automatic task planning pipeline, achieving higher planning success rates and significantly shorter planning times compared to the state of the art. Project webpage: https://delta-llm.github.io/

Abstract:
Quadruped robots face limitations in long-range navigation efficiency due to their reliance on legs. To ameliorate the limitations, we introduce a Reinforcement Learning-based Active Transporter Riding method (RL-ATR), inspired by humans' utilization of personal transporters, including Segways. The RL-ATR features a transporter riding policy and two state estimators. The policy devises adequate maneuvering strategies according to transporter-specific control dynamics, while the estimators resolve sensor ambiguities in non-inertial frames by inferring unobservable robot and transporter states. Comprehensive evaluations in simulation validate proficient command tracking abilities across various transporter-robot models and reduced energy consumption compared to legged locomotion. Moreover, we conduct ablation studies to quantify individual component contributions within the RL-ATR. This riding ability could broaden the locomotion modalities of quadruped robots, potentially expanding the operational range and efficiency.

Abstract:
A core strength of Model Predictive Control (MPC) for quadrupedal locomotion has been its ability to enforce constraints and provide interpretability of the sequence of commands over the horizon. However, despite being able to plan, MPC struggles to scale with task complexity, often failing to achieve robust behavior on rapidly changing surfaces. On the other hand, model-free Reinforcement Learning (RL) methods have outperformed MPC on multiple terrains, showing emergent motions but inherently lack any ability to handle constraints or perform planning. To address these limitations, we propose a framework that integrates proprioceptive planning with RL, allowing for agile and safe locomotion behaviors through the horizon. Inspired by MPC, we incorporate an internal model that includes a velocity estimator and a Dreamer module. During training, the framework learns an expert policy and an internal model that are co-dependent, facilitating exploration for improved locomotion behaviors. During deployment, the Dreamer module solves an infinite-horizon MPC problem, adapting actions and velocity commands to respect the constraints. We validate the robustness of our training framework through ablation studies on internal model components and demonstrate improved robustness to training noise. Finally, we evaluate our approach across multi-terrain scenarios in both simulation and hardware.

Abstract:
Millimeter-wave (mmWave) radar offers robust sensing capabilities in diverse environments, making it a highly promising solution for human body reconstruction due to its privacy-friendly and non-intrusive nature. However, the significant sparsity of mm Wave point clouds limits the estimation accuracy. To overcome this challenge, we propose a two-stage deep learning framework that enhances mm Wave point clouds and improves human body reconstruction accuracy. Our method includes a mm Wave point cloud enhancement module that densifies the raw data by leveraging temporal features and a multi-stage completion network, followed by a 2D-3D fusion module that extracts both 2D and 3D motion features to refine SMPL parameters. The mm Wave point cloud enhancement module learns the detailed shape and posture information from 2D human masks in single-view images. However, image-based supervision is involved only during the training phase, and the inference relies solely on sparse point clouds to maintain privacy. Experiments on multiple datasets demonstrate that our approach outperforms state-of-the-art methods, with the enhanced point clouds further improving performance when integrated into existing models.

Abstract:
Coupling-Tiltable Unmanned Aerial-Aquatic Vehicles (UAAVs) have gained increasing importance, yet lack comprehensive analysis and suitable controllers. This paper analyzes the underwater motion characteristics of a self-designed UAAV, Mirs-Alioth, and designs a controller for it. The effectiveness of the controller is validated through experiments. The singularities of Mirs-Alioth are derived as Singular Thrust Tilt Angle (STTA), which serve as an essential tool for an analysis of its underwater motion characteristics. The analysis reveals several key factors for designing the controller. These include the need for logic switching, using a Nussbaum function to compensate control direction uncertainty in the auxiliary channel, and employing an auxiliary controller to mitigate coupling effects. Based on these key points, a control scheme is designed. It consists of a controller that regulates the thrust tilt angle to the singular value, an auxiliary controller incorporating a Saturated Nussbaum function, and a logic switch. Eventually, two sets of experiments are conducted to validate the effectiveness of the controller and demonstrate the necessity of the Nussbaum function.

Abstract:
This paper presents a comparative analysis of model-based control strategies for a Bionic Ankle Tensegrity Exoskeleton (BATE), designed to emulate the self-stress equilibrium and self-supporting characteristics of the human ankle biotensegrity structure. Model-based control strategies are conventional methods that can discover the principles of the BATE exoskeleton. The high dimensions and non-linearity of the BATE pose challenges for theoretical modeling and model-based control strategies. To address this, we propose a modeling method based on the force density that accounts for interaction forces. We evaluated the trajectory tracking performance and robustness of BATE under three power-assisted control methods: position control (PC), force control (FC) and hybrid force-position control (FPC). Experimental results demonstrate that the PC method offers superior performance in both trajectory tracking and robustness, making it suitable for early rehabilitation training to enhance flexibility. Our findings highlight the advantages of tensegrity exoskeletons over current wearable exoskeletons and introduce novel concepts for developing high-performance exoskeletons.

Abstract:
Autonomous high-speed flight in unknown, clut-tered environments is essential for a variety of quadrotor applications, such as inspection, search, and rescue. In this study, we propose a novel trajectory planner designed to achieve efficient, high-speed, collision-free flights in such environments. The proposed approach begins by generating a safe flight corridor based on the path found by Lazy Theta, representing the safe regions with polytopic sets. These sets are then used to define discrete-time control barrier function (DCBF), ensuring the quadrotor stays within safe bounds during flight. By selecting a single waypoint ahead of the quadrotor on the path as the next waypoint, the trajectory is optimized by considering both the total flight time and safety constraints. Extensive simulations and real-world experiments have confirmed our method's feasibility, demonstrating its capability for high-speed performance and reliable obstacle avoidance. [video44https://www.youtube.com/playlist?list=PLJFduoH7QICOhcIX3JFsZwB4IgS4_-sPt]

Abstract:
With increasing efficiency and reliability, autonomous systems are becoming valuable assistants to humans in various tasks. In the context of robot-assisted delivery, we investigate how robot performance and trust repair strategies impact human trust. In this task, while handling a secondary task, humans can choose to either send the robot to deliver autonomously or manually control it. The trust repair strategies examined include short and long explanations, apology and promise, and denial. Using data from human participants, we model human behavior using an Input-Output Hidden Markov Model (IOHMM) to capture the dynamics of trust and human action probabilities. Our findings indicate that humans are more likely to deploy the robot autonomously when their trust is high. Furthermore, state transition estimates show that long explanations are the most effective at repairing trust following a failure, while denial is most effective at preventing trust loss. We also demonstrate that the trust estimates generated by our model are isomorphic to self-reported trust values, making them interpretable. This model lays the groundwork for developing optimal policies that facilitate real-time adjustment of human trust in autonomous systems.

Abstract:
Dual-arm manipulation is an area of growing interest in the robotics community. Enabling robots to perform tasks that require the coordinated use of two arms, is essential for complex manipulation tasks such as handling large objects, assembling components, and performing human-like interactions. However, achieving effective dual-arm manipulation is challenging due to the need for precise coordination, dynamic adaptability, and the ability to manage interaction forces between the arms and the objects being manipulated. We propose a novel pipeline that combines the advantages of policy learning based on environment feedback and gradient-based optimization to learn controller gains required for the control outputs. This allows the robotic system to dynamically modulate its impedance in response to task demands, ensuring stability and dexterity in dual-arm operations. We evaluate our pipeline on a trajectory-tracking task involving a variety of large, complex objects with different masses and geometries. The performance is then compared to three other established methods for controlling dual-arm robots, demonstrating superior results. Project page: https://dualarmvil.github.io/Dual-Arm-VIL/

Abstract:
Designing planners and controllers for contact-rich manipulation is extremely challenging as contact violates the smoothness conditions that many gradient-based controller synthesis tools assume. Contact smoothing approximates a non-smooth system with a smooth one, allowing one to use these synthesis tools more effectively. However, applying classical control synthesis methods to smoothed contact dynamics remains relatively under-explored. This paper analyzes the efficacy of linear controller synthesis using differential simulators based on contact smoothing. We introduce natural baselines for leveraging contact smoothing to compute (a) open-loop plans robust to uncertain conditions and/or dynamics, and (b) feedback gains to stabilize around open-loop plans. Using robotic bimanual whole-body manipulation as a testbed, we perform extensive empirical experiments on over 300 trajectories and analyze why LQR seems insufficient for stabilizing contact-rich plans.

Abstract:
Planning a kinodynamically feasible path for a tractor-trailer vehicle is challenging for both search-based and learning-based methods due to the vehicle's unique kinematics and complex obstacles. These factors increase the likelihood of infeasible paths and exacerbate long-horizon issues. We introduce ME-PATS: a framework that mutually enhances the search-based planner and the learning-based agent for tractortrailer systems. The search-based planner provides successful trajectories to help the learning-based agent update its policy, while the agent improves the planner's efficiency through direct path simulation. Additionally, we propose two approaches to apply our framework to more challenging tasks: designing obstacle-aware networks to enhance the learning-based agents capabilities, and combining the planner's paths with the trained agent's simulated paths through multi-segment integration. Full details and results are available on our project website at https://github.com/FrankSinatral/TTsystems.

Abstract:
SLAM is a fundamental capability of unmanned systems, with LiDAR-based SLAM gaining widespread adoption due to its high precision. Current SLAM systems can achieve centimeter-level accuracy within a short period. However, there are still several challenges when dealing with largescale mapping tasks including significant storage requirements and difficulty of reusing the constructed maps. To address this, we first design an elastic and lightweight map representation called CELLmap, composed of several CELLS, each representing the local map at the corresponding location. Then, we design a general backend including CELL-based bidirectional registration module and loop closure detection module to improve global map consistency. Our experiments have demonstrated that CELLmap can represent the precise geometric structure of large-scale maps of KITTI dataset using only about 60 MB. Additionally, our general backend achieves up to a 26.88% improvement over various LiDAR odometry methods.

Abstract:
Bin-picking is a practical and challenging robotic manipulation task, where accurate 6D pose estimation plays a pivotal role. The workpieces in bin-picking are typically texture-less and randomly stacked in a bin, which poses a significant challenge to 6D pose estimation. Existing solutions are typically learning-based methods, which require object-specific training. Their efficiency of practical deployment for novel workpieces is highly limited by data collection and model retraining. Zero-shot 6D pose estimation is a potential approach to address the issue of deployment efficiency. Nevertheless, existing zero-shot 6D pose estimation methods are designed to leverage feature matching to establish point-to-point correspondences for pose estimation, which is less effective for workpieces with textureless appearances and ambiguous local regions. In this paper, we propose ZeroBP, a zero-shot pose estimation frame-work designed specifically for the bin-picking task. ZeroBP learns Position-Aware Correspondence (PAC) between the scene instance and its CAD model, leveraging both local features and global positions to resolve the mismatch issue caused by ambiguous regions with similar shapes and appearances. Extensive experiments on the ROBI dataset demonstrate that ZeroBP outperforms state-of-the-art zero-shot pose estimation methods, achieving an improvement of 9.1 % in average recall of correct poses.

Abstract:
The problem of mimicking the concertina locomotion mode of biological snakes through narrow channels of uncertain width, using a multi-link planar serpenoid robot, is considered. A novel algorithm for generating a reference trajectory that accurately reproduces this natural gait pattern is proposed and analysed for straight channels. A modification of this algorithm leverages feedback from the joints' currents and angular velocities to dynamically adjust the robot's movements within channels of unknown and varying widths. Experiments through rugged artificial channels of varying width show remarkable ability of the programmed snake robot to negotiate such terrains with agility and reasonable speed.

Abstract:
We present a model-predictive control (MPC) framework for legged robots that avoids the singularities associated with common three-parameter attitude representations like Euler angles during large-angle rotations. Our method parameterizes the robot's attitude with singularity-free unit quaternions and makes modifications to the iterative linear-quadratic regulator (iLQR) algorithm to deal with the resulting geometry. The derivation of our algorithm requires only elementary calculus and linear algebra, deliberately avoiding the abstraction and notation of Lie groups. We demonstrate the performance and computational efficiency of quaternion MPC in several experiments on quadruped and humanoid robots.

Abstract:
Collision-free planning is essential for bipedal robots operating within unstructured environments. This paper presents a real-time Model Predictive Control (MPC) frame-work that addresses both body and foot avoidance for dynamic bipedal robots. Our contribution is two-fold: we introduce (1) a novel formulation for adjusting step timing to facilitate faster body avoidance and (2) a novel 3D foot-avoidance formulation that implicitly selects swing trajectories and footholds that either steps over or navigate around obstacles with awareness of Center of Mass (COM) dynamics. We achieve body avoidance by applying a half-space relaxation of the safe region but introduce a switching heuristic based on tracking error to detect a need to change foot-timing schedules. To enable foot avoidance and viable landing footholds on all sides of foot-level obstacles, we decompose the non-convex safe region on the ground into several convex polygons and use Mixed-Integer Quadratic Programming to determine the optimal candidate. We found that introducing a soft minimum-travel-distance constraint is effective in preventing the MPC from being trapped in local minima that can stall half-space relaxation methods behind obstacles. We demonstrated the proposed algorithms on multibody simulations on the bipedal robot platforms, Cassie and Digit, as well as hardware experiments on Digit.

Abstract:
The perception of an object's surface is important for robotic applications enabling robust object manipulation. The level of accuracy in such a representation affects the outcome of the action planning, especially during tasks that require physical contact, e.g. grasping. In this paper, we propose a novel iterative method for 3D shape reconstruction consisting of two steps. At first, a mesh is fitted on data points acquired from the object's surface, based on a single primitive template. Subsequently, the mesh is properly adjusted to adequately represent local deformities. Moreover, a novel proactive tactile exploration strategy aims at minimizing the total uncertainty with the least number of contacts, while reducing the risk of contact failure in case the estimated surface differs significantly from the real one. The performance of the methodology is evaluated both in 3D simulation and on a real setup.

Abstract:
Robotic systems face challenges in performing open-set and personalized fitness evaluations, especially when adapting to new exercises and individual user needs. This paper introduces FitnessAgent, a unified agent framework designed to address these challenges. Unlike traditional systems that rely on pre-trained neural networks or fixed rule-based criteria, FitnessAgent can assess any exercise without prior training, adapting evaluation metrics based on expert knowledge and user-specific requirements. The system breaks down fitness evaluation tasks into combinations of metrics, each calculated using measurable operators such as angles, distances, and positions. By leveraging a set of primitive, exercise-agnostic operators, a large language model (LLM)-based planner dynamically selects and combines these operators for each task. The open-set capability of FitnessAgent is validated through experiments on both the widely-used Functional Movement Screen dataset and a newly collected isometric pose dataset. Results highlight the system's flexibility in handling new movements and its ability to adapt to personalized evaluation criteria without the need for code or algorithm modifications. FitnessAgent offers a scalable and personalized solution for fitness evaluation, making it well-suited for robotic applications that require adaptability to diverse user needs.

Abstract:
As environmental disasters happen more frequently and severely, seeking the source of pollutants or harmful particulates using plume tracking becomes even more important. Plume tracking on small quadrotors would allow these systems to operate around humans and fly in more confined spaces, but can be challenging due to poor sensitivity and long response times from gas sensors that fit on small quadrotors. In this work, we present an approach to complement chemical plume tracking with airflow source-seeking behavior using a custom flow sensor that can sense both airflow magnitude and direction on small quadrotors (< 100 g). We use this sensor to implement a modified version of the ‘Cast and Surge’ algorithm that takes advantage of flow direction sensing to find and navigate towards flow sources. A series of characterization experiments verified that the system can detect airflow while in flight and reorient the quadrotor toward the airflow. Several trials with random starting locations and orientations were used to show that our source-seeking algorithm can reliably find a flow source. This work aims to provide a foundation for future platforms that can use flow sensors in concert with other sensors to enable richer plume tracking data collection and source-seeking.

Abstract:
In leader-follower consensus, strong r-robustness of the communication graph provides a sufficient condition for followers to achieve consensus in the presence of misbehaving agents. Previous studies have assumed that robots can form and/or switch between predetermined network topologies with known robustness properties. However, robots with distancebased communication models may not be able to achieve these topologies while moving through spatially constrained environments, such as narrow corridors, to complete their objectives. This paper introduces a Control Barrier Function (CBF) that ensures robots maintain strong r-robustness of their communication graph above a certain threshold without maintaining any fixed topologies. Our CBF directly addresses robustness, allowing robots to have flexible reconfigurable network structure while navigating to achieve their objectives. The efficacy of our method is tested through various simulation and hardware experiments [code] https://github.com/joonlee16/Resilient-Leader-Follower-CBF-QP.

Abstract:
Point cloud completion is crucial for reconstructing accurate shapes in many 3D visual applications. Recent approaches incorporate images into the completion pipeline, introducing geometric clues and global constraints. However, their fusion processes often fail to reconstruct detailed parts and maintain global consistency simultaneously. Except for images, text is another important clue for recognizing the target's characteristics. Thus, in this work, we propose to combine multiple modalities including points, images and texts for point cloud completion. Specifically, inspired by recently pre-trained large language models, we generate the description texts for images by Visual Question Answering (VQA) models and introduce Visual-Textual Embedding (VTE) models to extract joint features of image-text pairs. Furthermore, we describe the edge geometric patterns by multi-scale edge convolution to guide the refinement of shapes in local areas. Then we adopt cross attention mechanism to effectively fuse multi-modal features and refine the coarse shape. Extensive experiments on commonly used benchmarks demonstrate our method's superior performance over previous uni-modal and cross-modal methods.

Abstract:
The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges: category-level ambiguity and instance-level complexity. Category-level ambiguity arises from representing objects of fine-grained categories in a highly sparse point cloud format, making category distinction challenging. Instance-level complexity stems from multiple instances of the same category coexisting in a scene, leading to distractions during grounding. To address these challenges, we propose a novel weaklysupervised grounding approach that explicitly differentiates between categories and instances. In the category-level branch, we utilize extensive category knowledge from a pre-trained external detector to align object proposal features with sentencelevel category features, thereby enhancing category awareness. In the instance-level branch, we utilize spatial relationship descriptions from language queries to refine object proposal features, ensuring clear differentiation among objects. These designs enable our model to accurately identify target-category objects while distinguishing instances within the same category. Compared to previous methods, our approach achieves state-of-the-art performance on three widely used benchmarks: Nr3D, Sr3D, and ScanRef.

Abstract:
The safety-critical nature of autonomous vehicle (AV) operation necessitates development of task-relevant algorithms that can reason about safety at the system level and not just at the component level. To reason about the impact of a perception failure on the entire system performance, such task-relevant algorithms must contend with various challenges: complexity of AV stacks, high uncertainty in the operating environments, and the need for real-time performance. To overcome these challenges, in this work, we introduce a Q-network called SPARQ (abbreviation for Safety evaluation for Perception And Recovery Q-network) that evaluates the safety of a plan generated by a planning algorithm, accounting for perception failures that the planning process may have over-looked. This Q-network can be queried during system runtime to assess whether a proposed plan is safe for execution or poses potential safety risks. If a violation is detected, the network can then recommend a corrective plan while accounting for the perceptual failure. We validate our algorithm using the NuPlan-Vegas dataset, demonstrating its ability to handle cases where a perception failure compromises a proposed plan while the corrective plan remains safe. We observe an overall accuracy and recall of 90% while sustaining a frequency of 42Hz on the unseen testing dataset. We compare our performance to a popular reachability - based baseline and analyze some interesting properties of our approach in improving the safety properties of an AV pipeline. Website: vatsuak.github.io/sparq

Abstract:
When agents must execute multiple tasks at spatially distinct locations, it is common to formulate and solve a Traveling Salesman Problem (TSP) to find the order of locations (targets) that requires the smallest travel cost. Approaching such task sequencing problems as a TSP is restrictive, as it requires that unique locations be specified for each task. In reality a set of acceptable locations might be available. The Close Enough Traveling Salesman Problem (CETSP) is a generalization of the Traveling Salesman Problem in which the agent needs only visit a spherical neighborhood surrounding each target, and can thus address this task sequencing problem when any location in a sphere is acceptable. Prior work has developed a branch-and-bound approach that finds globally optimal solutions to instances of the CETSP by solving a sequence of Second-Order Cone Programs (SOCP). We demonstrate it is possible to eliminate 2 / 3 of the variables and 1 / 2 of the constraints in these SOCPs, show how to reuse computation and memory allocation across multiple SOCPs in the sequence, and propose a strategy to warm-start the SOCPs using solutions obtained earlier in the sequence. Collectively, these three changes halve the time required to solve 210 random CETSP instances to optimality. We also obtained improved lower bounds on 73 instances from the literature, including solving one instance to optimality for the first time.

Abstract:
The multi-drone-truck collaborative delivery, where unmanned trucks serve as mobile supply stations for drones, effectively combines the strengths of both vehicles and presents wide application prospects. But the majority of existing literature restricts drone launch and retrieve operations (LARO) to stationary trucks, and potential drone route collisions are mostly ignored. This leads to inability to fully exploit the capability of drones. We address these gaps and introduce a new variant of multi-drone-truck collaborative delivery. However, the scheduling for drones and truck faces high-dimensional solution space and complex constraints, making it almost impossible for centralized solving. To this end, we develop a hierarchical solution framework that decomposes the complete problem into two levels of subproblem. The upper solver centrally allocates tasks and schedules when drones to launch, while the lower solver, based on multi-agent reinforcement learning (MARL), plans paths for each drone agent in a decentralized but cooperative manner. In addition, we validate the effectiveness of our method by benchmarking it against three state-of-the-art approaches, demonstrating its superiority in terms of both efficiency and collision avoidance.

Abstract:
In agricultural automation, inherent occlusion presents a major challenge for robotic harvesting. We propose a novel imitation learning-based viewpoint planning approach to actively adjust camera viewpoint and capture unobstructed images of the target crop. Traditional viewpoint planners and existing learning-based methods, depend on manually designed evaluation metrics or reward functions, often struggle to generalize to complex, unseen scenarios. Our method employs the Action Chunking with Transformer (ACT) algorithm to learn effective camera motion policies from expert demonstrations. This enables continuous six-degree-of-freedom (6-DoF) viewpoint adjustments that are smoother, more precise and reveal occluded targets. Extensive experiments in both simulated and realworld environments, featuring agricultural scenarios and a 6 DoF robot arm equipped with an RGB-D camera, demonstrate our method's superior success rate and efficiency, especially in complex occlusion conditions, as well as its ability to generalize across different crops without reprogramming. This study advances robotic harvesting by providing a practical “learn from demonstration” (LfD) solution to occlusion challenges, ultimately enhancing autonomous harvesting performance and productivity.

Abstract:
Even if the depth maps captured by RGB-D sensors deployed in real environments are often characterized by large areas missing valid depth measurements, the vast majority of depth completion methods still assumes depth values covering all areas of the scene. To address this limitation, we introduce SteeredMarigold, a training-free, zero-shot depth completion method capable of producing metric dense depth, even for largely incomplete depth maps. SteeredMarigold achieves this by using the available sparse depth points as conditions to steer a denoising diffusion probabilistic model. Our method outperforms relevant top-performing methods on the NYUv2 dataset, in tests where no depth was provided for a large area, achieving state-of-art performance and exhibiting remarkable robustness against depth map incompleteness. Our source code is publicly available at https://steeredmarigold.github.io.

Abstract:
Mobile manipulators enable a wide range of operations with mobility and advanced manipulation capabilities. Despite their potential, existing approaches typically treat the mobile base and the manipulator separately, thereby limiting the optimality of the system for composite whole-body behaviors. In this work, we present a Whole-Body Model Predictive Control framework for mobile manipulation involving tasks with varying timelines. We integrate task priorities across both task and time dimensions, bringing inherent transition ability with enhanced performance. Our approach improves the trajectory tracking performance by up to 36 % in terms of manipulability and reduces the maximum velocity during task priority transitions by 53% compared to the existing approach while maintaining a low computational cost of 4.3 ms, allowing for high reactivity in real-world applications. We demonstrate its effectiveness through a door-opening and traversing behavior, showcasing the first successful implementation of a non-holonomic mobile manipulator in such a scenario. See https://wbmpc.github.io/ for supplemental materials.

Abstract:
Unmanned planetary rovers have traversed kilometers of Lunar and Martian terrain while performing valuable science. However, they still face mobility challenges including steep slopes and unstable soil that can entrap vehicles, as demonstrated by NASA's Spirit rover. Vehicles on Earth can depend on a human operator or rescue vehicle to tow them out of an entrapment, but remote rovers cannot, limiting their route to highly conservative path selections. To increase rover mobility on slopes and unstable soils, we present a resettable anchor launcher for independent self-rescue. The device launches a tethered land anchor away from the rover and then uses a winch to tow the rover up a hill or out of an entrapment. This paper presents the design of the launcher and its integration into a half-meter-long rover mobility platform with field testing at the NASA Glenn Research Center SLOPE Lab. We demonstrate repeatable launching and winching to help the rover climb a 17° slope of loose GRC-1 Lunar regolith simulant that it otherwise could not climb. Our work presents an alternative method to increase rover mobility, especially up slopes, and enables independent rover rescue, which could eventually increase mission duration and reduce risk of entrapment during extraterrestrial exploration.

Abstract:
The segmentation of road markings plays a crucial role in visual perception for the autonomous driving system. It enables vehicles to recognize road markings at the pixel-level, and facilitates subsequent path planning, localization, and map construction tasks. Current techniques mainly focus on normal driving scenes (i.e., clear daytime), and the performance would decrease significantly for adverse weather conditions. This work proposes RMSeg-UDA: an unsupervised domain adaptive road marking segmentation framework. By combining schedule self-training and class-conditioned adversarial training, the network utilizes both labeled normal data and unlabeled data from other domains to train a road marking segmentation model. For the evaluation on adverse conditions, a new image dataset, RLMDAC, is established with rainy and nighttime driving scenes. The experiments conducted using both public and our datasets have demonstrated the effectiveness of the proposed technique. Code and dataset are available at https://github.com/stu9113611/RMSeg-UDA.

Abstract:
Point cloud semantic segmentation is crucial in various applications such as autonomous driving, robotics, and virtual reality, aiming to assign labels to each point in a cloud to reflect spatial relationships and boundaries. While previous methods primarily focus on geometric features, they often overlook the auxiliary role of color information, especially in scenes where geometric structures are less distinct. In this paper, we propose the Color Point Cloud Enhancement (CPCE) method to effectively leverage color information for improved 3D scene understanding. CPCE introduces a color information enhancement module with multi-scale consistency, enriching point features throughout the encoder stages. Additionally, we develop a novel contrastive learning module that uses relative color coordinates for point cloud serialization, allowing for the capture of positive and negative samples from distant points with similar color textures. Furthermore, we design a contrastive learning module tailored for scenes with weak geometric structures, enhancing feature representation through color-augmented contrast. Our method achieved a 78.1% mIoU on the ScanNet dataset, outperforming existing models trained on a single dataset. These results highlight the effectiveness of CPCE in scenarios where traditional methods struggle, particularly in enhancing segmentation accuracy by utilizing color as a critical feature.

Abstract:
Defect diagnosis in urban infrastructure is crucial for public safety. Traditional manual inspections face significant challenges in terms of accuracy and cost-effectiveness. In this paper, we propose a lightweight and hardware-friendly large-scale infrastructure detector, CUPID, highly suitable for unmanned aerial vehicles (UAVs). Given the significant challenges in automatically detecting defects of varying intensity and size within complex infrastructure, along with the tendency of lightweight models to lose detail and fail to fully capture features during the defect extraction process, we propose the CUPID_Block, a multi-level information fusion block to construct the backbone, featuring the CUPID_Conv module equipped with our proposed CCA (CrissCross Attention). Furthermore, CUPID features an auxiliary training branch that assimilates lower feature maps, helping to recover details lost in deeper convolutional layers. To verify the effectiveness of CUPID and to address the lack of a suitable dataset in the community, we establish a multi-scenario infrastructure defect dataset, CUBIT2024, to conduct extensive experiments. Finally, to assess the efficiency and adaptability of CUPID in UAV for online infrastructure inspection, we design a compact autonomous drone, CU-Astro, where the proposed CUPID is deployed on the Jetson Orin NX computer onboard to evaluate the speed and power consumption of the inference.

Abstract:
Open-vocabulary detectors are proposed to locate and recognize objects in novel classes. However, variations in vision-aware language vocabulary data used for open-vocabulary learning can lead to unfair and unreliable evaluations. Recent evaluation methods have attempted to address this issue by incorporating object properties or adding locations and characteristics to the captions. Nevertheless, since these properties and locations depend on the specific details of the images instead of classes, detectors can not make accurate predictions without precise descriptions provided through human annotation. This paper introduces 3F-OVD, a novel task that extends supervised fine-grained object detection to the open-vocabulary setting. Our task is intuitive and challenging, requiring a deep understanding of Fine-grained captions and careful attention to Fine-grained details in images in order to accurately detect Fine-grained objects. Additionally, due to the scarcity of qualified fine-grained object detection datasets, we have created a new dataset, NEU-171K, tailored for both supervised and open-vocabulary settings. We benchmark state-of-the-art object detectors on our dataset for both settings. Furthermore, we propose a simple yet effective post-processing technique. Our data, annotations and codes are available at https://github.com/tengerye/3FOVD.

Abstract:
Vision sensors are extensively used for localizing a robot's pose, particularly in environments where global localization tools such as GPS or motion capture systems are unavailable. In many visual navigation systems, localization is achieved by detecting and tracking visual features or landmarks, which provide information about the sensor's relative pose. For reliable feature tracking and accurate pose estimation, it is crucial to maintain visibility of a sufficient number of features. This requirement can sometimes conflict with the robot's overall task objective. In this paper, we approach it as a constrained control problem. By leveraging the invariance properties of visibility constraints within the robot's kinematic model, we propose a real-time safety filter based on quadratic programming. This filter takes a reference velocity command as input and produces a modified velocity that minimally deviates from the reference while ensuring the information score from the currently visible features remains above a user-specified threshold. Numerical simulations demonstrate that the proposed safety filter preserves the invariance condition and ensures the visibility of more features than the required minimum. We also validated its real-world performance by integrating it into a visual simultaneous localization and mapping (SLAM) algorithm, where it maintained high estimation quality in challenging environments, outperforming a simple tracking controller.

Abstract:
Improving the performance of motion planning algorithms for high-degree-of-freedom robots usually requires reducing the cost or frequency of computationally expensive operations. Traditionally, and especially for asymptotically optimal sampling-based motion planners, the most expensive operations are local motion validation and querying the nearest neighbours of a configuration. Recent advances have significantly reduced the cost of motion validation by using single instruction/multiple data (SIMD) parallelism to improve solution times for satisficing motion planning problems. These advances have not yet been applied to asymptotically optimal motion planning. This paper presents Fully Connected Informed Trees (FCIT), the first fully connected, informed, anytime almost-surely asymptotically optimal (ASAO) algorithm. FCIT exploits the radically reduced cost of edge evaluation via SIMD parallelism to build and search fully connected graphs. This removes the need for nearest-neighbours structures, which are a dominant cost for many sampling-based motion planners, and allows it to find initial solutions faster than state-of-the-art ASAO (VAMP, OMPL) and satisficing (OMPL) algorithms on the MotionBenchMaker dataset while converging towards optimal plans in an anytime manner.

Abstract:
Learning policies in simulation and transferring them to the real world has become a promising approach in dexterous manipulation. However, bridging the sim-to-real gap for each new task requires substantial human effort, such as careful reward engineering, hyperparameter tuning, and system identification. In this work, we present a system that leverages low-level skills to address these challenges for more complex tasks. Specifically, we introduce a hierarchical policy for in-hand object reorientation based on previously acquired rotation skills. This hierarchical policy learns to select which low-level skill to execute based on feedback from both the environment and the low-level skill policies themselves. Compared to learning from scratch, the hierarchical policy is more robust to out-of-distribution changes and transfers easily from simulation to real-world environments. Additionally, we propose a generalizable object pose estimator that uses proprioceptive information, low-level skill predictions, and control errors as inputs to estimate the object's pose over time. We demonstrate that our system can reorient objects, including symmetrical and textureless ones, to a desired pose.

Abstract:
Robots can influence people to accomplish their tasks more efficiently: autonomous cars can inch forward at an intersection to pass through, and tabletop manipulators can go for an object on the table first. However, a robot's ability to influence can also compromise the physical safety of nearby people if naively executed. In this work, we pose and solve a novel robust reach-avoid dynamic game which enables robots to be maximally influential, but only when a safety backup control exists. On the human side, we model the human's behavior as goal-driven but conditioned on the robot's plan, enabling us to capture influence. On the robot side, we solve the dynamic game in the joint physical and belief space, enabling the robot to reason about how its uncertainty in human behavior will evolve over time. We instantiate our method, called SLIDE (Safely Leveraging Influence in Dynamic Environments), in a high-dimensional (39-D) simulated human-robot collaborative manipulation task solved via offline game-theoretic reinforcement learning. We compare our approach to a robust baseline that treats the human as a worst-case adversary, a safety controller that does not explicitly reason about influence, and an energy-function-based safety shield. We find that SLIDE consistently enables the robot to leverage the influence it has on the human when it is safe to do so, ultimately allowing the robot to be less conservative while still ensuring a high safety rate during task execution. Project website: https://cmu-intentlab.github.io/safe-influence/

Abstract:
Ensuring safety is crucial to promote the application of robot manipulators in open workspaces. Factors such as sensor errors or unpredictable collisions make the environment full of uncertainties. In this work, we investigate these potential safety challenges on redundant robot manipulators, and propose a taskoriented planning and control framework to achieve multi-layered safety while maintaining efficient task execution. Our approach consists of two main parts: a task-oriented trajectory planner based on multiple-shooting model predictive control (MPC) method, and a torque controller that allows safe and efficient collision reaction using only proprioceptive data. Through extensive simulations and real-hardware experiments, we demonstrate that the proposed framework11Code is available at https://github.com/jia-xinyu/arm-safety. can effectively handle uncertain static or dynamic obstacles, and perform disturbance resistance in manipulation tasks when unforeseen contacts occur.

Abstract:
The language-guided robot grasping task requires a robot agent to integrate multimodal information from both visual and linguistic inputs to predict actions for target-driven grasping. While recent approaches utilizing Multimodal Large Language Models (MLLMs) have shown promising results, their extensive computation and data demands limit the feasibility of local deployment and customization. To address this, we propose a novel CLIP-based [1] multimodal parameter-efficient tuning (PET) framework designed for three language-guided object grounding and grasping tasks: (1) Referring Expression Segmentation (RES), (2) Referring Grasp Synthesis (RGS), and (3) Referring Grasp Affordance (RGA). Our approach introduces two key innovations: a bi-directional vision-language adapter that aligns multimodal inputs for pixel-level language understanding and a depth fusion branch that incorporates geometric cues to facilitate robot grasping predictions. Experiment results demonstrate superior performance in the RES object grounding task compared with existing CLIP-based full-model tuning or PET approaches. In the RGS and RGA tasks, our model not only effectively interprets object attributes based on simple language descriptions but also shows strong potential for comprehending complex spatial reasoning scenarios, such as multiple identical objects present in the workspace. Project page: https://z.umn.edu/etog-etrg

Abstract:
In this paper, we introduce time-correlated model predictive path integral (TC-MPPI), a novel approach to mitigate action noise in sampling-based control methods. Unlike conventional smoothing techniques that rely on post-processing or additional state variables, TC-MPPI directly incorporates temporal correlation of actions into stochastic optimal control, effectively enforcing quadratic costs on action derivatives. This reformulation enables us to generate smooth action sequences without extra modifications, using a time-correlated and conditional Gaussian sampling distribution. We demonstrate the effectiveness of our approach through simulations on various robotic platforms, including a pendulum, cart-pole, 2D bicopter, 3D quadcopter, and autonomous vehicle. Simulation videos are available at https://youtu.be/nWfJ2MAV2JI.

Abstract:
Multi-axle autonomous mobile robots (AMRs) are set to revolutionize the future of robotics in logistics. As the backbone of next-generation solutions, these robots face a critical challenge: managing and minimizing swept volume during turns while maintaining precise control. Traditional systems designed for standard vehicles often struggle with the complex dynamics of multi-axle configurations, leading to inefficiency and increased safety risk in confined spaces. Our innovative framework overcomes these limitations by combining swept volume minimization with Signed Distance Field (SDF) path planning and model predictive control (MPC) for independent wheel steering. This approach not only plans paths with an awareness of the swept volume, but actively minimizes it in real-time, allowing each axle to follow a precise trajectory while significantly reducing the space the vehicle occupies. By predicting future states and adjusting the turning radius of each wheel, our method enhances both maneuverability and safety, even in the most constrained environments. Unlike previous works, our solution goes beyond basic path calculation and tracking, offering real-time path optimization with minimal swept volume and efficient individual axle control. To our knowledge, this is the first comprehensive approach to tackle these challenges, delivering life-saving improvements in control, efficiency, and safety for multi-axle AMRs. Furthermore, we will open-source our work to foster collaboration and enable others to advance safer and more efficient autonomous systems.

Abstract:
The Robotic Space Simulator was developed as a physical simulation for in-space manipulation tasks. It incorporates external inputs to its dynamics simulation via force/torque sensors mounted to the 2 6-DoF Stewart platforms which compose its primary structure. Each platform is augmented with an additional degree of freedom in the form of an auxiliary axis - one in translation and one in rotation. Previous work has not effectively included the additional workspace provided by these auxiliary axes. Additionally, it limited the use of external force/torque inputs to the case of platform translation only because the external forces/torques due to platform motion and gravitational force were not removed from the sensor inputs prior to inclusion in the dynamic simulation. In this work, we address each of these limitations. We develop and test two methods of auxiliary axis control: Cartesian Workspace and Joint Cost-Function, and find that both methods are an improvement over the existing system. Additionally we develop and test a method for calculating the mass properties of hardware mounted to the force/torque sensors and a dynamics compensation method for this hardware. Using this technique we are able to effectively compensate for gravitational force in different platform orientations, and achieve zero-g behavior of the system.

Abstract:
Various forms of biohybrid robots have been developed; however, creating robots with multiple degrees of freedom remains a challenging task. In this paper, we developed a multi-joint biohybrid robot by using skeletal muscle tissue. To achieve this, we first developed a modular bio-actuator actuated by skeletal muscle tissues. The objective of this study was to enhance the contraction force of the actuator and establish optimal experimental conditions for creating high-performance robots. By applying continuous electrical stimulation for five days during culture of bio-actuator, we were able to increase the contraction force by more than threefold. Additionally, we determined the appropriate electric field based on the electrode distance, which enabled us to establish an optimal experimental setup. We also confirmed that connecting the actuators in series can significantly increase the moving distance. Connecting two actuators in series resulted in a total movement distance equivalent to the sum of the distances of each actuator. This finding suggests the potential to create robots with a larger operational workspace. Using these actuators, we first constructed a manipulator with a rotational joint. This research is expected to contribute not only to the development of various robots utilizing bio-actuators but also to advancements in biology technology.

Abstract:
Fish use their lateral lines to sense flows and pressure gradients, enabling them to detect nearby objects and organisms. Towards replicating this capability, we demonstrated successful leader-follower formation swimming using flow pressure sensing in our undulatory robotic fish (\mu Bot/MUBot). The follower \mu Bot is equipped at its head with bilateral pressure sensors to detect signals excited by both its own and the leader's movements. First, using experiments with static formations between an undulating leader and a stationary follower, we determined the formation that resulted in strong pressure variations measured by the follower. This formation was then selected as the desired formation in free swimming for obtaining an expert policy. Next, a long short-term memory neural network was used as the control policy that maps the pressure signals along with the robot motor commands and the Euler angles (measured by the onboard IMU) to the steering command. The policy was trained to imitate the expert policy using behavior cloning and Dataset Aggregation (DAgger). The results show that with merely two bilateral pressure sensors and less than one hour of training data, the follower effectively tracked the leader within distances of up to 200 \textmm(=1 body length) while swimming at speeds of 155 \textmm / \mathrms(=0.8 body lengths/s). This work highlights the potential of fish-inspired robots to effectively navigate fluid environments and achieve formation swimming through the use of flow pressure feedback. Video—https://youtu.be/DIDYGi9Td0I

Abstract:
This paper presents a system for enabling real-time synthesis of whole-body locomotion and manipulation policies for real-world legged robots. Motivated by recent advancements in robot simulation, we leverage the efficient parallelization capabilities of the MuJoCo simulator on a multi-core CPU to achieve fast sampling over the robot state and action trajectories. Our results show surprisingly effective real-world locomotion and manipulation capabilities with a very simple control strategy. We demonstrate our approach on several hardware and simulation experiments: robust locomotion over flat and uneven terrains, climbing over a box whose height is comparable to the robot, and pushing a box to a goal position. To our knowledge, this is the first successful deployment of whole-body sampling-based MPC on real-world legged robot hardware. Experiment videos and code can be found at: whole-body-mppi.github.io.

Abstract:
Safe navigation in real-time is an essential task for humanoid robots in real-world deployment. Since humanoid robots are inherently underactuated thanks to unilateral ground contacts, a path is considered safe if it is obstacle-free and respects the robot's physical limitations and underlying dynamics. Existing approaches often decouple path planning from gait control due to the significant computational challenge caused by the full-order robot dynamics. In this work, we develop a unified, safe path and gait planning framework that can be evaluated online in real-time, allowing the robot to navigate clustered environments while sustaining stable locomotion. Our approach uses the popular Linear Inverted Pendulum (LIP) model as a template model to represent walking dynamics. It incorporates heading angles in the model to evaluate kinematic constraints essential for physically feasible gaits properly. In addition, we leverage discrete control barrier functions (DCBF) for obstacle avoidance, ensuring that the subsequent foot placement provides a safe navigation path within clustered environments. To guarantee real-time computation, we use a novel approximation of the DCBF to produce linear DCBF (LDCBF) constraints. We validate the proposed approach in simulation using a Digit robot in randomly generated environments. The results demonstrate that our approach can generate safe gaits for a nontrivial humanoid robot to navigate environments with randomly generated obstacles in real-time.

Abstract:
Decentralized safe control plays an important role in multi-agent systems given the scalability and robustness without reliance on a central authority. However, without an explicit global coordinator, the decentralized control methods are often prone to deadlock - a state where the system reaches equilibrium, causing the robots to stall. In this paper, we propose a generalized decentralized framework that unifies the Control Lyapunov Function (CLF) and Control Barrier Function (CBF) to facilitate efficient task execution and ensure deadlock-free trajectories for the multi-agent systems. As the agents approach the deadlock-related undesirable equilibrium, the framework can detect the equilibrium and drive agents away before that happens. This is achieved by a secondary deadlock resolution design with an auxiliary CBF to prevent the multi-agent systems from converging to the undesirable equilibrium. To avoid dominating effects due to the deadlock resolution over the original task-related controllers, a deadlock indicator function using CBF-inspired risk measurement is proposed and encoded in the unified framework for the agents to adaptively determine when to activate the deadlock resolution. This allows the agents to follow their original control tasks and seamlessly unlock or deactivate deadlock resolution as necessary, effectively improving task efficiency. We demonstrate the effectiveness of the proposed method through theoretical analysis, numerical simulations, and real-world experiments.

Abstract:
Few-shot adaptation is an important capability for intelligent robots that perform tasks in open-world settings such as everyday environments or flexible production. In this paper, we propose a novel approach for non-prehensile manipulation which incrementally adapts a physics-based dynamics model for model-predictive control (MPC). The model prediction is aligned with a few examples of robot-object interactions collected with the MPC. This is achieved by using a parallelizable rigid-body physics simulation as dynamic world model and sampling-based optimization of the model parameters. In turn, the optimized dynamics model can be used for MPC using efficient sampling-based optimization. We evaluate our fewshot adaptation approach in object pushing experiments in simulation and with a real robot.

Abstract:
This paper studies point cloud perception within outdoor environments. Existing methods face limitations in recognizing objects located at a distance or occluded, due to the sparse nature of outdoor point clouds. In this work, we observe a significant mitigation of this problem by accumulating multiple temporally consecutive LiDAR sweeps, resulting in a remarkable improvement in perception accuracy. However, the computation cost also increases, hindering previous approaches from utilizing a large number of LiDAR sweeps. To tackle this challenge, we find that a considerable portion of points in the accumulated point cloud is redundant, and discarding these points has minimal impact on perception accuracy. We introduce a simple yet effective Gumbel Spatial Pruning (GSP) layer that dynamically prunes points based on a learned end-toend sampling. The GSP layer is decoupled from other network components and thus can be seamlessly integrated into existing point cloud network architectures. Extensive experiments show that our pruning strategy improves several perception algorithms in multiple tasks.

Abstract:
Camera-LiDAR 3D object detection is currently becoming a crucial component in the field of autonomous driving perception. However, previous models only performed feature fusion in the deep-level BEV hierarchy when dealing with camera-LiDAR feature fusion. This approach lacks interaction with the shallow-level sensor features, which is beneficial in constructing the corresponding BEV features. However, a simple shallow-level feature interaction can introduce sensor noise caused by intrinsic and extrinsic camera calibration errors. To address this, we propose RoBiFusion, a novel camera-LiDAR 3D object detection framework designed for effective sensor feature interaction and mitigating sensor noise interference. This framework consists of three submodules: the Camera-LiDAR Feature Matching module, the LiDAR-to-Camera module, and the Camera-to-LiDAR module. Firstly, in the Camera-LiDAR Feature Matching module, we use the cross-attention module to dynamically match the camera features and the LiDAR features, which solves the problem of feature inconsistency caused by noise in the camera's intrinsic and extrinsic parameters. Secondly, in the LiDAR-to-Camera module, we propose a novel depth representation that can effectively mitigate LiDAR noise interference. Thirdly, in the Camera-to-LiDAR module, we introduce deformable attention to help LiDAR feature capture instance-level semantic features. Additionally, we design a novel differentiable and efficient grid sample module to accelerate the process since the bilinear grid sample module in deformable attention is time-consuming and not deployment-friendly. We compared RoBiFusion to the state-of-the-art BEVFusion on the nuScenes dataset and found that RoBiFusion surpasses BEVFusion by 1.5% mAP and 2.4% NDS. Furthermore, we designed a series of ablation experiments to verify the effectiveness of the aforementioned modules.

Affiliations: Department of Electronic Engineering, Chinese University of Hong Kong, Hong Kong, SAR, China; Centre for Artificial Intelligence and Robotics (CAIR), Chinese Academy of Sciences, Hong Kong Institute of Science & Innovation, Hong Kong, SAR, China; Department of Electrical Engineering, The City University of Hong Kong, Hong Kong, SAR, China; Chinese Academy of Sciences, Institute of Automation, Beijing, China; The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China

Abstract:
Automated diagnostic systems (ADS) have shown significant potential in the early detection of polyps during endoscopic examinations, thereby reducing the incidence of colorectal cancer. However, due to high annotation costs and strict privacy concerns, acquiring high-quality endoscopic images poses a considerable challenge in the development of ADS. Despite recent advancements in generating synthetic images for dataset expansion, existing endoscopic image generation algorithms failed to accurately generate the details of polyp boundary regions and typically required medical priors to specify plausible locations and shapes of polyps, which limited the realism and diversity of the generated images. To address these limitations, we present Polyp-Gen, the first full-automatic diffusion-based endoscopic image generation framework. Specifically, we devise a spatial-aware diffusion training scheme with a lesion-guided loss to enhance the structural context of polyp boundary regions. Moreover, to capture medical priors for the localization of potential polyp areas, we introduce a hierarchical retrieval-based sampling strategy to match similar fine-grained spatial features. In this way, our Polyp-Gen can generate realistic and diverse endoscopic images for building reliable ADS. Extensive experiments demonstrate the state-of-the-art generation quality, and the synthetic images can improve the downstream polyp detection task. Additionally, our Polyp-Gen has shown remarkable zeroshot generalizability on other datasets. The source code is available at https://github.com/CUHK-AIM-Group/Polyp-Gen.

Abstract:
Humans effortlessly integrate common-sense knowledge with sensory input from vision and touch to understand their surroundings. Emulating this capability, we introduce FusionSense, a novel 3D reconstruction framework that enables robots to fuse priors from foundation models with highly sparse observations from vision and tactile sensors. FusionSense addresses three key challenges: (i) How can robots efficiently acquire robust global shape information about the surrounding scene and objects? (ii) How can robots strategically select touch points on the object using geometric and commonsense priors? (iii) How can partial observations such as tactile signals improve the overall representation of the object? Our framework employs 3D Gaussian Splatting as a core representation and incorporates a hierarchical optimization strategy involving global structure construction, object visual hull pruning and local geometric constraints. This advancement results in fast and robust perception in environments with traditionally challenging objects that are transparent, reflective, or dark, enabling more downstream manipulation or navigation tasks. Experiments on real-world data suggest that our framework outperforms previously state-of-the-art sparse-view methods. All code and data are open-sourced on the project website.

Abstract:
Cooperative Simultaneous Localization and Mapping (C-SLAM) enables multiple agents to work together in mapping unknown environments while simultaneously estimating their own positions. This approach enhances robustness, scalability, and accuracy by sharing information between agents, reducing drift, and enabling collective exploration of larger areas. In this paper, we present Decentralized Visual Monocular SLAM (DVM-SLAM), the first open-source decentralized monocular C-SLAM system. By only utilizing low-cost and light-weight monocular vision sensors, our system is well suited for small robots and micro aerial vehicles (MAVs). DVMSLAM's real-world applicability is validated on physical robots with a custom collision avoidance framework, showcasing its potential in real-time multi-agent autonomous navigation scenarios. We also demonstrate comparable accuracy to state-of-the-art centralized monocular C-SLAM systems. We opensource our code and provide supplementary material online ^1.

Abstract:
Most birds in nature rely on jumping for takeoff. Flapping-Wing Robots can flap and fly like birds but require an operator to take off, which are unable to generate sufficient lift to maintain flight at a low airspeed and must accelerate to take-off speed in a short time. It poses a challenge for the design of the jumping mechanism. This study is inspired by the jump-takeoff of birds and designs a simple and lightweight jumping leg, which is capable of storing and releasing energy with only one degree of freedom. In addition, a prototype was developed and tested, with a wingspan of 2 meters and a mass of 1.6 kilograms, accelerating to 4 m/s in 52 ms by jumping, achieving the jumping take-off from the ground.

Abstract:
Drag-based swimming using rowing appendages, fins, and webbed feet is a widely adopted mode of locomotion in aquatic animals. To develop efficient underwater and swimming vehicles, various bioinspired drag-based paddle designs have been proposed, often facing a trade-off between propulsive efficiency and versatility. Webbed feet generate effective propulsive force during the power phase, while being lightweight, robust, and partially foldable during the recovery phase. However, the time-consuming process of mechanically folding and unfolding webbed feet extends the transition periods between the recovery and power phases, which in turn increased drag, and reduces overall paddling efficiency. In this study, we draw inspiration from the coupling tendons of aquatic birds. We implement tendon coupling mechanisms to minimize the transition time between the recovery and power phases. Hardware experiments demonstrate that our proposed mechanism improves propulsive efficiency by factors of \mathbf2. 0 and \mathbf2. 4 compared to designs without extensor tendons and based on passive paddles, respectively. Additionally, we find that distal leg joint clutching-previously shown to enhance efficiency in terrestrial walking-plays a negligible role in swimming locomotion. In sum, we present a novel principle for efficient drag-based leg and paddle design, with implications for understanding the swimming mechanics of aquatic birds and advancing bioinspired aquatic propulsion systems.

Abstract:
Deformable linear objects (DLOs), such as ropes, cables, and rods, are common in various scenarios, and accurate occlusion reconstruction of them is crucial for effective robotic manipulation. Previous studies for DLO reconstruction either rely on supervised learning, which is limited by the availability of labeled real-world data, or geometric approaches, which fail to capture global features and often struggle with occlusions and complex shapes. This paper presents a novel DLO occlusion reconstruction framework that integrates self-supervised point cloud completion with traditional techniques like clustering, sorting, and fitting to generate ordered key points. A memory module is proposed to enhance the self-supervised training process by consolidating prototype information, while DLO shape constraints are utilized to improve reconstruction accuracy. Experimental results on both synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art algorithms, particularly in scenarios involving complex occlusions and intricate self-intersections.

Abstract:
Tactile sensing is vital for human dexterous manipulation, however, it has not been widely used in robotics. Compact, low-cost sensing platforms can facilitate a change, but unlike their popular optical counterparts, they are difficult to deploy in high-fidelity tasks due to their low signal dimensionality and lack of a simulation model. To overcome these challenges, we introduce PseudoTouch which links high-dimensional structural information to low-dimensional sensor signals. It does so by learning a low-dimensional visual-tactile embedding, wherein we encode a depth patch from which we decode the tactile signal. We collect and train PseudoTouch on a dataset comprising aligned tactile and visual data pairs obtained through random touching of eight basic geometric shapes. We demonstrate the utility of our trained PseudoTouch model in two downstream tasks: object recognition and grasp stability prediction. In the object recognition task, we evaluate the learned embedding's performance on a set of five basic geometric shapes and five household objects. Using PseudoTouch, we achieve an object recognition accuracy 84% after just ten touches, surpassing a proprioception baseline. For the grasp stability task, we use ACRONYM labels to train and evaluate a grasp success predictor using PseudoTouch's predictions derived from virtual depth information. Our approach yields a 32% absolute improvement in accuracy compared to the baseline relying on partial point cloud data. We make the data, code, and trained models publicly available at https://pseudotouch.cs.uni-freiburg.de.

Abstract:
This paper presents a set of simple and intuitive robot collision detection algorithms that show substantial scaling improvements for high geometric complexity and large numbers of collision queries by leveraging hardware-accelerated ray tracing on GPUs. It is the first leveraging hardware-accelerated ray-tracing for direct volume mesh-to-mesh discrete collision detection and applying it to continuous collision detection. We introduce two methods: Ray-Traced Discrete-Pose Collision Detection for exact robot mesh to obstacle mesh collision detection, and Ray-Traced Continuous Collision Detection for robot sphere representation to obstacle mesh swept collision detection, using piecewise-linear or quadratic B-splines. For robot link meshes totaling 24k triangles and obstacle meshes of over 190k triangles, our methods were up to 2.8 times faster in batched discrete-pose queries than a state-of-the-art GPU-based method using a sphere robot representation. For the same obstacle mesh scene, our sphere-robot continuous collision detection was up to 7 times faster depending on trajectory batch size. We also performed detailed measurements of the volume coverage accuracy of various sphere/mesh pose/path representations to provide insight into the tradeoffs between speed and accuracy of different robot collision detection methods.

Abstract:
This paper proposes a fair control framework for multi-robot systems, which integrates the newly introduced Alternative Authority Control (AAC) and Flexible Control Barrier Function (F-CBF). Control authority refers to a single robot which can plan its trajectory while considering others as moving obstacles, meaning the other robots do not have authority to plan their own paths. The AAC method dynamically distributes the control authority, enabling fair and coordinated movement across the system. This approach significantly improves computational efficiency, scalability, and robustness in complex environments. The proposed F-CBF extends traditional CBFs by incorporating obstacle shape, velocity, and orientation. FCBF enhances safety by accurate dynamic obstacle avoidance. The framework is validated through simulations in multi-robot scenarios, demonstrating its safety, robustness and computational efficiency.

Abstract:
There are many ways for a gripper to estimate the forces between its fingers. If powered by direct-drive brushless motors, then one technique is to measure their current. This is not the most accurate technique, but it is simple, keeps the sensor remote, and requires no new components. The estimation involves multiplying current signals through by the torque constant and the inverse transpose of the Jacobian. The Jacobian either amplifies the signal from fingertip force to motor current (at the cost of tip force production), or diminishes it (with the gain of tip force production), indicating an inherent trade-off. However, the Jacobian is a function of configuration, and for any workspace point there are multiple configurations (multiple inverse kinematics solutions), therefore a selection of Jacobian exists. For a given workspace point, the number of Jacobian choices is just a few, but these choices can be designed (through dimensional synthesis) to overcome the trade-off. The problem can be framed as velocity ellipse synthesis over multiple output modes. In this work, we conduct optimal synthesis to compute a new gripper design. The gripper was built and tested. It transitions between two different modes: sense mode and grip mode. Sense mode can sense forces 3 times smaller than grip mode. Grip mode can exert forces 4 times greater than sense mode.

Abstract:
Methods for generative design of robot physical configurations can automatically find optimal and innovative solutions for challenging tasks in complex environments. The vast search-space includes the physical design-space and the controller parameter-space, making it a challenging problem in machine learning and optimisation in general. Evolutionary algorithms (EAs) have shown promising results in generating robot designs via gradient-free optimisation. Morpho-evolution with learning (MEL) uses EAs to concurrently generate robot designs and learn the optimal parameters of the controllers. Two main issues prevent MEL from scaling to higher complexity tasks: i) computational cost and ii) premature convergence to sub-optimal designs. To address these issues, we propose combining morpho-evolution with intrinsic motivations. Intrinsically motivated behaviour arises from embodiment and simple learning rules without external guidance. We use a homeokinetic controller that generates exploratory behaviour in a few seconds with minimal knowledge of the robot's design. Homeokinesis replaces costly learning phases, reducing computational time and favouring diversity, preventing premature convergence. We compare our approach with current MEL methods in several downstream tasks. The generated designs score higher in all the tasks, are more diverse, and are quickly generated compared to morpho-evolution with static parameters. Source and containers available at github.com/AutonomousRoboticEvolution.

Abstract:
Modern non-linear model-based controllers require an accurate physics model and model parameters to be able to control mobile robots at their limits. Also, due to surface slipping at high speeds, the friction parameters may continually change (like tire degradation in autonomous racing), and the controller may need to adapt rapidly. Many works derive a task-specific robot model with a parameter adaptation scheme that works well for the task but requires a lot of effort and tuning for each platform and task. In this work, we design a full model-learning-based controller based on meta pretraining that can very quickly adapt using few-shot dynamics data to any wheel-based robot with any model parameters, while also reasoning about model uncertainty. We demonstrate our results in small-scale numeric simulation, the large-scale Unity simulator, and on a medium-scale hardware platform with a wide range of settings. We show that our results are comparable to domain-specific well-engineered controllers, and have excellent generalization performance across all scenarios.

Abstract:
Risk-bounded motion planning in dynamic environments for autonomous driving presents complex challenges, particularly in solving the nonconvex problem of ensuring continuous, safe, and real-time navigation towards a destination. This paper introduces an online graph-based local planning approach constrained by a user-defined driving style in terms of a risk budget \Delta for the entire mission. Our online approach assigns a risk bound to each motion planning decision, ensuring that the total risk consumed remains within \Delta. First, we construct a spatial lattice graph that adheres to the vehicle's curvature constraints. Then, the trajectory planning problem is reformulated as an online optimization problem, where decisions must be made sequentially without prior knowledge of future events. Therefore, we propose a reduction to the problem to be online multiple-choice knapsack problem (ON-MCKP), where the knapsack items are candidate paths generated by solving constrained shortest-path problems. To solve the ON-MCKP, we deploy online algorithms that offer theoretical guarantees on the risk allocation throughout the entire mission. The effectiveness of our method is demonstrated empirically, showing significant improvements in the objective without violating safety constraints.

Abstract:
Collecting data from simulated scenarios for training robotic skills provides a safer and more controllable alternative to real-world environments. However, it demands considerable effort, including the manual construction of simulation environments, the careful design of tasks, and the challenge of obtaining effective trajectories. These limitations hinder the efficiency of data collection from simulated scenarios. In this paper, we leverage the prior knowledge of Large Language Models (LLMs) and Large Multimodal Models (LMMs) to generate simulated scenarios and embodied tasks. We introduce a novel framework, ASCENT (Autonomous Skill learning toward Complex Embodied tasks with fouNdaTion models), designed to efficiently accomplish these tasks and generate trajectory data. ASCENT features a fully autonomous skill learning mechanism based on AI agent. During task training, the AI agent identifies suitable atomic skills from an atomic skill library to either directly complete the task or serve as an initial policy for further training. Newly acquired atomic skills are subsequently added to the library. To address training failures and enhance efficiency, the AI agent uses an LLM to automatically optimize the skill training process based on feedback received from simulations. Experimental results indicate that the number of training steps required for learning new tasks can be reduced by up to 65.9 %.

Abstract:
Autonomous navigation is crucial for various robotics applications in agriculture. However, many existing methods depend on RTK-GPS devices, which can be susceptible to loss of radio signal or intermittent reception of corrections from the internet. Consequently, research has increasingly focused on using RGB cameras for crop-row detection, though challenges persist when dealing with grown plants. This paper introduces a LiDAR-based navigation system that can achieve crop-agnostic over-canopy autonomous navigation in row-crop fields, even when the canopy fully blocks the inter-row spacing. Our algorithm can detect crop rows across diverse scenarios, encompassing various crop types, growth stages, illumination conditions, the presence of weeds, curved rows, and discontinuities. Without utilizing a global localization method (i.e., based on GPS), our navigation system can perform autonomous navigation in these challenging scenarios, detect the end of the crop rows, and navigate to the next crop row autonomously, providing a crop-agnostic approach to navigate an entire field. The proposed navigation system has undergone tests in various simulated and real agricultural fields, achieving an average cross-track error of 3.55 cm without human intervention. The system has been deployed on a customized UGV robot, which can be reconfigured depending on the field conditions.

Abstract:
Intelligent and reliable task planning is a core capability for generalized robotics, which requires a descriptive domain representation that sufficiently models all object and state information for the scene. We present CLIMB, a continual learning framework for robot task planning that leverages foundation models and feedback from execution to guide the construction of domain models. CLIMB can build a model from a natural language description, learn non-obvious predicates while solving tasks, and store that information for future problems. We demonstrate the ability of CLIMB to improve performance in common planning environments compared to baseline methods. We also developed the BlocksWorld++ domain, a simulated environment with an easily usable real counterpart, together with a curriculum of tasks with progressing difficulty to evaluate continual learning. Code and additional details for this system can be found at https://plan-with-climb.github.io/.

Abstract:
Imitation learning from human demonstrations is an effective means to teach robots manipulation skills. But data acquisition is a major bottleneck in applying this paradigm more broadly, due to the high costs and human efforts involved. There has been significant interest in imitation learning for bimanual dexterous robots, like humanoids. Unfortunately, data collection is even more challenging here due to the difficulty of simultaneously controlling the two arms and multi-fingered hands. Automated data generation in simulation is a compelling, scalable alternative to fuel this need for training data. To this end, we introduce DexMimicGen, a large-scale automated data generation system that synthesizes trajectories from a handful of human demonstrations for bimanual robots with dexterous hands. We present a collection of simulation environments in the setting of bimanual dexterous manipulation, spanning a range of manipulation behaviors and different requirements for coordination among the two arms. We generate 21K demos across these tasks from just 60 source human demos and study the effect of several data generation and policy learning decisions on agent performance. Finally, we present a real-to-sim-to-real pipeline and deploy it on a real-world humanoid can sorting task. Generated datasets, simulation environments and additional results are at dexmimicgen.github.io.

Abstract:
Achieving robust dexterous manipulation in un-structured domestic environments remains a significant challenge in robotics. Even with state-of-the-art robot learning methods, haptic-oblivious control strategies (i.e. those relying only on external vision and/or proprioception) often fall short due to occlusions, visual complexities, and the need for precise contact interaction control. To address these limitations, we introduce PolyTouch, a novel robot finger that integrates camera-based tactile sensing, acoustic sensing, and peripheral visual sensing into a single design that is compact and durable. PolyTouch provides high-resolution tactile feedback across multiple temporal scales, which is essential for efficiently learning complex manipulation tasks. Experiments demonstrate an at least 20-fold increase in lifespan over commercial tactile sensors, with a design that is both easy to manufacture and scalable. We then use this multimodal tactile feedback along with visuo-proprioceptive observations to synthesize a tactile-diffusion policy from human demonstrations; the resulting contact-aware control policy significantly outperforms haptic-oblivious policies in multiple contact-aware manipulation policies. This paper highlights how effectively integrating multimodal contact sensing can hasten the development of effective contact-aware manipulation policies, paving the way for more reliable and versatile domestic robots. More information can be found at https://polytouch.alanz.info/.

Abstract:
In this paper, we present a new compact vision sensor consisting of two fisheye event cameras mounted back-to-back, which offers a full 360-degree view of the surrounding environment. We describe the optical design, projection model and practical calibration using the incoming stream of events, of the novel stereo camera, called SFERA. The potential of SFERA for real-time target tracking is evaluated using a Bayesian estimator adapted to the geometry of the sphere. Real-world experiments with a prototype of SFERA, including two synchronized Prophesee EVK4 cameras and a DJI Mavic Air 2 quadrotor, show the effectiveness of the proposed system for aerial surveillance.

Abstract:
Comparing graph-structured maps is a task of paramount importance in robotic exploration and cartography, but unfortunately the computational cost of the existing similarity measures, such as the graph edit distance (GED), is prohibitive for large graphs. In this paper, we introduce and characterize three new graph distance measures which satisfy the requirements for a metric. The first one, LogEig, computes the square root of the sum of the squared logarithms of the generalized eigenvalues of the shifted Laplacian matrices associated with the two graphs, while the second calculates the Bures distance between these positive definite matrices. The third distance, Rank, computes the rank of the difference of the graph shift operators associated with the two graphs, e.g. the adjacency or the Laplacian matrix. Examples and numerical experiments with graphs from a publicly-available dataset, show the accuracy and computational efficiency of the new metrics for 2D topological-map matching, compared to the GED. The effect of spectral sparsification on the new graph distance measures is examined as well.

Abstract:
We present LiDAR-EDIT, a novel paradigm for generating synthetic LiDAR data for autonomous driving. Our framework edits real-world LiDAR scans by introducing new object layouts while preserving the realism of the background environment. Compared to end-to-end frameworks that generate LiDAR point clouds from scratch, LiDAR-EDIT offers users full control over the object layout, including the number, type, and pose of objects, while keeping most of the original real-world background. Our method also provides object labels for the generated data. Compared to novel view synthesis techniques, our framework allows for the creation of counterfactual scenarios with object layouts significantly different from the original real-world scene. LiDAR-EDIT uses spherical voxelization to enforce correct LiDAR projective geometry in the generated point clouds by construction. During object removal and insertion, generative models are employed to fill the unseen background and object parts that were occluded in the original real LiDAR scans. Experimental results demonstrate that our framework produces realistic LiDAR scans with practical value for downstream tasks. Project website with open-sourced code: https://sites.google.com/view/lidar-edit

Abstract:
Vision-based tactile sensors (VBTSs) provide highresolution tactile images crucial for robot in-hand manipulation. However, force sensing in VBTSs is underutilized due to the costly and time-intensive process of acquiring paired tactile images and force labels. In this study, we introduce a transferable force prediction model, TransForce, designed to leverage collected image-force paired data for new sensors under varying illumination colors and marker patterns while improving the accuracy of predicted forces, especially in the shear direction. Our model effectively achieves translation of tactile images from the source domain to the target domain, ensuring that the generated tactile images reflect the illumination colors and marker patterns of the new sensors while accurately aligning the elastomer deformation observed in existing sensors, which is beneficial to force prediction of new sensors. As such, a recurrent force prediction model trained with generated sequential tactile images and existing force labels is employed to estimate higher-accuracy forces for new sensors with lowest average errors of 0.69 N (5.8 % in full work range) in x-axis, 0.70 N(5.8%) in y-axis, and 1.11 N(6.9%) in z-axis compared with models trained with single images. The experimental results also reveal that pure marker modality is more helpful than the RGB modality in improving the accuracy of force in the shear direction, while the RGB modality show better performance in the normal direction.

Abstract:
In this paper, we tackle the problem of estimating 3D contact forces using vision-based tactile sensors. In particular, our goal is to estimate contact forces over a large range (up to 15 N) on any objects while generalizing across different vision-based tactile sensors. Thus, we collected a dataset of over 200K indentations using a robotic arm that pressed various indenters onto a GelSight Mini sensor mounted on a force sensor and then used the data to train a multi-head transformer for force regression. Strong generalization is achieved via accurate data collection and multi-objective optimization that leverages depth contact images. Despite being trained only on primitive shapes and textures, the regressor achieves a mean absolute error of 4% on a dataset of unseen real-world objects. We further evaluate our approach's generalization capability to other GelSight mini and DIGIT sensors, and propose a reproducible calibration procedure for other sensors. Finally, the method was evaluated on real-world tasks, including weighing objects and controlling the deformation of delicate objects. Supplementary material and demo are available at http://prg.cs.umd.edu/FeelAnyForce.

Abstract:
One of the essential aspects of humanoid robot running is determining the limb-swinging trajectories. During the flight phases, where the ground reaction forces are not available for regulation, the limb swinging trajectories are significant for the stability of the next stance phase. Due to the conservation of angular momentum, improper leg and arm swinging results in highly tilted and unsustainable body configurations at the next stance phase landing. In such cases, the robotic system fails to maintain locomotion independent of the stability of the center of mass trajectories. This problem is more apparent for fast and high flight time trajectories. This paper proposes a real-time nonlinear limb trajectory optimization problem for humanoid running. The optimization problem is tested on two different humanoid robot models, and the generated trajectories are verified using a running algorithm for both robots in a simulation environment.

Abstract:
Accurate estimation of contact forces is crucial for effective control of quadrupedal robots, especially in complex locomotion scenarios. In this paper, we introduce a novel force estimation technique for robots equipped with transformable leg-wheels. Unlike conventional methods that focus on forces at specific contact points, our approach expresses varying contact points through a simplified kinematic model and derives the corresponding Jacobian matrices. This allows us to apply the virtual work method to evaluate contact forces across the entire surface of the leg-wheel, including the tips, sides, and other contact regions. This adaptability is particularly advantageous in hybrid locomotion modes, where different parts of the leg-wheel interact with the terrain. The proposed method is highly efficient, relying solely on motor current and position feedback without the need for additional sensors. We validate our approach through simulations and real-world experiments, demonstrating its accuracy, robustness, and applicability under diverse operational conditions.

Abstract:
Robotic assistive feeding holds significant promise for improving the quality of life for individuals with eating disabilities. However, acquiring diverse food items under varying conditions and generalizing to unseen food presents unique challenges. Existing methods that rely on surface-level geometric information (e.g., bounding box and pose) derived from visual cues (e.g., color, shape, and texture) often lacks adaptability and robustness, especially when foods share similar physical properties but differ in visual appearance. We employ imitation learning (IL) to learn a policy for food acquisition. Existing methods employ IL or Reinforcement Learning (RL) to learn a policy based on off-the-shelf image encoders such as ResNet-50. However, such representations are not robust and struggle to generalize across diverse acquisition scenarios. To address these limitations, we propose a novel approach, IMRL (Integrated Multi-Dimensional Representation Learning), which integrates visual, physical, temporal, and geometric representations to enhance the robustness and generalizability of IL for food acquisition. Our approach captures food types and physical properties (e.g., solid, semi-solid, granular, liquid, and mixture), models temporal dynamics of acquisition actions, and introduces geometric information to determine optimal scooping points and assess bowl fullness. IMRL enables IL to adaptively adjust scooping strategies based on context, improving the robot's capability to handle diverse food acquisition scenarios. Experiments on a real robot demonstrate our approach's robustness and adaptability across various foods and bowl configurations, including zero-shot generalization to unseen settings. Our approach achieves an improvement up to 35 % in success rate compared with the best-performing baseline. More details can be found on our website https://ruiiu.github.io/imrl.

Abstract:
Vehicle-to-everything technologies (V2X) have become an ideal paradigm to extend the perception range and see through the occlusion. Exiting efforts focus on single-frame cooperative perception, however, how to capture the temporal cue between frames with V2X to facilitate the prediction task even the planning task is still underexplored. In this paper, we introduce the Co-MTP, a general cooperative trajectory prediction framework with multi-temporal fusion for autonomous driving, which leverages the V2X system to fully capture the interaction among agents in both history and future domains to benefit the planning. In the history domain, V2X can complement the incomplete history trajectory in single-vehicle perception, and we design a heterogeneous graph transformer to learn the fusion of the history feature from multiple agents and capture the history interaction. Moreover, the goal of prediction is to support future planning. Thus, in the future domain, V2X can provide the prediction results of surrounding objects, and we further extend the graph transformer to capture the future interaction among the ego planning and the other vehicles' intentions and obtain the final future scenario state under a certain planning action. We evaluate the Co-MTP framework on the real-world dataset V2X-Seq, and the results show that Co-MTP achieves state-of-the-art performance and that both history and future fusion can greatly benefit prediction. Our code is available on our project website: https://xiaomiaozhang.github.io/Co-MTP/

Abstract:
This paper presents a fully autonomous robotic system that performs sim-to-real transfer in complex longhorizon tasks involving navigation, recognition, grasping, and stacking in an environment with multiple obstacles. The key feature of the system is the ability to overcome typical sensing and actuation discrepancies during sim-to-real transfer and to achieve consistent performance without any algorithmic modifications. To accomplish this, a lightweight noise-resistant visual perception system and a nonlinearityrobust servo system are adopted. We conduct a series of tests in both simulated and realworld environments. The visual perception system achieves the speed of 11 ms per frame due to its lightweight nature, and the servo system achieves sub-centimeter accuracy with the proposed controller. Both exhibit high consistency during sim-to-real transfer. Benefiting from these, our robotic system took first place in the mineral searching task of the Robotic Sim2Real Challenge hosted at ICRA 2024.

Abstract:
The interaction of robots with their environment requires robust object-centric perception capabilities, typically achieved using learning-based methods trained on synthetic data. However, real-world deployment demands evaluating these capabilities in relevant environments, often involving extensive manual annotation for a quantitative analysis. Additionally, standardized evaluations for robotic tasks, such as grasping, need reproducible object scene configurations and performance benchmarks. We propose a solution to both problems by temporarily employing 3D-printed components, socalled fixtures, which can be designed for any rigid object. Once the scene is set up and object poses are extracted, the fixtures are removed, leaving the natural scene without any artificial distractions. The presented approach is seemingly applicable for pre-determined configurations of multiple objects, which enables precise re-building of scenes with consistent object-toobject relations. Our suggested annotation procedure achieves strong pose accuracy solely on RGB images without any manual involvement. We evaluate and show the usability of the proposed fixtures for automated real-world data annotation to fine-tune a detector and for benchmarking object pose estimation algorithms for robotic grasping. Code and fixture meshes for 3D printing are available at https://github.com/DLRRM/fixture_generation.

Abstract:
In the realm of robotics, teleoperation plays a pivotal role in performing high-risk or intricate tasks, and obtaining precise 3D whole-body pose is crucial for this purpose. Traditional two-stage methods have limitations in estimating different body parts, leading to complex systems and higher estimation errors. In order to address these issues,the paper introduces a novel framework called Graph High-Resolution Network (GraphHRNet) for accurate 3D whole-body pose estimation, which is essential for the teleoperation of humanoid robots. GraphHRNet effectively captures global structural information and local details by integrating a High-Resolution Module and a Multi-branch Regression Module. The High-Resolution Module utilizes an enhanced graph convolution kernel to fuse multi-scale features, capturing global information, while the Multi-branch Regression Module focuses on refining and predicting accurate 3D coordinates for intricate body parts such as hands and face. Experimental results on the H3WB dataset demonstrate that GraphHRNet surpasses state-of-the-art (SOTA) methods in 3D whole-body pose estimation, significantly improving performance. Furthermore, the paper explores the potential application of this approach in a tele-operation system for humanoid robots, providing an intuitive and high-fidelity solution for remotely executing complex tasks. The code have been publicly available at https://github.com/Z-mingyu/GraphHRNet.git

Abstract:
Teleoperation is an important technology to enable supervisors to control agricultural robots remotely. However, environmental factors in dense crop rows and limitations in network infrastructure hinder the reliability of data streamed to teleoperators. These issues result in delayed and variable frame rate video feeds that often deviate significantly from the robot's actual viewpoint. We propose a modular learning-based vision pipeline to generate delay-compensated images in real-time for supervisors. Our extensive offline evaluations demonstrate that our method generates more accurate images compared to state-of-the-art approaches in our setting. Additionally, ours is one of the few works to evaluate a delay-compensation method in outdoor field environments with complex terrain on data from a real robot in real-time. Resulting videos and code are provided at https://sites.google.com/illinois.edu/comp-teleop.

Abstract:
In this work, we focus on addressing the long-horizon packing tasks in densely cluttered scenes. Such tasks require policies to effectively manage severe occlusions among objects and continually produce precise actions based on visual observations. We propose a vision-based Hierarchical policy for Cluttered-scene Long-horizon Manipulation (HCLM). It employs a high-level policy and three options to select and instantiate three parameterized action primitives: push, pick, and place. We first train the two-stream pick and place options by behavior cloning (BC). Subsequently, we use hierarchical reinforcement learning (HRL) to train the high-level policy and push option. During HRL, we propose a Spatially Extended Q-update (SEQ) to augment the updates for the push option and a Two-Stage Update Scheme (TSUS) to alleviate the non-stationary transition problem in updating the high-level policy. We demonstrate that HCLM significantly outperforms baseline methods in terms of success rate and efficiency in diverse tasks both in simulation and real world. The ablation studies also validate the key roles of SEQ and TSUS in HRL.

Abstract:
Multi-agent systems (MAS) have shown great potential in executing complex tasks, but coordination and safety remain significant challenges. Multi-Agent Reinforcement Learning (MARL) offers a promising framework for agent collaboration, but it faces difficulties in handling complex tasks and designing reward functions. The introduction of Large Language Models (LLMs) has brought stronger reasoning and cognitive abilities to MAS, but existing LLM-based systems struggle to respond quickly and accurately in dynamic environments. To address these challenges, we propose LLM-based Graph Collaboration MARL (LGC-MARL), a framework that efficiently combines LLMs and MARL. This framework decomposes complex tasks into executable subtasks and achieves efficient collaboration among multiple agents through graph-based coordination. Specifically, LGC-MARL consists of two main components: an LLM planner and a graph-based collaboration meta policy. The LLM planner transforms complex task instructions into a series of executable subtasks, evaluates the rationality of these subtasks using a critic model, and generates an action dependency graph. The graph-based collaboration meta policy facilitates communication and collaboration among agents based on the action dependency graph, and adapts to new task environments through meta-learning. Experimental results on the AI2-THOR simulation platform demonstrate the superior performance and scalability of LGC-MARL in completing various complex tasks.

Abstract:
World models have become increasingly popular in acting as learned traffic simulators. Recent work has explored replacing traditional traffic simulators with world models for policy training. In this work, we explore the robustness of existing metrics to evaluate world models as traffic simulators to see if the same metrics are suitable for evaluating a world model as a pseudo-environment for policy training. Specifically, we analyze the metametric employed by the Waymo Open Sim-Agents Challenge (WOSAC) and compare world model predictions on standard scenarios where the agents are fully or partially controlled by the world model (partial replay). Furthermore, since we are interested in evaluating the ego actionconditioned world model, we extend the standard WOSAC evaluation domain to include agents that are causal to the ego vehicle. Our evaluations reveal a significant number of scenarios where top-ranking models perform well under no perturbation but fail when the ego agent is forced to replay the original trajectory. To address these cases, we propose new metrics to highlight the sensitivity of world models to uncontrollable objects and evaluate the performance of world models as pseudo-environments for policy training and analyze some state-of-the-art world models under these new metrics.

Abstract:
The Mini Wheelbot is a balancing, reaction wheel unicycle robot designed as a testbed for learning-based control. It is an unstable system with highly nonlinear yaw dynamics, non-holonomic driving, and discrete contact switches in a small, powerful, and rugged form factor. The Mini Wheelbot can use its wheels to stand up from any initial orientation - enabling automatic environment resets in repetitive experiments and even challenging half flips. We illustrate the effectiveness of the Mini Wheelbot as a testbed by implementing two popular learning-based control algorithms. First, we showcase Bayesian optimization for tuning the balancing controller. Second, we use imitation learning from an expert nonlinear MPC that uses gyroscopic effects to reorient the robot and can track higher-level velocity and orientation commands. The latter allows the robot to drive around based on user commands - for the first time in this class of robots. The Mini Wheelbot is not only compelling for testing learning-based control algorithms, but it is also just fun to work with, as demonstrated in the video of our experiments at https://youtu.be/_d7AqTRjz6g.

Abstract:
We explore a navigation planning problem under uncertainty for a simple robot with extremely limited sensing. Our robot can turn subject to significant proportional error and move forward. As it moves in an environment with a known terrain map, the robot can detect changes in the terrain at its current position. Given an initial pose and a goal segment, the robot should find some sequence of actions to travel reliably from start to goal, if such a sequence exists. The resulting plan should guarantee the robot reaches the goal segment despite any movement errors experienced within some known error bound. In this paper, we propose an algorithm to find such an action sequence, implement and evaluate this algorithm, and present evidence for the feasibility of such an algorithm in an underwater navigation setting.

Abstract:
This paper introduces Bidirectional Lazy Informed Trees (BLIT), the first algorithm to incorporate anytime incremental lazy bidirectional heuristic search (Bi-HS) into batch-wise sampling-based motion planning (Bw-SBMP). BLIT operates on batches of informed states (states that can potentially improve the cost of the incumbent solution) structured as an implicit random geometric graph (RGG). The computational cost of collision detection is mitigated via a new lazy edge-evaluation strategy by focusing on states near obstacles. Experimental results, especially in high dimensions, show that BLIT outperforms existing Bw-SBMP planners by efficiently finding an initial solution and effectively improving the quality as more computational resources are available.

Abstract:
Panoptic 3D reconstruction from a monocular video is a fundamental perceptual task in robotic scene understanding. However, existing efforts suffer from inefficiency in terms of inference speed and accuracy, limiting their practical applicability. We present EPRecon, an efficient real-time panoptic 3D reconstruction framework. Current volumetric-based reconstruction methods usually utilize multi-view depth map fusion to obtain scene depth priors, which is time-consuming and poses challenges to real-time scene reconstruction. To address this issue, we propose a lightweight module to directly estimate scene depth priors in a 3D volume for reconstruction quality improvement by generating occupancy probabilities of all voxels. In addition, compared with existing panoptic segmentation methods, EPRecon extracts panoptic features from both voxel features and corresponding image features, obtaining more detailed and comprehensive instance-level semantic information and achieving more accurate segmentation results. Experimental results on the ScanNet V2dataset demonstrate the superiority of EPRecon over current state-of-the-art methods in terms of both panoptic 3D reconstruction quality and real-time inference. Code is available at https://github.com/zhen6618/EPRecon.

Abstract:
In warehousing systems, to enhance efficiency amid surging demand volumes, much attention has been placed on how to reasonably allocate tasks of delivery to robots. However, the labor of robots is still inevitably wasted to some extent. In this paper, we propose a pre-scheduling enhanced warehousing framework aiming to foresee and act in advance, which consists of task flow prediction and hybrid task allocation. For task prediction, we design the spatio-temporal representations of the task flow and introduce a periodicity-decoupled mechanism tailored for the generation patterns of aggregated orders, and then further extract spatial features of task distribution with a novel combination of graph structures. In hybrid tasks allocation, we consider the known tasks and predicted future tasks simultaneously to optimize the task allocation. In addition, we consider factors such as predicted task uncertainty and sector-level efficiency to realize more balanced and rational allocations. We validate our task prediction model across datasets derived from factories, achieving SOTA performance. Furthermore, we implement our system in a real-world robotic warehouse, demonstrating more than 30% improvements in efficiency.

Abstract:
LiDAR place recognition is a crucial module in localization that matches the current location with previously observed environments. Most existing approaches in LiDAR place recognition dominantly focus on the spinning type LiDAR to exploit its large FOV for matching. However, with the recent emergence of various LiDAR types, the importance of matching data across different LiDAR types has grown significantly-a challenge that has been largely overlooked for many years. To address these challenges, we introduce HeLiOS, a deep network tailored for heterogeneous LiDAR place recognition, which utilizes small local windows with spherical transformers and optimal transport-based cluster assignment for robust global descriptors. Our overlap-based data mining and guided-triplet loss overcome the limitations of traditional distance-based mining and discrete class constraints. HeLiOS is validated on public datasets, demonstrating performance in heterogeneous LiDAR place recognition while including an evaluation for longterm recognition, showcasing its ability to handle unseen LiDAR types. We release the HeLiOS code as an open source for the robotics community at https://github.com/minwoo0611/HeLiOS.

Abstract:
Motion planning failures during autonomous navigation often occur when safety constraints are either too conservative, leading to deadlocks, or too liberal, resulting in collisions. To improve robustness, a robot must dynamically adapt its safety constraints to ensure it reaches its goal while balancing safety and performance measures. To this end, we propose a Soft Actor-Critic (SAC)-based policy for adapting Control Barrier Function (CBF) constraint parameters at runtime, ensuring safe yet non-conservative motion. The proposed approach is designed for a general high-level motion planner, low-level controller, and target system model, and is trained in simulation only. Through extensive simulations and physical experiments, we demonstrate that our framework effectively adapts CBF constraints, enabling the robot to reach its final goal without compromising safety.

Abstract:
As the global population ages, effective rehabilitation and mobility aids will become increasingly critical. Gait assistive robots are promising solutions, but designing adaptable controllers for various impairments poses a significant challenge. This paper presented a Human-In-The-Loop (HITL) simulation framework tailored specifically for gait assistive robots, addressing unique challenges posed by passive support systems. We incorporated a realistic physical human-robot interaction (pHRI) model to enable a quantitative evaluation of robot control strategies, highlighting the performance of a speed-adaptive controller compared to a conventional PID controller in maintaining compliance and reducing gait distortion. We assessed the accuracy of the simulated interactions against that of the real-world data and revealed discrepancies in the adaptation strategies taken by the human and their effect on the human's gait. This work underscored the potential of HITL simulation as a versatile tool for developing and fine-tuning personalized control policies for various users.

Abstract:
Visual dense SLAM can facilitate pose estimation and map reconstruction for sensor carriers in unknown environments. However, in uncontrolled environments such as offices, shopping malls, and train stations, frequent occurrences of people walking back and forth or temporary movement of objects within the scene are common. Most existing visual dense SLAM systems do not account for these dynamic factors, leading to localization drift and map distortion. In this paper, we propose DGS-SLAM, a system capable of achieving robust localization and high-fidelity static map reconstruction in dynamic environments. We utilize semantic 3D Gaussians for scene representation, effectively eliminating interference from dynamic objects and refining the reconstruction of static background. We enhance the tracking accuracy and mapping quality of dense SLAM by using a distance distribution-based Gaussian pruning algorithm and implementing a coarse-to-fine tracking strategy with bundle adjustment and differentiable rendering. We perform qualitative and quantitative evaluations on two publicly available dynamic environment datasets. The results indicate that our method effectively reduces the interference caused by dynamic objects, enabling visual dense SLAM to maintain competitive tracking accuracy and mapping performance in dynamic environments.

Abstract:
In this work, to facilitate the real-time processing of multi-scan registration error minimization on factor graphs, we devise a point cloud downsampling algorithm based on coreset extraction. This algorithm extracts a subset of the residuals of input points such that the subset yields exactly the same quadratic error function as that of the original set for a given pose. This enables a significant reduction in the number of residuals to be evaluated without approximation errors at the sampling point. Using this algorithm, we devise a complete SLAM framework that consists of odometry estimation based on sliding window optimization and global trajectory optimization based on registration error minimization over the entire map, both of which can run in real time on a standard CPU. The experimental results demonstrate that the proposed framework outperforms state-of-the-art CPU-based SLAM frameworks without the use of GPU acceleration.

Abstract:
Cooperative perception enhances the individual perception capabilities of autonomous vehicles (AVs) by providing a comprehensive view of the environment. However, balancing perception performance and transmission costs remains a significant challenge. Current approaches that transmit regionlevel features across agents are limited in interpretability and demand substantial bandwidth, making them unsuitable for practical applications. In this work, we propose CoopDETR, a novel cooperative perception framework that introduces objectlevel feature cooperation via object query. Our framework consists of two key modules: single-agent query generation, which efficiently encodes raw sensor data into object queries, reducing transmission cost while preserving essential information for detection; and cross-agent query fusion, which includes Spatial Query Matching (SQM) and Object Query Aggregation (OQA) to enable effective interaction between queries. Our experiments on the OPV2V and V2XSet datasets demonstrate that CoopDETR achieves state-of-the-art performance and significantly reduces transmission costs to 1/782 of previous methods.

Abstract:
Autonomous navigation and exploration in unmapped environments remains a significant challenge in robotics due to the difficulty robots face in making commonsense inference of unobserved geometries. Recent advancements have demonstrated that generative modeling techniques, particularly diffusion models, can enable systems to infer these geometries from partial observation. In this work, we present implementation details and results for real-time, online occupancy prediction using a modified diffusion model. By removing attention-based visual conditioning and visual feature extraction components, we achieve a 73% reduction in runtime with minimal accuracy reduction. These modifications enable occupancy prediction across the entire map, rather than limiting it to the area around the robot where sensor data can be collected. We introduce a probabilistic update method for merging predicted occupancy data into running occupancy maps, resulting in a 71% improvement in predicting occupancy at map frontiers compared to previous methods. Finally, our code and a ROS node for on-robot operation can be found on our website: https://arpg.github.io/scenesense/.

Abstract:
Current open-vocabulary scene graph generation algorithms highly rely on both 3D scene point cloud data and posed RGB-D images and thus have limited applications in scenarios where RGB-D images or camera poses are not readily available. To solve this problem, we propose Point2Graph, a novel end-to-end point cloud-based 3D open-vocabulary scene graph generation framework in which the requirement of posed RGB-D image series is eliminated. This hierarchical framework contains room and object detection/segmentation and openvocabulary classification. For the room layer, we leverage the advantage of merging the geometry-based border detection algorithm with the learning-based region detection to segment rooms and create a “Snap-Lookup” framework for openvocabulary room classification. In addition, we create an end-toend pipeline for the object layer to detect and classify 3D objects based solely on 3D point cloud data. Our evaluation results show that our framework can outperform the current state-of-the-art (SOTA) open-vocabulary object and room segmentation and classification algorithm on widely used real-scene datasets.

Abstract:
In this paper, we propose VLM-Vac, a novel framework designed to enhance the autonomy of smart robot vacuum cleaners. Our approach integrates the zero-shot object detection capabilities of a Vision-Language Model (VLM) with a Knowledge Distillation (KD) strategy. By leveraging the VLM, the robot can categorize objects into actionable classes-either to avoid or to suck-across diverse backgrounds. However, frequently querying the VLM is computationally expensive and impractical for real-world deployment. To address this issue, we implement a KD process that gradually transfers the essential knowledge of the VLM to a smaller, more efficient model. Our real-world experiments demonstrate that this smaller model progressively learns from the VLM and requires significantly fewer queries over time. Additionally, we tackle the challenge of continual learning in dynamic home environments by exploiting a novel experience replay method based on languageguided sampling. Our results show that this approach not only reduces energy consumption by 53 % compared to cumulative learning but also surpasses conventional vision-based clustering methods, particularly in detecting small objects across diverse backgrounds.

Abstract:
Quadrupeds exhibit remarkable locomotion performance through the coordination between their limbs and torso. From past biological knowledge, it is understood that during walking, the forelimbs primarily contribute to braking, while the hindlimb are responsible for propulsion. However, in the field of quadruped robot dynamics, effectively leveraging this coordination remains a challenge. To investigate the torso-limb coordination, this study explores the walking performance of a quadruped walker with a compliant torso, driven by the forelimb or the hindlimb. Through numerical simulations, we analyze the walking behavior under different control drive methods. The findings provide insights into the design of compliant-bodied robots and the optimal distribution of propulsion forces between the forelimbs and hindlimbs.

Abstract:
Characterized by their elongate bodies and relatively simple legs, multi-legged robots have the potential to locomote through complex terrains for applications such as search-and-rescue and terrain inspection. Prior work has developed effective and reliable locomotion strategies for multilegged robots by propagating the two waves of lateral body undulation and leg stepping, which we will refer to as the twowave template. However, these robots have limited capability to climb over obstacles with sizes comparable to their heights. We hypothesize that such limitations stem from the twowave template that we used to prescribe the multi-legged locomotion. Seeking effective alternative waves for obstacleclimbing, we designed a five-segment robot with static (nonactuated) legs, where each cable-driven joint has a rotational degree-of-freedom (DoF) in the sagittal plane (vertical wave) and a linear DoF (peristaltic wave). We tested robot locomotion performance on a flat terrain and a rugose terrain. While the benefit of peristalsis on flat-ground locomotion is marginal, the inclusion of a peristaltic wave substantially improves the locomotion performance in rugose terrains: it not only enables obstacle-climbing capabilities with obstacles having a similar height as the robot, but it also significantly improves the traversing capabilities of the robot in such terrains. Our results demonstrate an alternative actuation mechanism for multilegged robots, paving the way towards all-terrain multi-legged robots.

Abstract:
Articulated wheeled robots play a crucial role in the logistics industry. However, conventional tractor-driven articulated wheeled robots exhibit poor internal stability and are prone to jackknifing, while also consuming a significant amount of energy. By deploying distributed drives and coordinating control among multiple drives, these issues can be effectively addressed. However, the flexible connections between the bodies of articulated vehicles pose significant challenges to the coordinated control of distributed drives. This paper proposes a multi-drive unit coordinated control algorithm based on driving force equivalence and allocation. A neural network is used to predict the driving force, and through non-linear driving force equivalence, a feedforward driving force is obtained. This is combined with a closed-loop feedback compensation controller to form a control architecture that integrates feedforward and feedback, resulting in the equivalent total driving force for the vehicle queue. Subsequently, an equivalent distribution strategy allocates the required driving force to each drive, enabling the vehicle bodies to achieve accurate and stable speed tracking while allowing each drive to operate near its efficient operating point, thereby reducing total energy consumption. Experiments demonstrate that our algorithm significantly lowers the total energy consumption of the vehicle queue under standard operating conditions while ensuring speed-tracking accuracy and improving internal stability.

Abstract:
In the classical reach-avoid problem, autonomous mobile robots are tasked to reach a goal while avoiding obstacles. However, it is difficult to provide guarantees on the robot's performance when the obstacles form a narrow gap and the robot is a black-box (i.e. the dynamics are not known analytically, but interacting with the system is cheap). To address this challenge, this paper presents NeuralPARC. The method extends the authors' prior Piecewise Affine Reach-avoid Computation (PARC) method to systems modeled by rectified linear unit (ReLU) neural networks, which are trained to represent parameterized trajectory data demonstrated by the robot. NeuralPARC computes the reachable set of the network while accounting for modeling error, and returns a set of states and parameters with which the black-box system is guaranteed to reach the goal and avoid obstacles. NeuralPARC is shown to outperform PARC, generating provably-safe extreme vehicle drift parking maneuvers in simulations and in real life on a model car, as well as enabling safety on an autonomous surface vehicle (ASV) subjected to large disturbances and controlled by a deep reinforcement learning (RL) policy.

Abstract:
In recent years, the Robotics field has initiated several efforts toward building generalist robot policies through large-scale multi-task Behavior Cloning. However, direct deployments of these policies have led to unsatisfactory performance, where the policy struggles with unseen states and tasks. How can we break through the performance plateau of these models and elevate their capabilities to new heights? In this paper, we propose FLaRe, a large-scale Reinforcement Learning fine-tuning framework that integrates robust pre-trained representations, large-scale training, and gradient stabilization techniques. Our method aligns pre-trained policies towards task completion, achieving state-of-the-art (SoTA) performance both on previously demonstrated and on entirely novel tasks and embodiments. Specifically, on a set of long-horizon mobile manipulation tasks, FLaRe achieves an average success rate of 79.5% in unseen environments, with absolute improvements of +23.6 % in simulation and +30.7 % on real robots over prior SoTA methods. By utilizing only sparse rewards, our approach can enable generalizing to new capabilities beyond the pretraining data with minimal human effort. Moreover, we demonstrate rapid adaptation to new embodiments and behaviors with less than a day of fine-tuning. Videos, code, and appendix can be found on the project website at robot-flare.github.io

Abstract:
Loop closing is critically important in Simultaneous Localization and Mapping (SLAM) due to its ability to correct accumulated localization errors. However, existing methods are hindered by the difficulty of acquiring pose labels and the unreliability of ground truth data. In this paper, we propose U^2 Frame, a unified LiDAR-based loop closing framework that handles both loop closure detection and relative pose estimation without any ground truth training data. Specifically, the natural temporal-spatial correlation in point cloud sequences is first leveraged to supervise the network training, where near scans are treated as positives and vice versa as negatives. A new neural architecture is then constructed to jointly learn highly discriminative local and global features for loop closure detection. Additionally, an effective candidate verification module that exploits high-order geometric information is presented to further filter out false loop closures and estimate precise poses. We extensively evaluate U^2 Frame on multiple datasets according to two tasks derived from loop closing: loop closure detection and loop pose estimation. Comparative experiments demonstrate that our method outperforms existing state-of-the-art supervised techniques and has a strong generalization ability across unseen scenarios. Our code is released at https://github.com/yxin-zhang/U2Frame.

Abstract:
Automatic parameter tuning methods for planning algorithms, which integrate pipeline approaches with learning-based techniques, are regarded as promising due to their stability and capability to handle highly constrained environments. While existing parameter tuning methods have demonstrated considerable success, further performance improvements require a more structured approach. In this paper, we propose a hierarchical architecture for reinforcement learning-based parameter tuning. The architecture introduces a hierarchical structure with low-frequency parameter tuning, mid-frequency planning, and high-frequency control, enabling concurrent enhancement of both upper-layer parameter tuning and lower-layer control through iterative training. Experimental evaluations in both simulated and real-world environments show that our method surpasses existing parameter tuning approaches. Furthermore, our approach achieves first place in the Benchmark for Autonomous Robot Navigation (BARN) Challenge.

Abstract:
As drone use has become more widespread, there is a critical need to ensure safety and security. A key element of this is robust and accurate drone detection and localization. While cameras and other optical sensors like LiDAR are commonly used for object detection, their performance degrades under adverse lighting and environmental conditions. Therefore, this has generated interest in finding more reliable alternatives, such as millimeter-wave (mmWave) radar. Recent research on mmWave radar object detection has predominantly focused on 2D detection of road users. Although these systems demonstrate excellent performance for 2D problems, they lack the sensing capability to measure elevation, which is essential for 3D drone detection. To address this gap, we propose CubeDN, a single-stage end-to-end radar object detection network specifically designed for flying drones. CubeDN overcomes challenges such as poor elevation resolution by utilizing a dual radar configuration and a novel deep learning pipeline. It simultaneously detects, localizes, and classifies drones of two sizes, achieving decimeter-level tracking accuracy at closer ranges with overall 95% average precision (AP) and 85% average recall (AR). Furthermore, CubeDN completes data processing and inference at 10Hz, making it highly suitable for practical applications.

Abstract:
Collision-resilient quadrotors have gained significant attention given their potential for operating in cluttered environments and leveraging impacts to perform agile maneuvers. However, existing designs are typically single-mode: either safeguarded by propeller guards that prevent deformation or deformable but lacking rigidity, which is crucial for stable flight in open environments. This paper introduces DART, a Dual-stiffness Aerial RoboT, that adapts its post-collision response by either engaging a locking mechanism for a rigid mode or disengaging it for a flexible mode, respectively. Comprehensive characterization tests highlight the significant difference in post-collision responses between its rigid and flexible modes, with the rigid mode offering seven times higher stiffness compared to the flexible mode. To understand and harness the collision dynamics, we propose a novel collision response prediction model based on the linear complementarity system theory. We demonstrate the accuracy of predicting collision forces for both the rigid and flexible modes of DART. Experimental results confirm the accuracy of the model and underscore its potential to advance collision-inclusive trajectory planning in aerial robotics.

Abstract:
Real-time ego-motion tracking for endoscope is a significant task for efficient navigation and robotic automation of endoscopy. In this paper, a novel framework is proposed to perform real-time ego-motion tracking for endoscope. Firstly, a multi-modal visual feature learning network is proposed to perform relative pose prediction, in which the motion feature from the optical flow, the scene features and the joint feature from two adjacent observations are all extracted for prediction. Due to more correlation information in the channel dimension of the concatenated image, a novel feature extractor is designed based on an attention mechanism to integrate multi-dimensional information from the concatenation of two continuous frames. To extract more complete feature representation from the fused features, a novel pose decoder is proposed to predict the pose transformation from the concatenated feature map at the end of the framework. At last, the absolute pose of endoscope is calculated based on relative poses. The experiment is conducted on three datasets of various endoscopic scenes and the results demonstrate that the proposed method outperforms state-of-the-art methods. Besides, the inference speed of the proposed method is over 30 frames per second, which meets the real-time requirement. The project page is here: remote-bmxs.netlify.app

Abstract:
Simultaneous Localization and Mapping (SLAM) is essential for precise surgical interventions and robotic tasks in minimally invasive procedures. While recent advancements in 3D Gaussian Splatting (3DGS) have improved SLAM with high-quality novel view synthesis and fast rendering, these systems struggle with accurate depth and surface reconstruction due to multi-view inconsistencies. Simply incorporating SLAM and 3DGS leads to mismatches between the reconstructed frames. In this work, we present Endo-2DTAM, a real-time endoscopic SLAM system with 2D Gaussian Splatting (2DGS) to address these challenges. Endo-2DTAM incorporates a surface normal-aware pipeline, which consists of tracking, mapping, and bundle adjustment modules for geometrically accurate reconstruction. Our robust tracking module combines point-topoint and point-to-plane distance metrics, while the mapping module utilizes normal consistency and depth distortion to enhance surface reconstruction quality. We also introduce a pose-consistent strategy for efficient and geometrically coherent keyframe sampling. Extensive experiments on public endoscopic datasets demonstrate that Endo-2DTAM achieves an RMSE of 1.87 \pm 0.63 \mathbfm m for depth reconstruction of surgical scenes while maintaining computationally efficient tracking, high-quality visual appearance, and real-time rendering. Our code will be released at github.com/lastbasket/Endo-2DTAM.

Abstract:
This paper explores the design and development of a language-based interface for dynamic mission programming of autonomous underwater vehicles (AUVs). The proposed 'Word2Wave' (W2W) framework enables interactive programming and parameter configuration of AUVs for remote subsea missions. The W2W framework includes: (i) a set of novel language rules and command structures for efficient language-to-mission mapping; (ii) a GPT-based prompt engineering module for training data generation; (iii) a small language model (SLM)-based sequence-to-sequence learning pipeline for mission command generation from human speech or text; and (iv) a novel user interface for 2D mission map visualization and human-machine interfacing. The proposed learning pipeline adapts an SLM named T5-Small that can learn language-to-mission mapping from processed language data effectively, providing robust and efficient performance. In addition to a benchmark evaluation with state-of-the-art, we conduct a user interaction study to demonstrate the effectiveness of W2W over commercial AUV programming interfaces. Across participants, W2W-based programming required less than 10% time for mission programming compared to traditional interfaces; it is deemed to be a simpler and more natural paradigm for subsea mission programming with a usability score of 76.25. W2W opens up promising future research opportunities on hands-free AUV mission programming for efficient subsea deployments.

Abstract:
Formulating the intended behavior of a dynamic system can be challenging. Signal temporal logic (STL) is frequently used for this purpose due to its suitability in formalizing comprehensible, modular, and versatile spatiotemporal specifications. Due to scaling issues with respect to the complexity of the specifications and the potential occurrence of non-differentiable terms, classical optimization methods often solve STL-based problems inefficiently. Smoothing and approximation techniques can alleviate these issues but require changing the optimization problem. This paper proposes a novel sampling-based method based on model predictive path integral control to solve optimal control problems with STL cost functions. We demonstrate the effectiveness of our method on benchmark motion planning problems and compare its performance with state-of-the-art methods. The results show that our method efficiently solves optimal control problems with STL costs.

Abstract:
Recently, radars have been widely featured in robotics for their robustness in challenging weather conditions. Two commonly used radar types are spinning radars and phased-array radars, each offering distinct sensor characteristics. Existing datasets typically feature only a single type of radar, leading to the development of algorithms limited to that specific kind. In this work, we highlight that combining different radar types offers complementary advantages, which can be leveraged through a heterogeneous radar dataset. Moreover, this new dataset fosters research in multi-session and multirobot scenarios where robots are equipped with different types of radars. In this context, we introduce the HeRCULES dataset, a comprehensive, multi-modal dataset with heterogeneous radars, FMCW LiDAR, IMU, GPS, and cameras. This is the first dataset to integrate 4D radar and spinning radar alongside FMCW LiDAR, offering unparalleled localization, mapping, and place recognition capabilities. The dataset covers diverse weather and lighting conditions and a range of urban traffic scenarios, enabling a comprehensive analysis across various environments. The sequence paths with multiple revisits and ground truth pose for each sensor enhance its suitability for place recognition research. We expect the HeRCULES dataset to facilitate odometry, mapping, place recognition, and sensor fusion research. The dataset and development tools are available at https://sites.google.com/view/herculesdataset.

Abstract:
Recent advances in diffusion-based robot policies have demonstrated significant potential in imitating multi-modal behaviors. However, these approaches typically require large quantities of demonstration data paired with corresponding robot action labels, creating a substantial data collection burden. In this work, we propose a plan-then-control framework aimed at improving the action-data efficiency of inverse dynamics controllers by leveraging observational demonstration data. Specifically, we adopt a Deep Koopman Operator framework to model the dynamical system and utilize observation-only trajectories to learn a latent action representation. This latent representation can then be effectively mapped to real high-dimensional continuous actions using a linear action decoder, requiring minimal action-labeled data. Through experiments on simulated robot manipulation tasks and a real robot experiment with multi-modal expert demonstrations, we demonstrate that our approach significantly enhances action-data efficiency and achieves high task success rates with limited action data.

Abstract:
This paper presents a novel approach through the design and implementation of Cycloidal Quasi-Direct Drive actuators for legged robotics. The cycloidal gear mechanism, with its inherent high torque density and mechanical robustness, offers significant advantages over conventional designs. By integrating cycloidal gears into the Quasi-Direct Drive framework, we aim to enhance the performance of legged robots, particularly in tasks demanding high torque and dynamic loads, while still keeping them lightweight. Additionally, we develop a torque estimation framework for the actuator using an Actuator Network, which effectively reduces the sim-toreal gap introduced by the cycloidal drive's complex dynamics. This integration is crucial for capturing the complex dynamics of a cycloidal drive, which contributes to improved learning efficiency, agility, and adaptability for reinforcement learning.

Abstract:
Delivery by aerial robots is an emerging topic in many scenarios, such as logistics, construction industry, and disaster response. Compared to the standard styles that deploy cage or sling, grasping style by gripper can handle objects in various shapes. A multi-limbed structure with distributed vectorable rotors called SPIDAR shows a higher potential to grasp large object in a three-dimensional manner. Therefore, in this paper, we focus on the advanced usage of the vectored thrust forces to achieve aerial grasping by this robot. First, a vectored thrust control to avoid the aerointerference on the underwind segments (e.g., grasped object) during flight is proposed. Then, an optimization-based planning method that utilizes redundant vectored thrust forces for firm grasping is developed. Finally, we demonstrate the feasibility of the proposed flight control and grasp planning by performing challenging grasping and transporting motion with a spherical object of which the diameter is 0.6 m. To the best of our knowledge, this work is the first to achieve multi-finger-like grasping to carry a large object in midair.

Abstract:
As legged robots continue to evolve, new control methods are being developed to provide fast, robust, accurate and computationally efficient algorithms for traversing challenging environments. This paper presents a realtime adaptive locomotion controller for quadrupeds, designed to maintain stability and controllability on various surfaces, including highly slippery terrains. The proposed approach optimizes control effort distribution based on the probability of slippage by utilizing a surface-independent adaptation layer. By balancing the robot's redundant kinematic system through rank relaxation-similar to loosening constraints in optimization problems-this method demonstrates significant performance improvements. Unlike Reinforcement Learning (RL) approaches, which depend on pre-trained policies and may struggle to adapt velocity tracking control across different terrains, our method rapidly adjusts to changing conditions, as validated by extensive simulation experiments.

Abstract:
Smart factories enhance production efficiency and sustainability, but emergencies like human errors, machinery failures and natural disasters pose significant risks. In critical situations, such as fires or earthquakes, collaborative robots can assist first-responders by entering damaged buildings and locating missing persons, mitigating potential losses. Unlike previous solutions that overlook the critical aspect of energy management, in this paper we propose REACT, a smart energy-aware orchestrator that optimizes the exploration phase, ensuring prolonged operational time and effective area coverage. Our solution leverages a fleet of collaborative robots equipped with advanced sensors and communication capabilities to explore and navigate unknown indoor environments, such as smart factories affected by fires or earthquakes, with high density of obstacles. By leveraging real-time data exchange and cooperative algorithms, the robots dynamically adjust their paths, minimize redundant movements and reduce energy consumption. Extensive simulations confirm that our approach significantly improves the efficiency and reliability of search and rescue missions in complex indoor environments, improving the exploration rate by 10% over existing methods and reaching a map coverage of 97% under time critical operations, up to nearly 100% under relaxed time constraint.

Abstract:
Simulation is an indispensable tool in the development and testing of autonomous vehicles (AVs), offering an efficient and safe alternative to road testing. An outstanding challenge with simulation-based testing is the generation of safety-critical scenarios, which are essential to ensure that AVs can handle rare but potentially fatal situations. This paper addresses this challenge by introducing a novel framework CaDRE, to generate realistic, diverse, and controllable safetycritical scenarios. Our approach optimizes for both the quality and diversity of scenarios by employing a unique formulation and algorithm that integrates real-world scenarios, domain knowledge, and black-box optimization. We validate the effectiveness of our framework through extensive testing in three representative types of traffic scenarios. The results demonstrate superior performance in generating diverse and highquality scenarios with greater sample efficiency than existing reinforcement learning (RL) and sampling-based methods.

Abstract:
Control Barrier Functions (CBFs) have emerged as a prominent approach to designing safe navigation systems of robots. Despite their popularity, current CBF-based methods exhibit some limitations: optimization-based safe control techniques tend to be either myopic or computationally intensive, and they rely on simplified system models; conversely, the learning-based methods suffer from the lack of quantitative indication in terms of navigation performance and safety. In this paper, we present a new model-free reinforcement learning algorithm called Certificated Actor-Critic (CAC), which introduces a hierarchical reinforcement learning framework and well-defined reward functions derived from CBFs. We carry out theoretical analysis and proof of our algorithm, and propose several improvements in algorithm implementation. Our analysis is validated by two simulation experiments, showing the effectiveness of our proposed CAC algorithm.

Abstract:
Humor is a key element in human interactions, essential for building connections and rapport. To enhance human-robot communication, we developed a humor-aware chat framework that enables robots to deliver contextually appropriate humor. This framework takes into account the interaction environment, and user's profile as well as emotional state. Two GPT models are used to generate responses. The initial one, named sensor-GPT, processes contextual data from the sensor along with the user's response and conversation history to create prompts for the second one, chat-GPT. These prompts can guide the model on how to integrate appropriate humor elements into the conversation, ensuring that the dialogue is both contextually relevant and humorous. Our experiment compared the effectiveness of humor expression between our framework and the GPT-40 model. The results demonstrate that robots using our framework significantly outperform those using GPT-4o in humor expression, extending conversations, and improving overall interaction quality.

Abstract:
Humans are able to convey different messages using only touch. Equipping robots with the ability to under-stand social touch adds another modality in which humans and robots can communicate. In this paper, we present a social gesture recognition system using a fabric-based, large-scale tactile sensor placed onto the arms of a humanoid robot. We built a social gesture dataset using multiple participants and extracted temporal features for classification. By collecting tactile data on a humanoid robot, our system provides insights into human-robot social touch, and displays that the use of fabric based sensors could be a potential way of advancing the development of spHRI systems for more natural and effective communication.

Abstract:
This paper proposes a new variable gain sliding mode filter augmented by variable windowing for achieving smooth and reactive response over a broad range of input frequencies. The proposed filter can be seen as a synergistic combination of Kikuuwe et al.'s [1] sliding mode filter with varying gain and sliding surfaces and a novel varying-length moving-window algorithm. In all schemes, the estimated input speed is employed for rendering the filter parameters between low and high settings. The discrete-time algorithm of the proposed filter does not suffer from chattering due to implicit (backward) Euler method. The effectiveness of the proposed filter in achieving better trade-off between noise attenuation and signal preservation is validated in both simulation and experimental scenarios by using the velocity signal obtained by differentiation of quantized position data.

Abstract:
The use of machine learning to investigate grasp affordances has received extensive attention over the past several decades. The existing literature provides a robust basis to build upon, though a number of aspects may be improved. Results commonly work in terms of grasp configuration, with little consideration for the manner in which the grasp may be (re-)produced, from a reachability and trajectory planning perspective. We propose a different perspective on grasp affordance learning, explicitly accounting for grasp synthesis; that is, the manner in which manipulator kinematics are used to allow materialization of grasps. The approach allows to explicitly map the grasp policy space in terms of generated grasp types and associated grasp quality. Results of application to a range of objects illustrate merit of the method and highlight the manner in which it may promote a greater degree of explainability for otherwise intransparent reinforcement processes.

Abstract:
We present a deterministic approach for the localization of an Unmanned Aerial Vehicle (UAV) in a known indoor environment by using only a few downward distance measurements and the corresponding odometries between measurements. For each distance measurement and odometry, we look at the preimage of that distance measurement under the downwards distance function combined with the corresponding odometry where the motion between every two measurements has four degrees of freedom: three of translation and one of azimuth change. The intersection of these preimages yields the set of all possible locations for the UAV. In this work, we present an efficient method for approximating that intersection of preimages. We perform a spatial subdivision search, which splits only voxels containing that intersection. We present a novel technique, based on geometric insights, for correctly evaluating whether a voxel indeed contains a true localization. This technique is also robust under different kinds of errors that might occur. Our method is guaranteed to contain the ground truth location, and its runtime complexity is output sensitive, in the Hausdorff dimension and measure of the resulting intersection of preimages. We demonstrate the effectiveness of this method in various indoor scenarios, showing that it can be used to significantly decrease the uncertainty of localization when solving the kidnapped robot problem in simulation and on a physical drone. Our method can be performed in real-time. Furthermore, our method requires only a map of the environment, odometry and ToF sensors, which is advantageous in terms of cost, privacy and transmission bandwidth. Our open-source software and supplementary materials are available at https://github.com/TAU-CGL/uav-fdml-public.

Abstract:
Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities - including vision, touch, and audio - to fill in gaps from partial observation. For example, when vision is occluded reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities for which large datasets are not readily available by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In the context of robot manipulation, we show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound in a zero-shot setting, such as multimodal prompting, compositional cross-modal prompting, and descriptions of objects it interacts with. We show that the same recipe is applicable to widely different generalist policies, including both diffusion-based generalist policies and large vision-language-action (VLA) models. Extensive experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.

Abstract:
Untethered magnetic manipulation of biomedical millirobots has a high potential for minimally invasive surgical applications. However, it is still challenging to exert high actuation forces on the small robots over a large distance. Permanent magnets offer stronger magnetic torques and forces than electromagnetic coils, however, feedback control is more difficult. As proven by Earnshaw's theorem, it is not possible to achieve a stable magnetic trap in 3D by static permanent magnets. Here, we report a stable 2D magnetic force trap by an array of permanent magnets to control a millirobot. The trap is located in an open space with a tunable distance to the magnet array in the range of 20 − 120mm, which is relevant to human anatomical scales. The design is achieved by a novel GPU-accelerated optimization algorithm that uses mean squared error (MSE) and Adam optimizer to efficiently compute the optimal angles for any number of magnets in the array. The algorithm is verified using numerical simulation and physical experiments with an array of two magnets. A millirobot is successfully trapped and controlled to follow a complex trajectory. The algorithm demonstrates high scalability by optimizing the angles for 100 magnets in under three seconds. Moreover, the optimization workflow can be adapted to optimize a permanent magnet array to achieve the desired force vector fields.

Abstract:
Human-robot collaboration has great potential in enhancing robot deployment at close proximity with people, especially in non-dyadic collaborations with multiple users. However, autonomous systems that are capable of handling such interactions in a physical domain are rare. This work proposes TriHRCBot, a robotic architecture designed to handle a collaborative task that involves two concurrent users. The architecture is sensitive to position, orientation, body lengths and state of the users in the interaction, and uses this information to adjust the pose of a target object to enable both users to act on it at the same time. A robotic system equipped with the TriHRCBot architecture was deployed in a user study in which 30 participants from the BAE Systems Academy for Skills and Knowledge Centre interacted with it during such multi-user collaborative task. The study shows that the participants considered TriHRCBot acceptable for the task at hand. The code repository of the architecture is publicly available11Code repository: https://github.com/francescosemeraro/TriHRCBot,

Abstract:
The use of unmanned aerial vehicles (UAVs) has increased rapidly, leading to an effort to accurately and efficiently track UAVs. Many existing approaches utilize YOLO, a state-of-the-art object detection model, in conjunction with object tracking algorithms to detect and follow UAVs in real-time. However, these systems typically focus on a single method, without considering alternative tracking methods. In this paper, we present an experimental comparison of multiple object tracking algorithms integrated with YOLOv8, offering a comprehensive evaluation of their performance in UAV tracking scenarios. First, the model size was optimized to determine the best balance between speed and accuracy. Then, various tracking methods are tested to determine the most effective combination. The YOLOv8 model combined with a Kernelized Correlation Filter outperformed various other trackers in varying environmental scenarios, with a combined success rate and a tracking accuracy of 0.8041. This approach was further implemented in real-time on a Jetson Orion Nano GPU, utilizing a pan-tilt gimbal and an Intel RealSense D435i camera. Running at 20 FPS, the system demonstrated robustness and stability during motion and various environmental scenarios, highlighting its potential for integration into applications such as ground-based UAV surveillance.

Abstract:
Endoscopic submucosal dissection (ESD) is a procedure targeted for early gastrointestinal cancer. Traction plays a crucial role in enhancing the efficiency of cutting lesions, thereby reducing procedural complexity and duration. From the perspective of traction devices, current non-magnetic ones hold shortcomings in complicating the workspace in directional tissue manipulation; Current magnetic traction devices cannot be prepared before the procedure, and require the withdrawal of endoscope in the midway to re-introduce the magnetic retractor to the lesion site. Towards these plights, this paper introduces a robotic flexible magnetic retractor designed for tissue manipulation during ESD. Precisely, the flexible prototype can be seamlessly inserted through the instrument channel of an endoscope to the lesion site without the need for endoscope withdrawal. Moreover, the introduction of robotic magnetic actuation enhances the agile control of magnetic retractors while alleviating the surgeon's workload in magnetic-retractor-assisted ESD. The experimental results validate the functionality and efficacy of the prototype magnetic retractor in magnetic traction-assisted ESD procedures. The retractor demonstrated its ability to provide adequate traction and accomplish clinical tasks. This innovative approach holds promise for enhancing the efficiency and outcomes of ESD procedures, offering a compelling alternative to traditional traction methods.

Abstract:
Accurate pose estimation of surgical tools in Robot-assisted Minimally Invasive Surgery (RMIS) is essential for surgical navigation and robot control. While traditional marker-based methods offer accuracy, they face challenges with occlusions, reflections, and tool-specific designs. Similarly, supervised learning methods require extensive training on annotated datasets, limiting their adaptability to new tools. Despite their success in other domains, zero-shot pose estimation models remain unexplored in RMIS for pose estimation of surgical instruments, creating a gap in generalising to unseen surgical tools. This paper presents a novel 6 Degrees of Freedom (DoF) pose estimation pipeline for surgical instruments, leveraging state-of-the-art zero-shot RGB-D models like the Foundation-Pose and SAM-6D. We advanced these models by incorporating vision-based depth estimation using the RAFT-Stereo method, for robust depth estimation in reflective and textureless environments. Additionally, we enhanced SAM-6D by replacing its instance segmentation module, Segment Anything Model (SAM), with a fine-tuned Mask R-CNN, significantly boosting segmentation accuracy in occluded and complex conditions. Extensive validation reveals that our enhanced SAM-6D surpasses FoundationPose in zero-shot pose estimation of unseen surgical instruments, setting a new benchmark for zero-shot RGB-D pose estimation in RMIS. This work enhances the generalisability of pose estimation for unseen objects and pioneers the application of RGB-D zero-shot methods in RMIS.

Abstract:
Surgical automation has the capability to improve the consistency of patient outcomes and broaden access to advanced surgical care in underprivileged communities. Shared autonomy, where the robot automates routine subtasks while the surgeon retains partial teleoperative control, offers great potential to make an impact. In this paper we focus on one important skill within surgical shared autonomy: Automating robotic assistance to maximize visual exposure and apply tissue tension for dissection and cautery. Ensuring consistent exposure to visualize the surgical site is crucial for both efficiency and patient safety. However, achieving this is highly challenging due to the complexities of manipulating deformable volumetric tissues that are prevalent in surgery. To address these challenges we propose MEDiC, a framework for autonomous surgical robotic assistance to Maximizing Exposure for Dissection and Cautery. We integrate a differentiable physics model with perceptual feedback to achieve our two key objectives: 1) Maximizing tissue exposure and applying tension for a specified dissection site through visual-servoing control and 2) Selecting optimal control positions for a dissection target based on deformable Jacobian analysis. We quantitatively assess our method through repeated real robot experiments on a tissue phantom. Our visual-servoing and optimal control position selection achieve success rate of 100% and 82% respectively in ablation study. We also showcase our framework's capabilities through dissection experiments using shared autonomy on real animal tissue.

Abstract:
In recent years, imitation learning has made progress in the field of robotic manipulation. However, it still faces challenges when addressing complex long-horizon tasks with deformable objects, such as high-dimensional state spaces, complex dynamics, and multimodal action distributions. Traditional imitation learning methods often require a large amount of data and encounter distributional shifts and accumulative errors in these tasks. To address these issues, we propose a data-efficient general learning framework (DeformPAM) based on preference learning and reward-guided action selection. DeformPAM decomposes long-horizon tasks into multiple action primitives, utilizes 3D point cloud inputs and diffusion models to model action distributions, and trains an implicit reward model using human preference data. During the inference phase, the reward model scores multiple candidate actions, selecting the optimal action for execution, thereby reducing the occurrence of anomalous actions and improving task completion quality. Experiments conducted on three challenging real-world long-horizon deformable object manipulation tasks demonstrate the effectiveness of this method. Results show that DeformPAM improves both task completion quality and efficiency compared to baseline methods even with limited data. Code and data will be available at deform-pam.robotflow.ai.

Abstract:
Physics-based simulation is essential for developing and evaluating robot manipulation policies, particularly in scenarios involving deformable objects and complex contact interactions. However, existing simulators often struggle to balance computational efficiency with numerical accuracy, especially when modeling deformable materials with frictional contact constraints. We introduce an efficient subspace representation for the Incremental Potential Contact (IPC) method, leveraging model reduction to decrease the number of degrees of freedom. Our approach decouples simulation complexity from the resolution of the input model by representing elasticity in a low-resolution subspace while maintaining collision constraints on an embedded high-resolution surface. Our barrier formulation ensures intersection-free trajectories and configurations regardless of material stiffness, time step size, or contact severity. We validate our simulator through quantitative experiments with a soft bubble gripper grasping and qualitative demonstrations of placing a plate on a dish rack. The results demonstrate our simulator's efficiency, physical accuracy, computational stability, and robust handling of frictional contact, making it well-suited for generating demonstration data and evaluating downstream robot training applications. More details and supplementary material are on the website: https://sites.google.com/view/embedded-ipc.

Abstract:
Navigating and inspecting confined space is crucial for the aerospace and healthcare industries. Exploring smaller and narrower spaces allows for problems to be identified earlier, preventing negative outcomes for patients and equipment. The challenge is to scale down the navigation probe while preserving degrees of freedom (DOF) and functionality. Dielectric elastomer actuators (DEAs) are promising probe candidates because they are solid-state, electrical-driven, and can be scaled down favorably. This work demonstrates a modular 2-DOF DEA miniature probe with an embedded CMOS sensor for visual data acquisition. The modularity achieved by a novel connector system enables switching between single and dual DEA probes based on 2D or 3D pathway structures. The probes can be controlled using a pocket-sized circuit with two knobs to turn. We present the operating mechanism, device assembly, fabrication, and characterization of DEA bending actuators with widths below 2 mm. In the end, we demonstrate the ability of devices to navigate through various complex and confined pathways.

Abstract:
Artificial muscles play a crucial role in musculoskeletal robotics and prosthetics to approximate the force-generating functionality of biological muscle. However, current artificial muscle systems are typically limited to either contraction or extension, not both. This limitation hinders the development of fully functional artificial musculoskeletal systems. We address this challenge by introducing an artificial antagonistic muscle system capable of both contraction and extension. Our design integrates non-stretchable electrohydraulic soft actuators (HASELs) with electrostatic clutches within an antagonistic musculoskeletal framework. This configuration enables an antagonistic joint to achieve a full range of motion without displacement loss due to tendon slack. We implement a synchronization method to coordinate muscle and clutch units, ensuring smooth motion profiles and speeds. This approach facilitates seamless transitions between antagonistic muscles at operational frequencies of up to 3.2 Hz. While our prototype utilizes electrohydraulic actuators, this muscle-clutch concept is adaptable to other non-stretchable artificial muscles, such as McKibben actuators, expanding their capability for extension and full range of motion in antagonistic setups. Our design represents a significant advancement in the development of fundamental components for more functional and efficient artificial musculoskeletal systems, bringing their capabilities closer to those of their biological counterparts.

Abstract:
Accurate ground reaction force (GRF) estimation can significantly improve the adaptability of legged robots in various real-world applications. For instance, with estimated GRF and contact kinematics, the locomotion control and planning assist the robot in overcoming uncertain terrains. The canonical momentum-based methods, formulated as nonlinear observers, do not fully address the noisy measurements and the dependence between floating-base states and the generalized momentum dynamics. In this paper, we present a simultaneous ground reaction force and state estimation framework for legged robots, which systematically addresses the sensor noise and the coupling between states and dynamics. With the floating base orientation estimated separately, a decentralized Moving Horizon Estimation (MHE) method is implemented to fuse the robot dynamics, proprioceptive sensors, exteroceptive sensors, and deterministic contact complementarity constraints in a convex windowed optimization. The proposed method is shown to be capable of providing accurate GRF and state estimation on several legged robots, including the custom-designed humanoid robot Bucky, the open-source educational planar bipedal robot STRIDE, and the quadrupedal robot Unitree Go1, with a frequency of 200Hz and a past time window of 0.04s.

Abstract:
The Koopman operator framework has shown promising results in enabling the analysis of nonlinear dynamics into an infinite-dimensional linear representation. Koopman direct encoding (KDE) is a model-based approach that utilizes inner products and compositions in a Hilbert space to compute the Koopman operator. However, it has primarily been applied to autonomous systems and simulation environments. Here, we extend the application of KDE to nonautonomous systems and real-world environments by introducing Koopman direct encoding-based model predictive control (KDE-MPC). It was validated on nonlinear electromechanical systems with segmented dynamic conditions, such as contact-noncontact transitions, which pose challenges for modeling and control. Simulation results demonstrate a more stable and smoother position profile compared to proportional-integral-derivative control, particularly at discontinuous boundaries. KDE-MPC was also applied to real-world systems, achieving similar position tracking performance to simulation results. We anticipate that KDE-MPC will offer a viable solution for complex robotic control challenges.

Abstract:
This paper proposes a configurable and scalable framework based on Controlled Robot Language with Frame Semantics (FrameCRL) for plan generation. Given natural language instructions, FrameCRL constructs an equivalent formal semantic formulation in the form of discourse representation structures (DRS). Imperative verbs are extracted from the semantic structures as keys to anchor relevant semantic frames from FrameNet, and the selected semantic frames are used to construct goal statements in planning language. Non-imperative statements are further analyzed to generate object specifications and the initial state of the planning problem. These generated statements are then merged into a single planning script, which can be solved directly by the integrated planner. The performance of FrameCRL was evaluated on various natural language corpora and compared with large language models (LLM) based methods in plan generation. The results demonstrated the outperformance of FrameCRL in generating high-quality plans and its capability to handle large context scenarios. The FrameCRL was also tested on pick-and-place tasks using a dual-arm robot and it showcased a robust performance in linguistic understanding.

Abstract:
Mobile manipulation is a crucial problem in various real-world applications. However, existing methods have demonstrated unsatisfactory training efficiency and sparse rewards, requiring complex coordination strategies between the mobile base and arm. In this paper, we propose RM-Planner, a planning method for mobile manipulation tasks in unknown complex environments. By adopting a two-layer hierarchical framework, we utilize a whole-body Model Predictive Control (MPC)-based low-level planner to track subgoals and generate aggressive but safe joint commands throughout the entire manipulation process, while a Reinforcement Learning (RL)based high-level policy directly uses 3D point cloud representations of the environment, guiding the robot to achieve optimal manipulation postures based on current observations and specific task objectives. We conduct extensive simulations and real-world experiments, where RM-planner significantly outperforms state-of-the-art methods. Our code will be released at https://github.com/SYSU-RoboticsLab/RM-Planner.git.

Abstract:
Tasks involving deformable linear objects (DLOs) are prevalent in daily life but pose significant challenges due to their infinite degrees of freedom and underactuated nature. Frequent contact between DLOs and surrounding objects with unknown physical parameters, such as friction, further complicates their manipulation. Performing tasks like routing ropes through a hole requires gentle yet robust manipulation, making it particularly challenging. Previous research has not adequately addressed general DLO manipulation tasks that involve intensive contact, especially in environments with rough surfaces. This paper presents a robust and delicate manipulation learning approach for the DLO routing task, leveraging reinforcement learning (RL) and diffusion policy. First, reinforcement learning agents are trained separately for rope insertion and pulling. During training, the agents are encouraged to minimize rope tension throughout task execution in environments with randomized friction to achieve delicate motion. Next, the rollouts from these agents are collected as expert demonstrations to train a diffusion policy. Our approach generates delicate motions to prevent the rope from being damaged or getting stuck on rough surfaces while remaining robust against environmental disturbances. Please refer to our project page: https://lmeee.github.io/DLOPull/

Abstract:
Manipulating transparent objects presents significant challenges due to the complexities introduced by their reflection and refraction properties, which considerably hinder the accurate estimation of their 3D shapes. To address these challenges, we propose a single-view RGB-D-based depth completion framework, TransDiff, that leverages the Denoising Diffusion Probabilistic Models(DDPM) to achieve material-agnostic object grasping in desktop. Specifically, we leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information, ensuring more accurate depth estimation in scenarios involving transparent objects. Additionally, we propose a novel training method to better align the noisy depth and RGB image features, which are used as conditions to refine depth estimation step by step. Finally, we utilized an improved inference process to accelerate the denoising procedure. Through comprehensive experimental validation, we demonstrate that our method significantly outperforms the baselines in both synthetic and real-world benchmarks with acceptable inference time. The demo of our method can be found on: https://wang-haoxiao.github.io/TransDiff/

Abstract:
End-to-end models are emerging as the mainstream in autonomous driving perception. However, the inability to meticulously deconstruct their internal mechanisms results in diminished development efficacy and impedes the establishment of trust. Pioneering in the issue, we present the Independent Functional Module Evaluation for Bird's-EyeView Perception Model (BEV-IFME), a novel framework that juxtaposes the module's feature maps against Ground Truth within a unified semantic Representation Space to quantify their similarity, thereby assessing the training maturity of individual functional modules. The core of the framework lies in the process of feature map encoding and representation aligning, facilitated by our proposed two-stage Alignment AutoEncoder, which ensures the preservation of salient information and the consistency of feature structure. The metric for evaluating the training maturity of functional modules, Similarity Score, demonstrates a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, attesting to the framework's reliability for assessment purposes.

Abstract:
We introduce a novel distributed source seeking framework, DIAS, designed for multi-robot systems in scenarios where the number of sources is unknown and potentially exceeds the number of robots. Traditional robotic source seeking methods typically focused on directing each robot to a specific strong source and may fall short in comprehensively identifying all potential sources. DIAS addresses this gap by introducing a hybrid controller that identifies the presence of sources and then alternates between exploration for data gathering and exploitation for guiding robots to identified sources. It further enhances search efficiency by dividing the environment into Voronoi cells and approximating source density functions based on Gaussian process regression. Additionally, DIAS can be integrated with existing source seeking algorithms. We compare DIAS with existing algorithms, including DOSS and GMES in simulated gas leakage scenarios where the number of sources outnumbers or is equal to the number of robots. The numerical results show that DIAS outperforms the baseline methods in both the efficiency of source identification by the robots and the accuracy of the estimated environmental density function.

Abstract:
Lifelong Multi-Agent Path Finding (LMAPF) repeatedly finds collision-free paths for multiple agents that are continually assigned new goals when they reach current ones. Recently, this field has embraced learning-based methods, which reactively generate single-step actions based on individual local observations. However, it is still challenging for them to match the performance of the best search-based algorithms, especially in large-scale settings. This work proposes an imitation-learning-based LMAPF solver that introduces a novel communication module as well as systematic single-step collision resolution and global guidance techniques. Our proposed solver, Scalable Imitation Learning for LMAPF (SILLM), inherits the fast reasoning speed of learning-based methods and the high solution quality of search-based methods with the help of modern GPUs. Across six large-scale maps with up to \mathbf1 0, 0 0 0 agents and varying obstacle structures, SILLM surpasses the best learning- and search-based baselines, achieving average throughput improvements of \mathbf137.7 % and \mathbf1 6. 0 %, respectively. Furthermore, SILLM also beats the winning solution of the 2023 League of Robot Runners, an international LMAPF competition. Finally, we validated SILLM with 10 real robots and \mathbf1 0 0 virtual robots in a mock warehouse environment.

Abstract:
Low-cost, vision-centric 3D perception systems for autonomous driving have made significant progress in recent years, narrowing the gap to expensive LiDAR-based methods. The primary challenge in becoming a fully reliable alternative lies in robust depth prediction capabilities, as camera-based systems struggle with long detection ranges and adverse lighting and weather conditions. In this work, we introduce HyDRa, a novel camera-radar fusion architecture for diverse 3D perception tasks. Building upon the principles of dense Bird's-EyeView (BEV)-based architectures, HyDRa introduces a hybrid fusion approach to combine the strengths of complementary camera and radar features in two distinct representation spaces. Our Height Association Transformer module leverages radar features already in the perspective view to produce more robust and accurate depth predictions. In the BEV, we refine the initial sparse representation by a Radar-weighted Depth Consistency. HyDRa achieves a new state-of-the-art for cameraradar fusion of 64.2 NDS (+1.8) and 58.4 AMOTA (+1.5) on the public nuScenes dataset. Moreover, our new semantically rich and spatially accurate BEV features can be directly converted into a powerful occupancy representation, beating all previous camera-based methods on the Occ3D benchmark by an impressive 3.7 mIoU. Code and models are available at https://github.com/phi-wol/hydra.

Abstract:
Reliable object localization is a critical challenge in drone-based aerial manipulation, particularly when objects are outside the camera's field of view. This paper presents a new approach to enhance drone reliability in aerial grasping tasks by integrating a 1D time-of-flight range sensor with a vision-based localization system. The range sensor, positioned beneath the drone, generates a detailed point cloud of the ground beneath the drone, allowing for precise object localization even when the drone hovers directly above the target. By combining visual tracking with realtime distance measurements, our system achieves a 96 % grasp success rate across 128 trials with diverse objects, representing a significant improvement over previous approaches. This method enables zero-shot grasping without prior knowledge of the objects, increasing versatility and robustness in complex, unstructured environments. The open-source software and hardware design of the platform provide a foundation for further research and development in the field of autonomous aerial manipulation.

Abstract:
Human-aware navigation (HAN) aims to build autonomous agents that robustly and naturally navigate in human-centered environments. Due to the complex and dynamic nature of this task, existing approaches typically rely on sophisticated pipelines that separately process perception and decision-making to solve it. In this work, we propose an Obstruction Distance Vector based End-to-End Model (ODVEEM), using monocular vision for navigation around humans. The Obstruction Distance Vector (ODV) is an intermediate representation in our model, leveraged to describe the Obstruction Distance to the first future collision in all possible directions in the horizontal field of view. As ODV cannot be calculated directly in the real world, we design a neural network for ODV estimation, formulating it as a classification problem with auxiliary proxy tasks, which play a key role in effectively predicting the implicit future motion of nearby humans. Taking advantage of ODV, ODVEEM supervised by human behavioral heuristics is employed to guide the agent to reach a goal efficiently and avoid potential collisions. Several challenging experiments show our method's substantial improvement over a number of baseline methods, attaining solid performance with zero-shot transfer to unseen simulated and real-world environments.

Abstract:
This paper explores zero-shot Vision-and-Language Navigation (VLN), enabling agents to generalize navigation to unseen data classes. Most current approaches rely on large models, but these are not specifically tailored for VLN, lacking direct learning from navigation environments and slowing down agents due to their overwhelming size. To tackle this, we propose Map-Semantic Zero-shot Navigation (Map-SemNav), which does not rely on large models for navigation planning. Map-SemNav utilizes three key cues: direction, object, and scene, to acquire relational knowledge instead of memorizing specific classes, which enables generalization to unseen data. Direction is guided by a top-down semantic map, while object and scene information is decoupled from environment knowledge. Extensive experiments demonstrate that Map-SemNav outperforms state-of-the-art large model-based methods in zero-shot VLN tasks within continuous environments, while also offering higher efficiency due to its simplified architecture.

Abstract:
This paper presents the addition of two models for added mass to the fluid-field spring-loaded inverted pendulum (FF-SLIP) Model for legged swimming. The relative ability of these models to capture the increased fluid forces due to virtual mass displacement is evaluated using a two-legged swimming robot, Tadpole. We show that a simple addition to our reduced-order model can predict fluid-leg interaction forces while remaining computationally efficient.

Abstract:
This paper proposes a novel dual-filter architecture utilizing RGB-D camera data and dynamic control barrier functions (D-CBFs) for real-time obstacle avoidance in unstructured environments. The proposed method efficiently handles static, suddenly appearing, and dynamic obstacles, maintaining consistent computational performance across diverse scenarios. To achieve this, two key challenges must be addressed. First, the substantial volume of pixel and depth map data requires robust, real-time processing for efficient D-CBF construction. Second, constructing D-CBFs for each obstacle in multi-obstacle scenarios increases optimization solver time. To address these challenges, we adapt the concept of salient object detection (SOD), proposing an enhanced FastSOD (E-FastSOD) method for rapid risk area identification. This approach rapidly filters out low-risk areas, while high-risk regions are mathematically represented utilizing the proposed enhanced minimal bounding circle (E-MBC) technique. We differentiate static and dynamic obstacles by comparing current and previous MBC states, employing Kalman filtering for obstacle state prediction. This setup enables efficient online D-CBF construction for each MBC, balancing computational speed with accurate obstacle representation. Subsequently, the second filter establishes buffer zones around established D-CBFs, activating only those corresponding to zones the robot actually enters, rather than all D-CBFs to increase real-time performance. We prove the system's safety and asymptotic stabilization under this architecture. Simulated and real-world experiments validate our method, demonstrating an equipped mobile robot's ability to accomplish tasks while ensuring safety across diverse, unknown scenarios.

Abstract:
Grasping a variety of objects remains a key challenge in the development of versatile robotic systems. The human hand is remarkably dexterous, capable of grasping and manipulating objects with diverse shapes, mechanical properties, and textures. Inspired by how humans use two fingers to pick up thin and large objects such as fabric or sheets of paper, we aim to develop a gripper optimized for grasping such deformable objects. Observing how the soft and flexible fingertip joints of the hand approach and grasp thin materials, a hybrid gripper design that incorporates both soft and rigid components was proposed. The gripper utilizes a soft pneumatic ring wrapped around a rigid revolute joint to create a flexible two-fingered gripper. Experiments were conducted to characterize and evaluate the gripper's performance in handling sheets of paper and other objects. Compared to rigid grippers, the proposed design improves grasping efficiency and reduces the gripping distance by up to eightfold.

Abstract:
Disassembly is a critical challenge in maintenance and service tasks, particularly in high-precision operations such as electric vehicle (EV) battery recycling. Tasks like prying-open sealed battery covers require precise manipulation and controlled force application. In our approach we collect human demonstrations using a motion capture system, enabling the robot to learn from human-expert disassembly strategies. These demonstrations train a bimanual robotic system in which one arm exerts force with a specialized tool while the other manipulates and removes sealed components. Our method builds on a diffusion-based policy and integrates real-time force sensing to adapt its actions as contact conditions change. We decompose the demonstrations into distinct sub-tasks and apply data augmentation, thereby reducing the number of demonstrations needed and mitigating potential task failures. Our results show that the proposed method, even with a small dataset, achieves a high task success rate and efficiency compared to a standard diffusion technique. We demonstrate in a real-world application that the bimanual system effectively executes chiseling and peeling actions to separate bonded sheet from a substrate.

Abstract:
Globally localizing a mobile robot in a known map is often a foundation for enabling robots to navigate and operate autonomously. In indoor environments, traditional Monte Carlo localization based on occupancy grid maps is considered the gold standard, but its accuracy is limited by the representation capabilities of the occupancy grid map. In this paper, we address the problem of building an effective map representation that allows to accurately perform probabilistic global localization. To this end, we propose an implicit neural map representation that is able to capture positional and directional geometric features from 2D LiDAR scans to efficiently represent the environment and learn a neural network that is able to predict both, the non-projective signed distance and a direction-aware projective distance for an arbitrary point in the mapped environment. This combination of neural map representation with a lightweight neural network allows us to design an efficient observation model within a conventional Monte Carlo localization framework for pose estimation of a robot in real time. We evaluated our approach to indoor localization on a publicly available dataset for global localization and the experimental results indicate that our approach is able to more accurately localize a mobile robot than other localization approaches employing occupancy or existing neural map representations. In contrast to other approaches employing an implicit neural map representation for 2D LiDAR localization, our approach allows to perform real-time pose tracking after convergence and near real-time global localization. The code of our approach is available at: https://github.com/PRBonn/enm-mcl.

Abstract:
The uncertainty quantification of sensor measurements coupled with deep learning networks is crucial for many robotics systems, especially for safety-critical applications such as self-driving cars. This paper develops an uncertainty quantification approach in the context of visual localization for autonomous driving, where locations are selected based on images. Key to our approach is to learn the measurement uncertainty using light-weight sensor error model, which maps both image feature and semantic information to 2-dimensional error distribution. Our approach enables uncertainty estimation conditioned on the specific context of the matched image pair, implicitly capturing other critical, unannotated factors (e.g., city vs. highway, dynamic vs. static scenes, winter vs. summer) in a latent manner. We demonstrate the accuracy of our uncertainty prediction framework using the Ithaca365 dataset, which includes variations in lighting and weather (sunny, night, snowy). Both the uncertainty quantification of the sensor+network is evaluated, along with Bayesian localization filters using unique sensor gating method. Results show that the measurement error does not follow a Gaussian distribution with poor weather and lighting conditions, and is better predicted by our Gaussian Mixture model.

Abstract:
We propose a novel framework COLLAGE for generating collaborative agent-object-agent interactions by leveraging large language models (LLMs) and hierarchical motion-specific vector-quantized variational autoencoders (VQ-VAEs). Our model addresses the lack of rich datasets in this domain by incorporating the knowledge and reasoning abilities of LLMs to guide a generative diffusion model. The hierarchical VQ-VAE architecture captures different motion-specific characteristics at multiple levels of abstraction, avoiding redundant concepts and enabling efficient multi-resolution representation. We introduce a diffusion model that operates in the latent space and incorporates LLM-generated motion planning cues to guide the denoising process, resulting in prompt-specific motion generation with greater control and diversity. Experimental results on the CORE-4D, and InterHuman datasets demonstrate the effectiveness of our approach in generating realistic and diverse collaborative human-object-human interactions, outperforming state-of-the-art methods. Our work opens up new possibilities for modeling complex interactions in various domains, such as robotics, graphics and computer vision. Paper website: https://collagemotion.github.io/

Abstract:
Multimodal large language models (MLLMs) have shown remarkable performance across various visual understanding tasks. However, most existing MLLMs still lack image detail perception, limiting their effectiveness in tasks that require detailed visual information. In this paper, we introduce Percept-DriveLM, a novel MLLM designed to tackle the fine-grained perception challenges in autonomous driving tasks. At the core of our model is the Visual Fusion Module, which integrates several innovative components: a dynamic resolution mechanism that combines both high and low resolution features, and an RoI conditional mechanism to incorporate object/region-level features identified by offline detectors, further refining the model's fine-grained perception abilities. Trained in a two-stage process, our model demonstrates exceptional performance, outperforming existing MLLMs with comparable parameter sizes and excelling in both autonomous driving perception and general vision-language tasks. The effectiveness of our approach is validated through extensive empirical studies. Code will be available at https://github.com/DebuggerSunfz/PerceptDriveLM.

Abstract:
Enabling a high-degree-of-freedom robot to learn specific skills is a challenging task due to the complexity of robotic dynamics. Reinforcement learning (RL) has emerged as a promising solution; however, addressing such problems requires the design of multiple reward functions to account for various constraints in robotic motion. Existing approaches typically sum all reward components indiscriminately to optimize the RL value function and policy. We argue that this uniform inclusion of all reward components in policy optimization is inefficient and limits the robot's learning performance. To address this, we propose an Automated Hybrid Reward Scheduling (AHRS) framework based on Large Language Models (LLMs). This paradigm dynamically adjusts the learning intensity of each reward component throughout the policy optimization process, enabling robots to acquire skills in a gradual and structured manner. Specifically, we design a multi-branch value network, where each branch corresponds to a distinct reward component. During policy optimization, each branch is assigned a weight that reflects its importance, and these weights are automatically computed based on rules designed by LLMs. The LLM generates a rule set in advance, derived from the task description, and during training, it selects a weight calculation rule from the library based on language prompts that evaluate the performance of each branch. Experimental results demonstrate that the AHRS method achieves an average \mathbf6. 4 8 % performance improvement across multiple high-degree-of-freedom robotic tasks.

Abstract:
Handling pre-crash scenarios is still a major challenge for self-driving cars due to limited practical data and human-driving behavior datasets. We introduce DISC (Driving Styles In Simulated Crashes), one of the first datasets designed to capture various driving styles and behaviors in precrash scenarios for mixed autonomy analysis. DISC includes over 8 classes of driving styles/behaviors from hundreds of drivers navigating a simulated vehicle through a virtual city, encountering rare-event traffic scenarios. This dataset enables the classification of pre-crash human driving behaviors in unsafe conditions, supporting individualized trajectory prediction based on observed driving patterns. By utilizing a custom-designed VR-based in-house driving simulator, TRAVERSE, data was collected through a driver-centric study involving human drivers encountering twelve simulated accident scenarios. This dataset fills a critical gap in human-centric driving data for rare events involving interactions with autonomous vehicles. It enables autonomous systems to better react to human drivers and optimize trajectory prediction in mixed autonomy environments involving both human-driven and self-driving cars. In addition, individual driving behaviors are classified through a set of standardized questionnaires, carefully designed to identify and categorize driving behavior traits. We correlate data features with driving behaviors, showing that the simulated environment reflects real-world driving styles. DISC is the first dataset to capture how various driving styles respond to accident scenarios, offering significant potential to enhance autonomous vehicle safety and driving behavior analysis in mixed autonomy environments.

Abstract:
The integration of data from various sensor modalities (e.g. camera and LiDAR) constitutes a prevalent methodology within the ambit of autonomous driving scenarios. Recent advancements in efficient point cloud transformers have underscored the efficacy of integrating information in sparse formats. When it comes to fusion, since image patches are dense in pixel space with ambiguous depth, it necessitates additional design considerations for effective fusion. In this paper, we conduct a comprehensive exploration of design choices for transformer-based sparse camera-LiDAR fusion. This investigation encompasses strategies for image-to-3D and LiDAR-to-2D mapping, attention neighbor grouping, single modal tokenizer, and micro-structure of Transformer. By amalgamating the most effective principles uncovered through our investigation, we introduce FlatFusion, a carefully designed framework for sparse camera-LiDAR fusion. Notably, FlatFusion significantly outperforms state-of-the-art sparse Transformer-based methods, including UniTR, CMT, and SparseFusion, achieving 73.7 NDS on the nuScenes validation set with 10.1 FPS with PyTorch.

Abstract:
Magnetic micro-robots have demonstrated immense potential in biomedical applications, such as in vivo drug delivery, non-invasive diagnostics, and cell-based therapies, owing to their precise maneuverability and small size. However, current micromanipulation techniques often rely solely on a two-dimensional (2D) microscopic view as sensory feedback, while traditional control interfaces do not provide an intuitive manner for operators to manipulate micro-robots. These limitations increase the cognitive load on operators, who must interpret limited feedback and translate it into effective control actions. To address these challenges, we propose a Deep Re-inforcement Learning-Based Semi-Autonomous Control (DRL-SC) framework for magnetic micro-robot navigation in a simulated microvascular system. Our framework integrates Mixed Reality (MR) to facilitate immersive manipulation of micro-robots, thereby enhancing situational awareness and control precision. Simulation and experimental results demonstrate that our approach significantly improves navigation efficiency, reduces control errors, and enhances the overall robustness of the system in simulated microvascular environments.

Abstract:
Service robots in human-centered environments such as hospitals, office buildings, and long-term care homes need to navigate while adhering to social norms to ensure the safety and comfortability of the people they are sharing the space with. Furthermore, they need to adapt to new social scenarios that can arise during robot navigation. In this paper, we present a novel Online Lifelong Vision Language architecture, OLiVia-Nav, which uniquely integrates vision-language models (VLMs) with an online lifelong learning framework for robot social navigation. We introduce a unique distillation approach, Social Context Contrastive Language Image Pre-training (SC-CLIP), to transfer the social reasoning capabilities of large VLMs to a lightweight VLM, in order for OLiVia-Nav to directly encode social and environment context during robot navigation. These encoded embeddings are used to generate and select robot social compliant trajectories. The lifelong learning capabilities of SC-CLIP enable OLiVia-Nav to update the robot trajectory planning overtime as new social scenarios are encountered. We conducted extensive real-world experiments in diverse social navigation scenarios. The results showed that OLiVia-Nav outperformed existing state-of-the-art DRL and VLM methods in terms of mean squared error, Hausdorff loss, and personal space violation duration. Ablation studies also verified the design choices for Ol.Jvia-Nav.

Abstract:
Open tracheostomy (OT) is considered the traditional way and golden standard for treating airway obstruction patients. However, OT has many unavoidable drawbacks, including strict performing scenarios, significant scarring, and the risk of surgeon infection. Percutaneous dilation tracheostomy (PDT) emerges, with advantages including a lower cost, smaller scarring, and better protection of surgeons from inflecting by aerosol. However, the outside-in puncture manner of PDT has a risk of piercing the post-tracheal wall and the esophagus with uncontrolled force. Additionally, locating tracheal rings and determining the puncture site externally can be challenging for certain patients, such as those who are obese or have undergone neck surgery, while this procedure typically relies on palpation and the surgeon's expertise. Hence, to improve the safety and simplicity of tracheostomy, a minimally-invasive endotracheal inside-out flexible needle-driving system towards microendoscope-guided robotic tracheostomy (MERT) has been proposed in this paper. Guided by an optical coherence tomography (OCT) probe and a microendoscope, the robot inserts into the trachea and performs an inside-out puncture using a flexible needle. The robot can work through a standard endotracheal tube (ETT), and the puncture direction of the flexible needle is variable. Kinematics and statics models of the flexible needle have been derived, and the minimum position errors generated in the kinematics and statics validation experiments are 0.57 \pm 0.21 \mathbf~ m m and 0.27 \pm 0.21 \mathbf~ m m. Finally, a porcine trachea puncture experiment is carried out, and the feasibility of the proposed system is verified.

Abstract:
Recently there has been growing interest in the use of aerial and satellite map data for autonomous vehicles, primarily due to its potential for significant cost reduction and enhanced scalability. Despite the advantages, aerial data also comes with challenges such as a sensor-modality gap and a viewpoint difference gap. Learned localization methods have shown promise for overcoming these challenges to provide precise metric localization for autonomous vehicles. Most learned localization methods rely on coarsely aligned ground truth, or implicit consistency-based methods to learn the localization task - however, in this paper we find that improving the alignment between aerial data and autonomous vehicle sensor data at training time is critical to the performance of a learning-based localization system. We compare two data alignment methods using a factor graph framework and, using these methods, we then evaluate the effects of closely aligned ground truth on learned localization accuracy through ablation studies. Finally, we evaluate a learned localization system using the data alignment methods on a comprehensive (1600km) autonomous vehicle dataset and demonstrate localization error below 0.3m and 0.5°sufficient for autonomous vehicle applications.

Abstract:
With the increasing popularity of autonomous driving based on the Bird's-Eye-View (BEV) representation, improving the generalization of such detection models is key for safe real-world applications. However, a realistic yet challenging scenario: Single Domain Generalization (SDG) for BEV, is still under-explored. A key ingredient for SDG is to increase data diversity via common image augmentation or adversarial data generation first. However, common image-level augmentation is not sufficient enough to ensure domain diversity in most part of latent space. The adversarial generation has the problem of unstable training or mode collapsing as well. To address these limitations, we present Tri-level Automatic Augmentation (Tri-AutoAug), a simple yet effective method to enlarge the diversity and quantity of data from image and 2D features and facilitate the model to learn more domain-invariant features in BEV space. Besides, Tri-AutoAug can automatically learn augmentation strategies to avoid spending too much time manually adjusting hyperparameters and maximize the benefit of Tri-level Augmentation. To the best of our knowledge, this is the first study to explore automatic augmentation for SDG BEV. Extensive experiments on NuScenes-C including eight testing domains have demonstrated that our approach can achieve the best performance across various domain generalization methods. More importantly, we evaluate the proposed method in real-world autonomous driving scenarios. Tri-AutoAug improves the out-of-distribution (ood) performance by 8.54% (mAP), which demonstrates that Tri-AutoAug provides a practical and feasible solution for the applications of 3D detectors in the real world. The code is available at https://github.com/ClaireTunlTri-AutoAug.

Abstract:
In recent times, Lyapunov theory has been in-corporated into learning-based control methods to provide a stability guarantee. However, merely satisfying the Lyapunov conditions does not fully leverage the capabilities of the Neural Network (NN) controller. Furthermore, training an effective Lyapunov candidate requires substantial data, which inherently results in sample inefficiency. To address these limitations, we propose an off-policy variant of the vanilla D-learning method that uses current and historical data to iteratively enhance the NN controller within the framework of Lyapunov theory. Our method outperforms the Deep Deterministic Policy Gradient (DDPG) and D-learning in terms of stability, sample efficiency, and the quality of the trained controllers and Lyapunov candidates. Link to code: github.com/Shenzhaolong1330/DOPT

Abstract:
Vehicle trajectory prediction (VTP) is essential for microscopic traffic risk assessment, autonomous vehicle navigation, and traffic behavior analysis. Related research leveraging learning-based methodologies has yielded notable success on various benchmark trajectory datasets. However, these models often experience performance degradation when faced with dynamic changes in traffic conditions such as vehicle density, road types, and weather conditions, as they have not been exposed to these variations during the training process. To effectively address the need for real-time adaptation in dynamic traffic scenarios, we propose a novel framework titled self-evolving spatial-temporal directed graph neural network (SE-STDGNN). This model utilizes evolving graph convolution networks (EvolveGCNs) to aggregate spatial-temporal features of vehicles and their neighbors, which are then utilized by a trajectory prediction module to forecast future trajectories. Further, a self-evolving mechanism is introduced to adjust model parameters dynamically in the real-time operation. The efficacy of SE-STDGNN is validated using the public vehicle trajectory dataset AD4CHE.

Abstract:
Most existing visual-inertial odometry (VIO) initialization methods rely on accurate pre-calibrated extrinsic parameters. However, during long-term use, irreversible structural deformation caused by temperature changes, mechanical squeezing, etc. will cause changes in extrinsic parameters, especially in the rotational part. Existing initialization methods that simultaneously estimate extrinsic parameters suffer from poor robustness, low precision, and long initialization latency due to the need for sufficient translational motion. To address these problems, we propose a novel VIO initialization method, which jointly considers extrinsic orientation and gyroscope bias within the normal epipolar constraints, achieving higher precision and better robustness without delayed rotational calibration. First, a rotation-only constraint is designed for extrinsic orientation and gyroscope bias estimation, which tightly couples gyroscope measurements and visual observations and can be solved in pure-rotation cases. Second, we propose a weighting strategy together with a failure detection strategy to enhance the precision and robustness of the estimator. Finally, we leverage Maximum A Posteriori to refine the results before enough translation parallax comes. Extensive experiments have demonstrated that our method outperforms the state-of-the-art methods in both accuracy and robustness while maintaining competitive efficiency.

Abstract:
In robotic navigation, maintaining precise pose estimation and navigation in complex and dynamic environments is crucial. However, environmental challenges such as smoke, tunnels, and adverse weather can significantly degrade the performance of single-sensor systems like LiDAR or GPS, compromising the overall stability and safety of autonomous robots. To address these challenges, we propose AF-RLIO: an adaptive fusion approach that integrates 4D millimeterwave radar, LiDAR, inertial measurement unit (IMU), and GPS to leverage the complementary strengths of these sensors for robust odometry estimation in complex environments. Our method consists of three key modules. Firstly, the pre-processing module utilizes radar data to assist LiDAR in removing dynamic points and determining when environmental conditions are degraded for LiDAR. Secondly, the dynamic-aware multimodal odometry selects appropriate point cloud data for scan-tomap matching and tightly couples it with the IMU using the Iterative Error State Kalman Filter. Lastly, the factor graph optimization module balances weights between odometry and GPS data, constructing a pose graph for optimization. The proposed approach has been evaluated on datasets and tested in real-world robotic environments, demonstrating its effectiveness and advantages over existing methods in challenging conditions such as smoke and tunnels. Furthermore, we open source our code at https://github.com/NeSC-IV/AF-RLIO.git to benefit the research community.

Abstract:
This work introduces a model-free reinforcement learning framework that enables various modes of motion (quadruped, tripod, or biped) and diverse tasks for legged robot locomotion. We employ a motion-style reward based on a relaxed logarithmic barrier function as a soft constraint, to bias the learning process toward the desired motion style, such as gait, foot clearance, joint position, or body height. The predefined gait cycle is encoded in a flexible manner, facilitating gait adjustments throughout the learning process. Extensive experiments demonstrate that KAIST HOUND, a 45 kg robotic system, can achieve biped, tripod, and quadruped locomotion using the proposed framework; quadrupedal capabilities include traversing uneven terrain, galloping at 4.67 m/s, and overcoming obstacles up to 58 cm (67 cm for HOUND2); bipedal capabilities include running at 3.6 m/s, carrying a 7.5 kg object, and ascending stairs-all performed without exteroceptive input.

Abstract:
We study behavior change-based visual risk object identification (Visual-ROI), a critical framework designed to detect potential hazards for intelligent driving systems. Existing methods often show significant limitations in spatial accuracy and temporal consistency, stemming from an incomplete understanding of scene affordance. For example, these methods frequently misidentify vehicles that do not impact the ego vehicle as risk objects. Furthermore, existing behavior change-based methods are inefficient because they implement causal inference in the perspective image space. We propose a new framework with a Bird's Eye View (BEV) representation to overcome the above challenges. Specifically, we utilize potential fields as scene affordance, involving repulsive forces derived from road infrastructure and traffic participants, along with attractive forces sourced from target destinations. In this work, we compute potential fields by assigning different energy levels according to the semantic labels obtained from BEV semantic segmentation. We conduct thorough experiments and ablation studies, comparing the proposed method with various state-of-the-art algorithms on both synthetic and real-world datasets. Our results show a notable increase in spatial accuracy and temporal consistency, with enhancements of 20.3% and 11.6% on the RiskBench dataset, respectively. Additionally, we can improve computational efficiency by 88%. We achieve improvements of 5.4% in spatial accuracy and 7.2% in temporal consistency on the nuScenes dataset. For more qualitative results, please visit our project webpage: project webpage.

Abstract:
Multi-frame temporal inputs are important for vision-based autonomous driving. Observations from different angles enable the recovery of 3 D object states from 2 D images as long as we can identify the same instance from different input frames. However, the dynamic nature of driving scenes leads to significant variance in the instance appearance and shape captured by the cameras at different time steps. To this end, we propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations robust to the changes of distance and perspective in a long-term temporal sequence without any human annotations. In the pretraining stage, raw point clouds from LiDAR sensors are utilized to construct the instance-wise long-term temporal correspondence, which serves as guidance for the extraction of instance-level representation from the vision-based bird's-eye-view (BEV) feature map. Cohere3D encourages consistent representation for the same instance at different frames but distinguishes between different instances. We validate the effectiveness and generalizability of our algorithm by finetuning the pretrained model across key downstream autonomous driving tasks: perception, mapping, prediction, and planning. Results show a notable improvement in both data efficiency and final performance in all these tasks.

Abstract:
We consider multi-robot systems under recurring tasks formalized as linear temporal logic (LTL) specifications. To solve the planning problem efficiently, we propose a bottomup approach combining offline plan synthesis with online coordination, dynamically adjusting plans via real-time communication. To address action delays, we introduce a synchronization mechanism ensuring coordinated task execution, leading to a multi-agent coordination and synchronization framework that is adaptable to a wide range of multi-robot applications. The software package is developed in Python and ROS2 for broad deployment. We validate our findings through lab experiments involving nine robots showing enhanced adaptability compared to previous methods. Additionally, we conduct simulations with up to ninety agents to demonstrate the reduced computational complexity and the scalability features of our work.

Abstract:
Language models (LMs) possess a strong capability to comprehend natural language, making them effective in translating human instructions into detailed plans for simple robot tasks. Nevertheless, it remains a significant challenge to handle long-horizon tasks, especially in subtask identification and allocation for cooperative heterogeneous robot teams. To address this issue, we propose a Language Model-Driven MultiAgent PDDL Planner (LaMMA-P), a novel multi-agent task planning framework that achieves state-of-the-art performance on long-horizon tasks. LaMMA-P integrates the strengths of the LMs' reasoning capability and the traditional heuristic search planner to achieve a high success rate and efficiency while demonstrating strong generalization across tasks. Additionally, we create MAT-THOR, a comprehensive benchmark that features household tasks with two different levels of complexity based on the AI2-THOR environment. The experimental results demonstrate that LaMMA-P achieves a 105% higher success rate and 36 % higher efficiency than existing LM-based multiagent planners. The experimental videos, code, datasets, and detailed prompts used in each module can be found on the project website: https://lamma-p.github.io.

Abstract:
This paper presents a Long Short-Term Memory network-based Fluid Experiment Data-Driven model (FEDLSTM) for predicting unsteady, nonlinear hydrodynamic forces on the underwater quadruped robot we constructed. Trained on experimental data from leg force and body drag tests conducted in both a recirculating water tank and a towing tank, FED-LSTM outperforms traditional Empirical Formulas (EF) commonly used for flow prediction over flat surfaces. The model demonstrates superior accuracy and adaptability in capturing complex fluid dynamics, particularly in straightline and turning-gait optimizations via the NSGA-II algorithm. FED-LSTM reduces deflection errors during straight-line swimming and improves turn times without increasing the turning radius. Hardware experiments further validate the model's precision and stability over EF. This approach provides a robust framework for enhancing the swimming performance of legged robots, laying the groundwork for future advances in underwater robotic locomotion.

Abstract:
As the complexity of tasks addressed through reinforcement learning (RL) increases, the definition of reward functions also has become highly complicated. We introduce an RL method aimed at simplifying the reward-shaping process through intuitive strategies. Initially, instead of a single reward function composed of various terms, we define multiple reward and cost functions within a constrained multi-objective RL (CMORL) framework. For tasks involving sequential complex movements, we segment the task into distinct stages and define multiple rewards and costs for each stage. Finally, we introduce a practical CMORL algorithm that maximizes objectives based on these rewards while satisfying constraints defined by the costs. The proposed method has been successfully demonstrated across a variety of acrobatic tasks in both simulation and real-world environments. Additionally, it has been shown to successfully perform tasks compared to existing RL and constrained RL algorithms. Our code is available at https://github.com/rllab-snu/Stage-Wise-CMORL.

Abstract:
The seafloor is a complex environment and it is challenging to conduct detailed mapping, soil composition sampling, and habitat characterization missions in this benthic region. As a step toward overcoming these challenges, we present a quadruped robot capable of walking on the seafloor and maneuvering via midfluid swimming. SELQIE, the Seafloor Environment Legged Quadruped Intelligent Explorer, is capable of walking underwater at speeds up to 0.2 ~\mathrmm / \mathrms, swimming at over 0.16 ~\mathrmm / \mathrms, and transitioning between modes. We also introduce a path planning algorithm that can account for both swimming and walking gaits to efficiently navigate around or over obstacles, and demonstrate the robot executing such a multi-modal trajectory.

Abstract:
This paper addresses the challenge of autonomous excavation of challenging terrains, in particular those that are prone to jamming and inter-particle adhesion when tackled by a standard penetrate-drag-scoop motion pattern. Inspired by human excavation strategies, our approach incorporates oscillatory rotation elements - including swivel, twist, and dive motions - to break up compacted, tangled grains and reduce jamming. We also present an adaptive impedance control method, the Reactive Attractor Impedance Controller (RAIC), that adapts a motion trajectory to unexpected forces during loading in a manner that tracks a trajectory closely when loads are low, but avoids excessive loads when significant resistance is met. Our method is evaluated on four terrains using a robotic arm, demonstrating improved excavation performance across multiple metrics, including volume scooped, protective stop rate, and trajectory completion percentage.

Abstract:
Transparency and explainability are important features that responsible autonomous vehicles should possess, particularly when interacting with humans, and causal reasoning offers a strong basis to provide these qualities. However, even if one assumes agents act to maximise some concept of reward, it is difficult to make accurate causal inferences of agent planning without capturing what is of importance to the agent. Thus our work aims to learn a weighting of reward metrics for agents such that explanations for agent interactions can be causally inferred. We validate our approach quantitatively and qualitatively across three real-world driving datasets, demonstrating a functional improvement over previous methods and competitive performance across evaluation metrics.

Abstract:
Accurate and efficient surgical robotic tool pose estimation is of fundamental significance to downstream applications such as augmented reality (AR) in surgical training and learning-based autonomous manipulation. While significant advancements have been made in pose estimation for humans and animals, it is still a challenge in surgical robotics due to the scarcity of published data. The relatively large absolute error of the da Vinci end effector kinematics and arduous calibration procedure make calibrated kinematics data collection expensive. Driven by this limitation, we collected a dataset, dubbed SurgPose, providing instance-aware semantic keypoints for visual surgical tool pose estimation and tracking. By marking keypoints using ultraviolet (UV) reactive paint, which is invisible under white light and fluorescent under UV light, we execute the same trajectory under different lighting conditions to collect raw videos and keypoint annotations, respectively. The SurgPose dataset consists of approximately 120 K surgical instrument instances of 6 categories as shown in Fig. 1. Since the videos are collected in stereo pairs, the 2D pose can be lifted to 3D based on stereo-matching depth. In addition to releasing the dataset, we tested a few baseline approaches to surgical instrument tracking to demonstrate the utility of SurgPose. More details can be found at surgpose.github.io.

Abstract:
Shape control of deformable objects under both rotational and translational deformations is important for versatile robotic applications. However, deformation control with full 6-degree-of-freedom (DoF) manipulation is an open problem, since modeling and describing rotational deformations lead to significant challenges. To tackle the problem, this paper proposes a novel method by introducing a co-rotated space for the modal graph representation of objects with unknown physical and geometric models. In this space, we design new deformation features that can encode local rotations while preserving a compact and low-frequency shape representation. Moreover, these features can be mapped analytically to the robot manipulation, enabling the design of adaptive control laws with guaranteed stability for unmodeled objects. Experiments on complex volumetric objects demonstrate the effectiveness and advantage of our method with raw, noisy, and unregistered point clouds. The results highlight the importance of integrating co-rotated features to address rotational deformations.

Abstract:
Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 55% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at: https://github.com/Imran2205/LogicRAG.

Abstract:
Understanding and reasoning about the 4D space-time is crucial for Vision-and-Language Navigation (VLN). However, previous works lack in-depth exploration in this aspect, resulting in bottlenecked spatial perception and action precision of VLN agents. In this work, we introduce NaVid-4D, a Vision Language Model (VLM) based navigation agent taking the lead in explicitly showcasing the capabilities of spatial intelligence in the real world. Given natural language instructions, NaVid-4D requires only egocentric RGB-D video streams as observations to perform spatial understanding and reasoning for generating precise instruction-following robotic actions. NaVid-4D learns navigation policies using the data from simulation environments and is endowed with precise spatial understanding and reasoning capabilities using web data. Without the need to pre-train an RGB-D foundation model, we propose a method capable of directly injecting the depth features into the visual encoder of a VLM. We further compare the use of factually captured depth information with the monocularly estimated one and find NaVid-4D works well with both while using estimated depth offers greater gener-alization capability and better mitigates the sim-to-real gap. Extensive experiments demonstrate that NaVid-4D achieves state-of-the-art performance in simulation environment and makes impressive VLN performance with spatial intelligence happen in the real world.

Abstract:
This paper presents TSPDiffuser, a novel data-driven path planner for traveling salesperson path planning problems (TSPPPs) in environments rich with obstacles. Given a set of destinations within obstacle maps, our objective is to efficiently find the shortest possible collision-free path that visits all the destinations. In TSPDiffuser, we train a diffusion model on a large collection of TSPPP instances and their respective solutions to generate plausible paths for unseen problem instances. The model can then be employed as a learned sampler to construct a roadmap that contains potential solutions with a small number of nodes and edges. This approach enables efficient and accurate estimation of travel costs between destinations, effectively addressing the primary computational challenge in solving TSPPPs. Experimental evaluations with diverse synthetic and real-world indoor/outdoor environments demonstrate the effectiveness of TSPDiffuser over existing methods in terms of the trade-off between solution quality and computational time requirements.

Abstract:
Diffusion Policy is a powerful technique tool for learning end-to-end visuomotor robot control. It is expected that Diffusion Policy possesses scalability, a key attribute for deep neural networks, typically suggesting that increasing model size would lead to enhanced performance. However, our observations indicate that Diffusion Policy in transformer architecture (DP-T) struggles to scale effectively; even minor additions of layers can deteriorate training outcomes. To address this issue, we introduce Scalable Diffusion Transformer Policy for visuomotor learning. Our proposed method, namely ScaleDP, introduces two modules that improve the training dynamic of Diffusion Policy and allow the network to better handle multimodal action distribution. First, we identify that DPT suffers from large gradient issues, making the optimization of Diffusion Policy unstable. To resolve this issue, we factorize the feature embedding of observation into multiple affine layers, and integrate it into the transformer blocks. Additionally, our utilize non-causal attention which allows the policy network to “see” future actions during prediction, helping to reduce compounding errors. We demonstrate that our proposed method successfully scales the Diffusion Policy from 10 million to 1 billion parameters. This new model, named ScaleDP, can effectively scale up the model size with improved performance and generalization. We benchmark ScaleDP across 50 different tasks from MetaWorld and find that our largest ScaleDP outperforms DP-T with an average improvement of 21.6%. Across 7 real-world robot tasks, our ScaleDP demonstrates an average improvement of 36. 25% over DP-T on four single-arm tasks and 75% on three bimanual tasks. We believe our work paves the way for scaling up models for visuomotor learning. The project page is available at https://scaling-diffusion-policy.github.io/.

Abstract:
We introduce a learning-guided motion planning framework that generates seed trajectories using a diffusion model for trajectory optimization. Given a workspace, our method approximates the configuration space (C-space) obstacles through an environment representation consisting of a sparse set of task-related key configurations, which is then used as a conditioning input to the diffusion model. The diffusion model integrates regularization terms that encourage smooth, collision-free trajectories during training, and trajectory optimization refines the generated seed trajectories to correct any colliding segments. Our experimental results demonstrate that high-quality trajectory priors, learned through our C-space-grounded diffusion model, enable the efficient generation of collision-free trajectories in narrow-passage environments, outperforming previous learning- and planning-based baselines. Videos and additional materials can be found on the project page: https://kiwi-sherbet.github.io/PRESTO.

Abstract:
In this paper, a learning based Model Predictive Control (MPC) using a low dimensional residual model is proposed for autonomous driving. One of the critical challenge in autonomous driving is the complexity of vehicle dynamics, which impedes the formulation of accurate vehicle model. Inaccurate vehicle model can significantly impact the performance of MPC controller. To address this issue, this paper decomposes the nominal vehicle model into invariable and variable elements. The accuracy of invariable elements are ensured by calibration, while the deviations in the variable elements are learned by a low-dimensional residual model. The features of residual model are selected as the physical variables most correlated with nominal model errors. Physical constraints among these features are formulated to explicitly define the valid region within the feature space. The formulated model and constraints are incorporated into the MPC framework and validated through both simulation and real vehicle experiments. The results indicate that the proposed method significantly enhances the model accuracy and controller performance.

Abstract:
Point cloud registration is a critical problem in computer vision and robotics, especially in the field of navigation. Current methods often fail when faced with high outlier rates or take a long time to converge to a suitable solution. In this work, we introduce a novel algorithm for point cloud registration called SANDRO11https://github.com/iit-DLSLab/SANDRO (Splitting strategy for point cloud Alignment using Non-convex anD Robust Optimization), which combines an Iteratively Reweighted Least Squares (IRLS) framework with a robust loss function with graduated non-convexity. This approach is further enhanced by a splitting strategy designed to handle high outlier rates and skewed distributions of outliers. SANDRO is capable of addressing important limitations of existing methods, as in challenging scenarios where the presence of high outlier rates and point cloud symmetries significantly hinder convergence. SANDRO achieves superior performance in terms of success rate when compared to the state-of-the-art methods, demonstrating a 20% improvement from the current state of the art when tested on the Redwood real dataset and 60% improvement when tested on synthetic data.

Abstract:
Semantic segmentation is a key technique that enables mobile robots to understand and navigate surrounding environments autonomously. However, most existing works focus on segmenting known objects, overlooking the identification of unknown classes, which is common in real-world applications. In this paper, we propose a feature-oriented framework for open-set semantic segmentation on LiDAR data, capable of identifying unknown objects while retaining the ability to classify known ones. We design a decomposed dual-decoder network to simultaneously perform closed-set semantic segmentation and generate distinctive features for unknown objects. The network is trained with multi-objective loss functions to capture the characteristics of known and unknown objects. Using the extracted features, we introduce an anomaly detection mechanism to identify unknown objects. By integrating the results of close-set semantic segmentation and anomaly detection, we achieve effective feature-driven LiDAR open-set semantic segmentation. Evaluations on both SemanticKITTI and nuScenes datasets demonstrate that our proposed framework significantly outperforms state-of-the-art methods. The source code will be made publicly available at https://github.com/nubot-nudt/DOSS.

Abstract:
In this paper, we perform robot manipulation activities in real-world environments with language contexts by integrating a compact referring image segmentation model into the robot's perception module. First, we propose CLIPU2Net, a lightweight referring image segmentation model designed for fine-grain boundary and structure segmentation from language expressions. Then, we deploy the model in an eye-in-hand visual servoing system to enact robot control in the real world. The key to our system is the representation of salient visual information as geometric constraints, linking the robot's visual perception to actionable commands. Experimental results on 46 real-world robot manipulation tasks demonstrate that our method outperforms traditional visual servoing methods relying on labor-intensive feature annotations, excels in fine-grain referring image segmentation with a compact decoder size of 6.6 MB, and supports robot control across diverse contexts.

Abstract:
Learning diverse skills for quadruped robots presents significant challenges, such as mastering complex transitions between different skills and handling tasks of varying difficulty. Existing imitation learning methods, while successful, rely on expensive datasets to reproduce expert behaviors. Inspired by introspective learning, we propose Progressive Adversarial Self-Imitation Skill Transition (PASIST), a novel method that eliminates the need for complete expert datasets. PASIST autonomously explores and selects high-quality trajectories based on predefined target poses instead of demonstrations, leveraging the Generative Adversarial Self-Imitation Learning (GASIL) framework. To further enhance learning, We develop a skill selection module to mitigate mode collapse by balancing the weights of skills with varying levels of difficulty. Through these methods, PASIST is able to reproduce skills corresponding to the target pose while achieving smooth and natural transitions between them. Evaluations on both simulation platforms and the Solo 8 robot confirm the effectiveness of PASIST, offering an efficient alternative to expert-driven learning.

Abstract:
Engagement is a key concept in Human-Robot Interaction (HRI), as high engagement often leads to improved user experience and task performance. However, accurately estimating engagement during interactions is challenging. In this study, we propose a Dynamic Bayesian Network (DBN) to infer user engagement from various modalities, including head rotation, eye movements, facial expressions captured through visual sensors, as well as facial temperature variations measured by a thermal camera. Data was gathered from a human-robot interaction (HRI) experiment, where a robot guided participants and encouraged them to share their thoughts and insights on environmental issues. Our approach successfully combines these diverse features to offer a thorough assessment of user engagement. The network was tested on its capacity to classify participants as either engaged or not engaged, achieving an accuracy of 0.83 and an Area Under the Curve (AUC) of 0.82. These findings underscore the strength of our DBN in detecting user engagement during interactions.

Abstract:
This paper presents experiments for embedded cooperative distributed model predictive control applied to a team of hovercraft floating on an air hockey table. The hovercraft collectively solve a centralized optimal control problem in each sampling step via a stabilizing decentralized real-time iteration scheme using the alternating direction method of multipliers. The efficient implementation does not require a central coordinator, executes onboard the hovercraft, and facilitates sampling intervals in the millisecond range. The formation control experiments showcase the flexibility of the approach on scenarios with point-to-point transitions, trajectory tracking, collision avoidance, and moving obstacles.

Abstract:
In multi-robot exploration, a team of mobile robot is tasked with efficiently mapping an unknown environments. While most exploration planners assume omnidirectional sensors like LiDAR, this is impractical for small robots such as drones, where lightweight, directional sensors like cameras may be the only option due to payload constraints. These sensors have a constrained field-of-view (FoV), which adds complexity to the exploration problem, requiring not only optimal robot positioning but also sensor orientation during movement. In this work, we propose MARVEL, a neural framework that leverages graph attention networks, together with novel frontiers and orientation features fusion technique, to develop a collaborative, decentralized policy using multi-agent reinforcement learning (MARL) for robots with constrained FoV. To handle the large action space of viewpoints planning, we further introduce a novel information-driven action pruning strategy. MARVEL improves multi-robot coordination and decision-making in challenging large-scale indoor environments, while adapting to various team sizes and sensor configurations (i.e., FoV and sensor range) without additional training. Our extensive evaluation shows that MARVEL's learned policies exhibit effective coordinated behaviors, outperforming state-of-the-art exploration planners across multiple metrics. We experimentally demonstrate MARVEL's generalizability in large-scale environments, of up to 90 m by 90 m, and validate its practical applicability through successful deployment on a team of real drone hardware.

Abstract:
Collaborative multi-agent exploration of unknown environments is crucial for search and rescue operations. Effective real-world deployment must address challenges such as limited inter-agent communication and static and dynamic obstacles. This paper introduces a novel decentralized collaborative framework based on Reinforcement Learning to enhance multi-agent exploration in unknown environments. Our approach enables agents to decide their next action using an agent-centered field-of-view occupancy grid, and features extracted from A algorithm-based trajectories to frontiers in the reconstructed global map. Furthermore, we propose a constrained communication scheme that enables agents to share their environmental knowledge efficiently, minimizing exploration redundancy. The decentralized nature of our framework ensures that each agent operates autonomously, while contributing to a collective exploration mission. Extensive simulations in Gymnasium and real-world experiments demonstrate the robustness and effectiveness of our system, while all the results highlight the benefits of combining autonomous exploration with inter-agent map sharing, advancing the development of scalable and resilient robotic exploration systems.

Abstract:
Human-robot physical interaction (pHRI) is a rapidly evolving research field with significant implications for physical therapy, search and rescue, and telemedicine. However, a major challenge lies in accurately understanding human constraints and safety in human-robot physical experiments without an IRB and physical human experiments. Concerns regarding human studies include safety concerns, repeatabil-ity, scalability of the number, and diversity of participants. This paper examines whether a physical approximation can serve as a stand-in for human subjects to enhance robot autonomy for physical assistance. This paper introduces the SHULDRD (Shoulder Haptic Universal Limb Dynamic Repositioning Device), an economical and anatomically similar device designed for real-time testing and deploying pHRI planning tasks on robots in the real world. SHULDRD replicates human shoulder motion, providing crucial force feedback and safety data. The device's open-source CAD and software facilitate easy construction and use, ensuring broad accessibility for researchers. By providing a flexible platform able to emulate infinite human subjects, ensure repeatable trials, and provide quantitative metrics to assess the effectiveness of the robotic intervention, SHULDRD aims to improve the safety and efficacy of human-robot physical interactions. Project URL: https://sites.google.com/view/haptic-shoulder/home

Abstract:
This paper addresses the challenge of Velocity Vector Control (VVC) for fixed-wing UAVs using Reinforcement Learning (RL) in the presence of imperfect demonstrations. The multi-objective and long-horizon nature of VVC introduces significant spatial and temporal complexities, complicating RL's exploration. While demonstration-based RL methods can help mitigate exploration challenges, their effectiveness is often limited by the quality of the provided demonstrations. To tackle this, we propose V-Pilot, a novel approach that integrates: (1) a controller equipped with a control law model to reduce action oscillation, thus alleviating temporal exploration issues, and (2) a VVC-specific training workflow for iterative policy refinement and demonstration quality improvement. This framework is designed to enhance the performance of demonstration-based RL under imperfect demonstrations. We evaluate V-Pilot on the fixed-wing UAV RL environment, VVCGym. Experimental results demonstrate that V-Pilot outperforms PID and Behavioral Cloning across multiple performance metrics.

Abstract:
The flexibility and dexterity of cable-driven continuum robots (CDCRs) make them well suited for intricate tasks such as minimally invasive surgery. However, the complexity of accurately modeling their dynamics has limited their broader adoption and effective control. Current models either oversimplify the dynamics by assuming quasi-static conditions or over complicate them, making real-time application challenging. Additionally, many existing models neglect the critical coupling between the robot's body and actuator dynamics, a factor essential for accurate control. In this paper, we propose a new minimal dynamics model for CDCRs that strikes a balance between simplicity and accuracy. Our model captures the essential dynamics of both the robot and its actuators, providing a practical tool for control design. We also establish connections between our model and those used for other robotic systems, enabling the transfer of well-established control strategies to CDCRs. The model is validated through hardware experiments, demonstrating its ability to effectively address complex control challenges in CDCR applications.

Abstract:
The Dynamic Vision Sensor (DVS) is a distinctive visual sensor that exclusively responds to alterations in pixel brightness, enabling the real-time capture of swift and subtle movements with reduced power consumption and data bandwidth requirements. This paper proposes a DVS-aware visual perception method and presents its application for pose estimation of mobile robots. Specifically, a new marker is designed to provide pose reference data that leverages the inherent advantages of DVS more effectively. Moreover, we formulate a pose recognition system incorporating DVS, an algorithm based on Spiking Convolutional Neural Networks (SCNN) and a neuromorphic computing accelerator (Lynxi HS110). Such a formulation can well explore the DVS's advantages, as its event-triggered feature matches the nature of SCNN while the neuromorphic hardware enables efficient, low-power execution, making the system highly suitable for real-time embedded applications. Comparative analysis with traditional ARcode-based pose recognition methods reveals that our innovative approach demonstrates significant advantages in recognition speed and energy efficiency. The whole system is deployed on mobile robots and evaluated in real-world scenarios.

Abstract:
SAFER-Splat (Simultaneous Action Filtering and Environment Reconstruction) is a real-time, scalable, and minimally invasive safety filter, based on control barrier functions, for safe robotic navigation in a detailed map constructed at runtime using Gaussian Splatting (GSplat). We propose a novel Control Barrier Function (CBF) that not only induces safety with respect to all Gaussian primitives in the scene, but when synthesized into a controller, is capable of processing hundreds of thousands of Gaussians while maintaining a minimal memory footprint and operating at 15 Hz during online Splat training. Of the total compute time, a small fraction of it consumes GPU resources, enabling uninterrupted training. The safety layer is minimally invasive, correcting robot actions only when they are unsafe. To showcase the safety filter, we also introduce SplatBridge, an open-source software package built with ROS for real-time GSplat mapping for robots. We demonstrate the safety and robustness of our pipeline first in simulation, where our method is 20-50x faster, safer, and less conservative than competing methods based on neural radiance fields. Further, we demonstrate simultaneous GSplat mapping and safety filtering on a drone hardware platform using only on-board perception. We verify that under teleoperation a human pilot cannot invoke a collision. Our videos and codebase can be found at https://chengine.github.io/safer-splat.

Affiliations: Human-Robot Interaction Lab., Technical Faculty of IT and Design, Aalborg University, Denmark; Center for Lightweight Production Technology, German Aerospace Center (DLR), Augsburg, Germany; Department of Information Engineering, IT+Robotics s.r.l and the Intelligent Autonomous System Lab, University of Padua, Padua, Italy; Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing, CNR-STIIMA, Milan, Italy; Smart Production Lab., Faculty of Engineering and Natural Sciences, Aalborg University, Denmark

Abstract:
In the Industry 5.0 era, the focus shifts from basic automation to fostering collaboration between humans and robots. Trust is crucial in this new paradigm, enabling smooth interaction, especially for users with limited robotics knowledge. This study presents a novel framework that uses human hand gestures and voice commands to control robot movements, aiming to enhance trust, reduce cognitive workload, and minimize task execution time-key for efficient manufacturing. In automated systems, swift completion of micromanagement tasks is essential to prevent process disruption. To evaluate this framework, we devised a testbed scenario within an automated carbon fiber transportation and draping process, focusing on a maintenance task as the micromanagement challenge. Participants inspected the gripper, guided the robot along a defined path, and performed maintenance, such as attaching cables. Two conditions were tested: gestures and voice commands versus a smartPAD. The results showed that gestures and voice commands increased trust, lowered cognitive load, and shortened execution times, improving overall manufacturing efficiency.

Abstract:
Large Language Models (LLMs) have gained popularity in task planning for long-horizon manipulation tasks. To enhance the validity of LLM-generated plans, visual demonstrations and online videos have been widely employed to guide the planning process. However, for manipulation tasks involving subtle movements but rich contact interactions, visual perception alone may be insufficient for the LLM to fully interpret the demonstration. Additionally, visual data provides limited information on force-related parameters and conditions, which are crucial for effective execution on real robots. In this paper, we introduce LEMMo-Plan, an in-context learning framework that incorporates tactile and force-torque information from human demonstrations to enhance LLMs' ability to generate plans for new task scenarios. We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan. This task plan is then used as a reference for planning in new task configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of our framework in improving LLMs' understanding of multi-modal demonstrations and enhancing the overall planning performance. More materials are available on our project website: lemmo-plan.github.io/LEMMo-Plan/.

Abstract:
We encounter large-scale environments where both structured and unstructured spaces coexist, such as on campuses. In this environment, lighting conditions and dynamic objects change constantly. To tackle the challenges of large-scale mapping under such conditions, we introduce DiTer++, a diverse terrain and multi-modal dataset designed for multi-robot SLAM in multi-session environments. According to our datasets' scenarios, Agent-A and Agent-B scan the area designated for efficient large-scale mapping day and night, respectively. Also, we utilize legged robots for terrain-agnostic traversing. To generate the ground-truth of each robot, we first build the survey-grade prior map. Then, we remove the dynamic objects and outliers from the prior map and extract the trajectory through scan-to-map matching. Our dataset and supplement materials are available at https://github.com/sparolab/DiTer-plusplus/.

Abstract:
Despite having fewer neurons than more complex life forms, insects are still capable of producing astonishing locomotive behaviors, such as traversing diverse environments and making rapid gait adaptations after extreme injury or autotomy. Biologists attribute this to a chain of segmental neuron clusters (ganglia) within insect nervous systems, which act as distributed self-organizing sensorimotor control units. Inspired by the neural structure of the Carausius morosus, the common stick insect, this work introduces the Distributed Neural Locomotion Controller (D-NLC), a modular control framework that utilizes local proprioceptive feedback to modulate joint-level Central Pattern Generator (CPG) signals to produce emergent locomotive behaviors. This framework was implemented on a modular legged robot with distributed jointlevel embedded computing units. In addition, assessments were conducted on the framework's performance and behavior in various experimental settings. Based on real-world experiments, we observe an overall 31.3% average increase in curvilinear motion performance under external (terrain) and internal (amputation) perturbation compared to a centralized predefined gait controller. This difference is statistically significant (P \ll 0.05) for larger perturbations but not for single-leg amputations. Experiments with perturbation-induced leg stance duration and leg phase-difference analysis further validated our hypothesis regarding D-NLC's role in the robust perceptive locomotion and self-emergent gait adaptation against complex unforeseen perturbations. This proposed control framework does not require any numerical optimization or weight training processes, which are time-consuming and computationally expensive. To the best of our knowledge, this framework is the first bio-inspired neural controller deployed on a distributed embedded system. More info at https://eigenbot-dnlc.github.io.

Abstract:
This paper presents AquaMILR+, an untethered limbless robot designed for agile navigation in complex aquatic environments. The robot features a bilateral actuation mechanism that models musculoskeletal actuation in many anguilliform swimming organisms which propagates a moving wave from head to tail allowing open fluid undulatory swimming. This actuation mechanism employs mechanical intelligence through programmable body compliance, enhancing the robot's open-loop maneuverability when interacting with obstacles. AquaMILR+ also includes a compact depth control system inspired by the swim bladder and lung structures of eels and sea snakes. The mechanism, driven by a syringe and telescoping leadscrew, enables depth and pitch control - capabilities that are difficult for most anguilliform swimming robots to achieve. Additional structures, such as fins and a tail, further improve stability and propulsion efficiency. Our tests in both open water and laboratory models of 2D and 3D heterogeneous aquatic environments highlight AquaMILR+'s capabilities and suggest a promising system for complex underwater tasks such as search and rescue and deep-sea exploration.

Abstract:
Current humanoid push-recovery strategies often use whole-body motion, yet they tend to overlook posture regulation. For instance, in manipulation tasks, the upper body may need to stay upright and have minimal recovery displacement. This paper introduces a novel approach to enhancing humanoid push-recovery performance under unknown disturbances and regulating body posture by tailoring the recovery stepping strategy. We propose a hierarchical-MPC-based scheme that analyzes and detects instability in the prediction window and quickly recovers through adapting gait frequency. Our approach integrates a high-level nonlinear MPC, a posture-aware gait frequency adaptation planner, and a low-level convex locomotion MPC. The planners predict the center of mass (CoM) state trajectories that can be assessed for precursors of potential instability and posture deviation. In simulation, we demonstrate improved maximum recoverable impulse by 131 % on average compared with baseline approaches. In hardware experiments, a 125 ms advancement in recovery stepping timing/reflex has been observed with the proposed approach. We also demonstrate improved push-recovery performance and minimized body attitude change under 0.2 rad.

Abstract:
This work hypothesizes that a social robot that uses reinforcement learning can effectively adapt to individual differences in teaching imitation skills (e.g., facial expressions) to children with autism spectrum disorder. We developed an active learning method based on reinforcement learning to personalize human-robot interaction sessions based on each child's imitation performance and preference. We evaluated this method with five children with autism spectrum disorder, and the results demonstrated varying responses to different methods of presenting facial expressions to teach imitation skills. We found that the robot consistently promoted increased shared attention, including visual contact and physical proximity during imitation tasks. This suggests that adaptive human-robot interactions can cater to the unique needs of children with autism, offering a promising avenue for personalized intervention. Additionally, we discuss observed qualitative insights from our study and considerations for robot behavior mitigation strategies to sustain engagement.

Abstract:
Adapting quickly to dynamic, uncertain environments—often called “open worlds” —remains a major challenge in robotics. Traditional Task and Motion Planning (TAMP) approaches struggle to cope with unforeseen changes, are data-inefficient when adapting, and do not leverage world models during learning. We address this issue with a hybrid planning and learning system that integrates two models: a low-level neural network-based model that learns stochastic transitions and drives exploration via an Intrinsic Curiosity Module (ICM), and a high-level symbolic planning model that captures abstract transitions using operators, enabling the agent to plan in an “imaginary” space and generate reward machines. Our evaluation in a robotic manipulation domain with sequential novelty injections demonstrates that our approach converges faster and outperforms state-of-the-art hybrid methods.

Abstract:
This paper develops a control strategy for pursuit-evasion problems in environments with occlusions. We address the challenge of a mobile pursuer keeping a mobile evader within its field of view (FoV) despite line-of-sight obstructions. The signed distance function (SDF) of the FoV is used to formulate visibility as a control barrier function (CBF) constraint on the pursuer's control inputs. Similarly, obstacle avoidance is formulated as a CBF constraint based on the SDF of the obstacle set. While the visibility and safety CBFs are Lipschitz continuous, they are not differentiable everywhere, necessitating the use of generalized gradients. To achieve non-myopic pursuit, we generate reference control trajectories leading to evader visibility using a sampling-based kinodynamic planner. The pursuer then tracks this reference via convex optimization under the CBF constraints. We validate our approach in CARLA simulations and real-world robot experiments, demonstrating successful visibility maintenance using only onboard sensing, even under severe occlusions and dynamic evader movements.

Abstract:
Immediate removal of hazardous gases is critical for ensuring safety. Traditional methods, such as portable ventilation equipment, are difficult to use when hazardous gases are released in inaccessible environments. In this paper, we propose a novel mechanism that integrates an inflatable helical structure into a soft growing robot. The proposed mechanism is capable of performing suction through its inner channel after navigating complex environments, while maintaining the inherent advantages of the soft growing robot as it grows. The mechanism operates in two phases: a growing phase, in which the robot extends by eversion, and a suction phase, in which suction is performed through the inner channel of the robot. Experiments and demonstrations were conducted to evaluate the performance of the proposed mechanism. The experimental results confirmed the ability to maintain the passageway shape of the inner channel during suction operations and provided a design guideline. The demonstration validated that the mechanism can effectively navigate inaccessible environments and perform suction to remove hazardous gases.

Abstract:
Exploration is a critical challenge in robotics, centered on understanding unknown environments. In this work, we focus on structured indoor environments, which often exhibit predictable, repeating patterns. Conventional frontier-based exploration approaches have difficulty leveraging this predictability, relying on simple heuristics such as ‘closest first’ for exploration. More recent deep learning-based methods predict unknown regions of the map for information gain computation, but these approaches are often sensitive to the predicted map quality or fail to account for sensor coverage. To overcome these issues, our key insight is to jointly reason over what the robot can observe and its uncertainty to calculate probabilistic information gain. We introduce MapEx, a new exploration framework that uses predicted maps to form probabilistic sensor model for information gain estimation. MapEx generates multiple predicted maps based on observed information, and takes into consideration both the computed variances of predicted maps and estimated visible area to estimate the information gain of a given viewpoint. Experiments on the real-world KTH dataset showed on average 12.4% improvement than representative map-prediction based exploration and 25.4% improvement than nearest frontier approach. Website: https://mapex-explorer.github.io/

Abstract:
In this paper, we introduce a novel method for safe navigation in agricultural robotics. As global environmental challenges intensify, robotics offers a powerful solution to reduce chemical usage while meeting the increasing demands for food production. However, significant challenges remain in ensuring the autonomy and resilience of robots operating in unstructured agricultural environments. Obstacles such as crops and tall grass, which are deformable, must be identified as safely traversable, compared to rigid obstacles. To address this, we propose a new traversability analysis method based on a 3D spectral map reconstructed using a LIDAR and a multispectral camera. This approach enables the robot to distinguish between safe and unsafe collisions with deformable obstacles. We perform a comprehensive evaluation of multispectral metrics for vegetation detection and incorporate these metrics into an augmented environmental map. Utilizing this map, we compute a physics-based traversability metric that accounts for the robot's weight and size, ensuring safe navigation over deformable obstacles.

Abstract:
As modern agriculture progresses, the swift deployment of accurate maps becomes essential for the autonomous navigation and operation of orchard robots. Traditional mapping techniques often fall short in addressing the challenges posed by orchards, which are characterized by unstructured, dynamically changing environments with complex spatial and temporal dynamics due to seasonal and continuous operations. This paper proposes a new approach to orchard map construction that merges topological maps with semantic SLAM. This integration enables the creation, optimization, and rapid deployment of maps that are not only lightweight and robust but also precise. To evaluate the effectiveness of our method, we performed navigation tests in orchard environments using the newly developed maps. The experimental outcomes demonstrated a significant reduction in CPU usage, with maximum and average reductions of 7.6% and 4.5%, respectively. This approach not only enhances navigation efficiency but also facilitates quicker map deployment, effectively freeing computational resources for other critical tasks.

Abstract:
We focus on push-based multi-object rearrangement planning using a nonholonomically constrained mobile robot. The simultaneous geometric, kinematic, and physics constraints make this problem especially challenging. Prior work on rearrangement planning often relaxes some of these constraints by assuming dexterous hardware, prehensile manipulation, or sparsely occupied workspaces. Our key insight is that by capturing these constraints into a unified representation, we could empower a constrained robot to tackle difficult problem instances by modifying the environment in its favor. To this end, we introduce a push-traversability graph, whose vertices represent poses that the robot can push objects from, and edges represent optimal, kinematically feasible, and stable transitions between them. Based on this graph, we develop ReloPush, a graph-based planning framework that takes as input a complex multi-object rearrangement task and breaks it down into a sequence of single-object pushing tasks. We evaluate ReloPush across a series of challenging scenarios, involving the rearrangement of densely cluttered workspaces with up to nine objects, using a 1/10-scale robot racecar. ReloPush exhibits orders of magnitude faster runtimes and significantly more robust execution in the real world, evidenced in lower execution times and fewer losses of object contact, compared to two baselines lacking our proposed graph structure.

Abstract:
To operate at a building scale, service robots must perform long-horizon mobile manipulation tasks by navigating to different rooms, accessing multiple floors, and interacting with a wide and unseen range of everyday objects. We refer to these tasks as Building-wide Mobile Manipulation. To tackle these inherently long-horizon tasks, we introduce BUMBLE, a unified Vision-Language Model (VLM)-based framework integrating open-world RGB-D perception, a wide spectrum of gross-to-fine motor skills, and dual-layered memory. Our extensive evaluation (90+ hours) indicates that BUMBLE outperforms competitive baselines in long-horizon building-wide tasks that require sequencing up to 12 skills, spanning 15 minutes per trial. BUMBLE achieves 47.1% success rate averaged over 70 trials in different buildings, tasks, and scene layouts from various starting locations. Our user study shows 22% higher task satisfaction using our framework compared to state-of-the-art VLM-based mobile manipulation methods. Finally, we show the potential of using increasingly capable foundation models to improve the system performance further. For more information, see https://robin-lab.cs.utexas.edu/BUMBLE/

Abstract:
Non-prehensile pushing to move and reorient objects to a goal is a versatile loco-manipulation skill. In the real world, the object's physical properties and friction with the floor contain significant uncertainties, which makes the task challenging for a mobile manipulator. In this paper, we develop a learning-based controller for a mobile manipulator to move an unknown object to a desired position and yaw orientation through a sequence of pushing actions. The proposed controller for the robotic arm and the mobile base motion is trained using a constrained Reinforcement Learning (RL) formulation. We demonstrate its capability in experiments with a quadrupedal robot equipped with an arm. The learned policy achieves a success rate of 91.35% in simulation and at least 80% on hardware in challenging scenarios. Through our extensive hardware experiments, we show that the approach demonstrates high robustness against unknown objects of different masses, materials, sizes, and shapes. It reactively discovers the pushing location and direction, thus achieving contact-rich behavior while observing only the pose of the object. Additionally, we demonstrate the adaptive behavior of the learned policy towards preventing the object from toppling.

Abstract:
In contrast with the diversity of materials found in nature, most robots are designed with some combination of aluminum, stainless steel, and 3D-printed filament. Additionally, robotic systems are typically assumed to follow basic rigid-body dynamics. However, several examples in nature illustrate how changes in physical material properties yield functional advantages. In this paper, we explore how physical materials (non-rigid bodies) affect the functional performance of a hopping robot. In doing so, we address the practical question of how to model and simulate material properties. Through these simulations we demonstrate that material gradients in the leg of a single-limb hopper provide functional advantages compared to homogeneous designs. For example, when considering incline ramp hopping, a material gradient with increasing density provides a 35% reduction in tracking error and a 23% reduction in power consumption compared to homogeneous stainless steel. By providing bio-inspiration to the rigid limbs in a robotic system, we seek to show that future fabrication of robots should look to leverage the material anisotropies of moduli and density found in nature. This would allow for reduced vibrations in the system and would provide offsets of joint torques and vibrations while protecting their structural integrity against reduced fatigue and wear. This simulation system could inspire future intelligent material gradients of custom-fabricated robotic locomotive devices.

Abstract:
Sandy environments present challenges for robotic space rovers and systems due to reduced traction, limiting mobility and tugging force. This paper presents an anchoring method that utilizes a winching system to create a sand mound in front of a mobile agent dragged through the media. The proposed controller is designed to consistently achieve realtime capture of close-to-maximal lateral sand mound resistive force, even when applied to varied uneven terrains, like holes or waves. Notably, tugging is non-reversible, so suitable peaks should be captured before breakdown and without necessarily knowing the global optimum a priori. The controller logic tracks both tugging force and agent pitch gradients to detect terrain conditions and peak force trends. Results show that the controller captures an average 92 % of the maximum forces, within the previously winched workspace tested, across three different granular media with four varying structured terrain features. The controller achieves higher resistive force peaks on terrains with geometric features, as opposed to flat sand. We conclude that sand mounding through tugging is a viable means to generate robotic resistive forces for unknown sandy terrains, a simple yet effective anchoring mechanism.

Abstract:
Autonomous and targeted underwater visual monitoring and exploration using Autonomous Underwater Vehicles (AUVs) can be a challenging task due to both online and offline constraints. The online constraints comprise limited onboard storage capacity and communication bandwidth to the surface, whereas the offline constraints entail the time and effort required for the selection of desired keyframes from the video data. An example use case of targeted underwater visual monitoring is finding the most interesting visual frames of fish in a long sequence of an AUV's visual experience. This challenge of targeted informative sampling is further aggravated in murky waters with poor visibility. In this paper, we present MERLION, a novel framework that provides semantically aligned and visually enhanced summaries for murky underwater marine environment monitoring and exploration. Specifically, our framework integrates (a) an image-text model for semantically aligning the visual samples to the user's needs, (b) an image enhancement model for murky water visual data and (c) an informative sampler for summarizing the monitoring experience. We validate our proposed MERLION framework on real-world data with user studies and present qualitative and quantitative results using our evaluation metric and show improved results compared to the state-of-the-art approaches. We have open-sourced the code for MERLION at the following link https://github.com/MARVL-Lab/MERLION.git.

Abstract:
Maritime environments often present hazardous situations due to factors such as moving ships or buoys, which become obstacles under the influence of waves. In such challenging conditions, the ability to detect and track potentially hazardous objects is critical for the safe navigation of marine robots, but datasets capturing these scenarios remain limited. To address this limitation, we introduce a new multi-modal dataset that includes image and point-wise annotations of maritime obstacles. Our dataset provides detailed ground truth for obstacle detection and tracking, including objects as small as 10 × 10 pixels, which are crucial for maritime safety. To validate the dataset's effectiveness as a reliable benchmark, we conducted evaluations using various methodologies, including state-of-the-art (SOTA) techniques for object detection and tracking. These evaluations are expected to contribute to improving performance, particularly in the complex maritime environment. This represents the first demonstration of a dataset offering multi-modal annotations specifically tailored to maritime environments. Our dataset is available at https://github.com/sparolab/PoLaRIS.

Abstract:
This paper proposes the ProxFly, a residual deep Reinforcement Learning (RL)-based controller for close proximity quadcopter flight. Specifically, we design a residual module on top of a cascaded controller (denoted as basic controller) to generate high-level control commands, which compensate for external disturbances and thrust loss caused by downwash effects from other quadcopters. First, our method takes only the ego state and controllers' commands as inputs and does not rely on any communication between quadcopters, thereby reducing the bandwidth requirement. Through domain randomization, our method relaxes the requirement for accurate system identification and fine-tuned controller parameters, allowing it to adapt to changing system models. Meanwhile, our method not only reduces the proportion of unexplainable signals from the black box in control commands but also enables the RL training to skip the time-consuming exploration from scratch via guidance from the basic controller. We validate the effectiveness of the residual module in the simulation with different proximities. Moreover, we conduct the real close proximity flight test to compare ProxFly with the basic controller and an advanced model-based controller with complex aerodynamic compensation. Finally, we show that ProxFly can be used for challenging quadcopter midair docking, where two quadcopters fly in extreme proximity, and strong airflow significantly disrupts flight. However, our method can stabilize the quadcopter in this case and accomplish docking. The resources are available at https://github.com/ruiqizhang99/ProxFly.

Abstract:
This paper addresses the problem of task planning for robots that must comply with operational manuals in real-world settings. Task planning under these constraints is essential for enabling autonomous robot operation in domains that require adherence to domain-specific knowledge. Current methods for generating robot goals and plans rely on common sense knowledge encoded in large language models. However, these models lack grounding of robot plans to domain-specific knowledge and are not easily transferable between multiple sites or customers with different compliance needs. In this work, we present SayComply, which enables grounding robotic task planning with operational compliance using retrievalbased language models. We design a hierarchical database of operational, environment, and robot embodiment manuals and procedures to enable efficient retrieval of the relevant context under the limited context length of the LLMs. We then design a task planner using a tree-based retrieval augmented generation (RAG) technique to generate robot tasks that follow user instructions while simultaneously complying with the domain knowledge in the database. We demonstrate the benefits of our approach through simulations and hardware experiments in real-world scenarios that require precise context retrieval across various types of context, outperforming the standard RAG method. Our approach bridges the gap in deploying robots that consistently adhere to operational protocols, offering a scalable and edge-deployable solution for ensuring compliance across varied and complex real-world environments. Project website: saycomply.github.io.

Abstract:
Distributed multi-robot teams are increasingly used for optimal coverage of domains with unknown density distributions, often modeled with Gaussian Processes (GPs). However, current methods rely on data sharing, raising privacy concerns and computational issues. We propose a Federated Learning (FL) approach that enables collaborative training of GP models without sharing raw data. To enhance scalability and efficiency, we introduce a filtering strategy that selects relevant data samples, minimizing computational load. Realistic simulations emulating real world scenarios demonstrate the effectiveness of our method in achieving robust environmental estimates with minimal data sharing and reduced complexity.

Abstract:
In this paper, we present a comprehensive dataset comprising 37.9 hours of sensor data collected from humanoid robots, including 18.3 hours of walking and 2,519 recorded falls. This extensive dataset is a valuable resource for various robotics and machine learning applications. Leveraging this data, we propose RePro-TCN, a Temporal Convolutional Network (TCN) enhanced with two novel extensions: Relaxed Loss Formulation and Progressive Forecasting. Predicting falls is a critical capability in humanoid robotics for implementing countermeasures such as lunging or stopping the walk. Thanks to the new dataset, we train RePro-TCN and demonstrate its superiority over previous approaches under real-world conditions that were previously unattainable.

Abstract:
We introduce a unified approach to forecast the dynamics of human keypoints along with the motion trajectory based on a short sequence of input poses. While many studies address either full-body pose prediction or motion trajectory prediction, only a few attempt to merge them. We propose a motion transformation technique to simultaneously predict full-body pose and trajectory key-points in a global coordinate frame. We utilize an off-the-shelf 3D human pose estimation module, a graph attention network to encode the skeleton structure, and a compact, non-autoregressive transformer suitable for real-time motion prediction for human-robot interaction and human-aware navigation. We introduce a human navigation dataset “DARKO” with specific focus on navigational activities that are relevant for human-aware mobile robot navigation. We perform extensive evaluation on Human3.6M, CMU-Mocap, and our DARKO dataset. In comparison to prior work, we show that our approach is compact, real-time, and accurate in predicting human navigation motion across all datasets. Result animations, our dataset, and code will be available at https://nisarganc.github.io/UPTor-page/

Abstract:
Task-oriented grasping (TOG) is crucial for robots to accomplish manipulation tasks, requiring the determination of TOG positions and directions. Existing methods either rely on costly manual TOG annotations or only extract coarse grasping positions or regions from human demonstrations, limiting their practicality in real-world applications. To address these limitations, we introduce RTAGrasp, a Retrieval, Transfer, and Alignment framework inspired by human grasping strategies. Specifically, our approach first effortlessly constructs a robot memory from human grasping demonstration videos, extracting both TOG position and direction constraints. Then, given a task instruction and a visual observation of the target object, RTAGrasp retrieves the most similar human grasping experience from its memory and leverages semantic matching capabilities of vision foundation models to transfer the TOG constraints to the target object in a training-free manner. Finally, RTAGrasp aligns the transferred TOG constraints with the robot's action for execution. Evaluations on the public TOG benchmark, TaskGrasp dataset, show the competitive performance of RTAGrasp on both seen and unseen object categories compared to existing baseline methods. Real-world experiments further validate its effectiveness on a robotic arm. Our code, appendix, and video are available at https://sites.google.com/view/rtagrasp/home.

Abstract:
In complex missions such as search and rescue, robots must make intelligent decisions in unknown environments, relying on their ability to perceive and understand their surroundings. High-quality and real-time reconstruction enhances situational awareness and is crucial for intelligent robotics. Traditional methods often struggle with poor scene representation or are too slow for real-time use. Inspired by the efficacy of 3D Gaussian Splatting (3DGS), we propose a hierarchical planning framework for fast and high-fidelity active reconstruction. Our method evaluates completion and quality gain to adaptively guide reconstruction, integrating global and local planning for efficiency. Experiments in simulated and realworld environments show our approach outperforms existing real-time methods.

Abstract:
With the increasing integration of robots into human life, their role in architectural spaces where people spend most of their time has become more prominent. While motion capabilities and accurate localization for automated robots have rapidly developed, the challenge remains to generate efficient, smooth, comprehensive, and high-quality trajectories in these areas. In this paper, we propose a novel efficient planner for ground robots to autonomously navigate in large complex multi-layered architectural spaces. Considering that traversable regions typically include ground, slopes, and stairs, which are planar or nearly planar structures, we simplify the problem to navigation within and between complex intersecting planes. We first extract traversable planes from 3D point clouds through segmenting, merging, classifying, and connecting to build a plane-graph, which is lightweight but fully represents the traversable regions. We then build a trajectory optimization based on motion state trajectory and fully consider special constraints when crossing multi-layer planes to maximize the robot's maneuverability. We conduct experiments in simulated environments and test on a CubeTrack robot in real-world scenarios, validating the method's effectiveness and practicality.

Abstract:
Ahstract-Achieving time-optimal flight in real time for multi-drone systems presents significant challenges, particularly in scenarios requiring rapid responses or aggressive maneuvers. This paper introduces a novel framework that bridges the gap between time-optimal polynomial trajectory generation and optimal control, facilitating efficient online replanning (100 Hz onboard) for multiple quadrotors. Specifically, the proposed method leverages a neural network to learn optimal time allocations for polynomial trajectories, which are then integrated with Model Predictive Contouring Control to fully exploit the dynamics of quadrotors. We further extend this approach to multi-drone systems, enabling collaborative high-speed flight with reciprocal collision avoidance. We benchmark the time-optimal performance and computational efficiency of our method in a drone racing scenario and demonstrate its effectiveness in agile cooperative flight within more constrained simulation and real-world environments. The results demonstrate that the proposed method achieves agile waypoint traverse at a speed of up to 19 m/s in simulation and up to 9 m/s in two-drone real-world scenario. [video44https://www.youtube.com/watch?v=KE97sKwYpAs]

Abstract:
While undulatory swimming of elongate limbless robots has been extensively studied in open hydrodynamic environments, less research has been focused on limbless locomotion in complex, cluttered aquatic environments. Motivated by the concept of mechanical intelligence [1], where controls for obstacle navigation can be offloaded to passive body mechanics in terrestrial limbless locomotion, we hypothesize that principles of mechanical intelligence can be extended to cluttered hydrodynamic regimes. To test this, we developed an untethered limbless robot capable of undulatory swimming on water surfaces, utilizing a bilateral cable-driven mechanism inspired by organismal muscle actuation morphology to achieve programmable anisotropic body compliance. We demonstrated through robophysical experiments that, similar to terrestrial locomotion, an appropriate level of body compliance can facilitate emergent swim through complex hydrodynamic environments under pure open-loop control. Moreover, we found that swimming performance depends on undulation frequency, with effective locomotion achieved only within a specific frequency range. This contrasts with highly damped terrestrial regimes, where inertial effects can often be neglected. Further, to enhance performance and address the challenges posed by nondeterministic obstacle distributions, we incorporated computational intelligence by developing a real-time body compliance tuning controller based on cable tension feedback. This controller improves the robot's robustness and overall speed in heterogeneous hydrodynamic environments.

Abstract:
Vision-based tactile sensors have recently gained prominence due to their superior resolution and ability to capture multi-dimensional contact information. However, even when sensors share the same sensing principle, variations in production factors can lead to differences in the color patterns of tactile signals. Unlike common vision tasks, vision-based tactile perception depends on tracking light variation in colorful signals, making it more susceptible to lighting conditions and thus more prone to domain gaps. In this paper, we propose an Omni-hardness perception framework that enables adaptation across various vision-based tactile sensors. Firstly, in-depth analyses of the factors influencing the generalization of hardness perception are presented. Furthermore, the light balance module and the force scale module are coupled to regulate network learning of generalized representations. Experimental results across multiple sensors demonstrate the transferability of learned representations. Additionally, downstream tasks in natural object perception, tumor detection, and grasping stability prediction, are proposed to evaluate the potential applications. The framework's performance shows promise for advancing general tactile sensing and embodied tactile perception.

Abstract:
Human-level autonomous driving is an ever-elusive goal, with planning and decision making - the cognitive functions that determine driving behavior - posing the greatest challenge. Despite a proliferation of promising approaches, progress is stifled by the difficulty of deploying experimental planners in naturalistic settings. In this work, we propose Lab2Car, an optimization-based wrapper that can take a trajectory sketch from an arbitrary motion planner and convert it to a safe, comfortable, dynamically feasible trajectory that the car can follow. This allows motion planners that do not provide such guarantees to be safely tested and optimized in real-world environments. We demonstrate the versatility of Lab2Car by using it to deploy a machine learning (ML) planner and a classical planner on self-driving cars in Las Vegas. The resulting systems handle challenging scenarios, such as cut-ins, overtaking, and yielding, in complex urban environments like casino pick-up/drop-off areas. Our work paves the way for quickly deploying and evaluating candidate motion planners in realistic settings, ensuring rapid iteration and accelerating progress towards human-level autonomy.

Abstract:
Path planning algorithms, such as the searchbased A, are a critical component of autonomous mobile robotics, enabling robots to navigate from a starting point to a destination efficiently and safely. We investigated the resilience of the A^ algorithm in the face of potential adversarial interventions known as obstacle attacks. The adversary's goal is to delay the robot's timely arrival at its destination by introducing obstacles along its original path. We developed malicious software to execute the attacks and conducted experiments to assess their impact, both in simulation using TurtleBot in Gazebo and in real-world deployment with the Unitree Go1 robot. In simulation, the attacks resulted in an average delay of 36 %, with the most significant delays occurring in scenarios where the robot was forced to take substantially longer alternative paths. In real-world experiments, the delays were even more pronounced, with all attacks successfully rerouting the robot and causing measurable disruptions. These results highlight that the algorithm's robustness is not solely an attribute of its design but is significantly influenced by the operational environment. For example, in constrained environments like tunnels, the delays were maximized due to the limited availability of alternative routes.

Abstract:
Automatic Emergency Braking (AEB) systems are a crucial component in ensuring the safety of passengers in autonomous vehicles. Conventional AEB systems primarily rely on closed-set perception modules to recognize traffic conditions and assess collision risks. To enhance the adaptability of AEB systems in open scenarios, we propose Dual-AEB, a system combines an advanced multimodal large language model (MLLM) for comprehensive scene understanding and a conventional rule-based rapid AEB to ensure quick response times. To the best of our knowledge, Dual-Aebis the first method to incorporate MLLMs within AEB systems. Through extensive experimentation, we have validated the effectiveness of our method. Codes will be publicly available at https://github.com/ChipsICU/Dual-AEB.

Abstract:
This paper proposes an assembly sequence planning framework, named Subassembly to Assembly (S2A). The framework is designed to enable a robotic manipulator to assemble multiple parts in a prespecified structure by leveraging object manipulation actions. The primary technical challenge lies in the exponentially increasing complexity of identifying a feasible assembly sequence as the number of parts grows. To address this, we introduce a graph-based reinforcement learning approach, where a graph attention network is trained using a delayed reward assignment strategy. In this strategy, rewards are assigned only when an assembly action contributes to the successful completion of the assembly task. We validate the framework's performance through physics-based simulations, comparing it against various baselines to emphasize the significance of the proposed reward assignment approach. Additionally, we demonstrate the feasibility of deploying our framework in a real-world robotic assembly scenario.

Abstract:
The Real2Sim2Real (R2S2R) paradigm is critical for advancing robotic learning. Existing methods lack a comprehensive solution to accurately reconstruct real-world objects with both spatial representations and their associated physics attributes in the Real2Sim stage. We propose a Real2Sim pipeline to generate digital assets enabling high-fidelity simulation. We design a hybrid repre-sentation model that integrates mesh geometry, 3D Gaussian kernels, and physics attributes to enhance the representation of robotic arms in digital assets. This hybrid representation is implemented through a Gaussian-Mesh-Pixel binding technique, which establishes an isomorphic mapping between mesh vertices and the Gaussian model. This enables a fully differentiable rendering pipeline that can be optimized through numerical solvers, achieves high-fidelity rendering via Gaussian Splatting, and facilitates physically plausible simulation of the robotic arm's interaction with its environment through mesh geometry. With the digital assets, we propose a fully manipulable Real2Sim pipeline that standardizes coordinate systems and scales, ensuring the seamless integration of multiple components. To demonstrate its effectiveness, we include datasets covering various robotic manipulation tasks with their mesh reconstructions. Our model achieves state-of-the-art results in realistic rendering and mesh reconstruction quality for robotic applications. Our code and datasets will be made publicly available at robostudioapp.com.

Abstract:
Radioguided surgery, such as sentinel lymph node biopsy, relies on the precise localization of radioactive targets by non-imaging gamma/beta detectors. Manual radioactive target detection based on visual display or audible indication of gamma level is highly dependent on the ability of the surgeon to track and interpret the spatial information. This paper presents a learning-based method to realize the autonomous radiotracer detection in robot-assisted surgeries by navigating the probe to the radioactive target. We proposed novel hybrid approach that combines deep reinforcement learning (DRL) with adaptive robotic scanning. The adaptive grid-based scanning could provide initial direction estimation while the DRL-based agent could efficiently navigate to the target utilising historical data. Simulation experiments demonstrate a 95% success rate, and improved efficiency and robustness compared to conventional techniques. Real-world evaluation on the da Vinci Research Kit (dVRK) further confirms the feasibility of the approach, achieving an 80% success rate in radiotracer detection. This method has the potential to enhance consistency, reduce operator dependency, and improve procedural accuracy in radioguided surgeries.

Abstract:
With the growing popularity of electric vehicles, the demand for robot-based unmanned automatic charging has become both urgent and challenging. Two key challenges need to be addressed: how to efficiently locate the charging port, and how to compliantly insert the connector into the port. In this paper, we propose an incremental learning method based on the broad learning system to address the visual positioning error of the charging port. This method allows the robot to transfer and generalize the search skills learned in simulation to real-world scenarios. As a result, the robot can rapidly locate the charging port in real-world environments without the need for complex contact state modeling, time-consuming data collection, or model retraining. Subsequently, a biomimetic admittance controller is designed to enable the robot to adapt its compliant behavior online during the plugging process. Finally, experiments are performed on a UR robot to verify the effectiveness of our method.

Abstract:
The planning problem constitutes a fundamental aspect of the autonomous driving framework. Recent strides in representation learning have empowered vehicles to comprehend their surrounding environments, thereby facilitating the integration of learning-based planning strategies. Among these approaches, Imitation Learning stands out due to its notable training efficiency. However, traditional Imitation Learning methodologies encounter challenges associated with the co-variate shift phenomenon. We propose Validity Learning on Failures, VL(on failure), as a remedy to address this issue. The essence of our method lies in deploying a pre-trained planner across diverse scenarios. Instances where the planner deviates from its immediate objectives, such as maintaining a safe distance from obstacles or adhering to traffic rules, are flagged as failures. The states corresponding to these failures are compiled into a new dataset, termed the failure dataset. Notably, the absence of expert annotations for this data precludes the applicability of standard imitation learning approaches. To facilitate learning from the closed-loop mistakes, we introduce the VL objective which aims to discern valid trajectories within the current environmental context. Experimental evaluations conducted on both reactive CARLA simulation and non-reactive log-replay simulations reveal substantial enhancements in closed-loop metrics such as Score, Progress, and Success Rate, underscoring the effectiveness of the proposed methodology. Further evaluations against the Bench2Drive benchmark demonstrate that VL(on failure) outperforms the state-of-the-art methods by a large margin.

Affiliations: School of Computer Science and Technology, University of Science and Technology of China, Hefei, China; Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei, China; School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China; Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei, China

Abstract:
Large-scale scene point cloud registration with limited overlap is a challenging task due to computational load and constrained data acquisition. To tackle these issues, we propose a point cloud registration method, MT-PCR, based on Modality Transformation. MT-PCR leverages a Bird's Eye View (BEV) capturing the maximal overlap information to improve the accuracy and utilizes images to provide complementary spatial features. Specifically, MT-PCR converts 3D point clouds to BEV images and estimates correspondence by 2D image keypoints extraction and matching. Subsequently, the 2D correspondence estimates are then transformed back to 3D point clouds using inverse mapping. We have applied MT-PCR to Terrestrial Laser Scanning (TLS) and Aerial Laser Scanning (ALS) point cloud registration on the GrAco dataset, involving 8 low-overlap, square-kilometer scale registration scenarios. Experiments and comparisons with commonly used methods demonstrate that MT-PCR can achieve superior accuracy and robustness in large-scale scenes with limited overlap.

Abstract:
Autonomous navigation in unknown environments is a fundamental challenge in robotics, particularly in coordinating ground and aerial robots to maximize exploration efficiency. This paper presents a novel approach that utilizes a hierarchical graph to represent the environment, encoding both geometric and semantic traversability. The framework enables the robots to compute a shared confidence metric, which helps the ground robot assess terrain and determine when deploying the aerial robot will extend exploration. The robot's confidence in traversing a path is based on factors such as predicted volumetric gain, path traversability, and collision risk. A hierarchy of graphs is used to maintain an efficient representation of traversability and frontier information through multi-resolution maps. Evaluated in a real subterranean exploration scenario, the approach allows the ground robot to autonomously identify zones that are no longer traversable but suitable for aerial deployment. By leveraging this hierarchical structure, the ground robot can selectively share graph information on confidence-assessed frontier targets from parts of the scene, enabling the aerial robot to navigate beyond obstacles and continue exploration.

Abstract:
The capability of effectively moving on complex terrains such as sand and gravel can empower our robots to robustly operate in outdoor environments, and assist with critical tasks such as environment monitoring, search-and-rescue, and supply delivery. Inspired by the Mount Lyell salamander's ability to curl its body into a loop and effectively roll down hill slopes, in this study we develop a sand-rolling robot and investigate how its locomotion performance is governed by the shape of its body. We experimentally tested three different body shapes: Hexagon, Quadrilateral, and Triangle. We found that Hexagon and Triangle can achieve a faster rolling speed on sand, but exhibited more frequent failures of getting stuck. Analysis of the interaction between robot and sand revealed the failure mechanism: the deformation of the sand produced a local “sand incline” underneath robot contact segments, increasing the effective region of supporting polygon (ERSP) and preventing the robot from shifting its center of mass (CoM) outside the ERSP to produce sustainable rolling. Based on this mechanism, a highly-simplified model successfully captured the critical body pitch for each rolling shape to produce sustained rolling on sand, and informed design adaptations that mitigated the locomotion failures and improved robot speed by more than 200%. Our results provide insights into how locomotors can utilize different morphological features to achieve robust rolling motion across deformable substrates.

Abstract:
Adaptive locomotion is a fundamental capability for quadruped robots, particularly in real-world scenarios when they must transport novel or out-of-distribution (O.O.D.) payloads across diverse terrains. Previous learning-based methods often tightly couple a locomotion controller's learned parameters with the adaptation process, which requires extensive pre-training or slow online updates when encountering O.O.D. payloads. To enable adaptation of quadruped locomotion to O.O.D. payloads, we propose the novel Rapid Introspective Neural Adaptation (RINA) method that rapidly compensates for differences between expected and actual joint torques caused by O.O.D. payloads. RINA introduces an adaptive residual dynamics representation that decouples the learning model's parameters from those used for adaptation. A new neural operator network is introduced to learn a set of basis functions as the learning model, which are combined using linear coefficients to predict residual dynamics. Then, these residual dynamics are used to adjust the locomotion controller's output, compensating for additional torques induced by the O.O.D. payload. During execution, the mixing coefficients can be rapidly and introspectively adapted on-the-go to generate joint torque compensations for O.O.D. payloads, while keeping the learned basis functions unchanged. Experimental results have demonstrated that our RINA approach well addresses on-the-go O.O.D. payload adaptation on varied natural terrains without collecting and retraining on additional data and outperforms baseline methods. More details of this work are provided on the project website: https://hcrlab.gitlab.io/project/rina.

Abstract:
Recent advances in quadrupedal robots have demonstrated impressive agility and the ability to traverse diverse terrains. However, hardware issues, such as motor overheating or joint locking, may occur during long-distance walking or traversing through rough terrains leading to locomotion failures. Although several studies have proposed fault-tolerant control methods for quadrupedal robots, there are still challenges in traversing unstructured terrains. In this paper, we propose DreamFLEX, a robust fault-tolerant locomotion controller that enables a quadrupedal robot to traverse complex environments even under joint failure conditions. DreamFLEX integrates an explicit failure estimation and modulation network that jointly estimates the robot's joint fault vector and utilizes this information to adapt the locomotion pattern to faulty conditions in real-time, enabling quadrupedal robots to maintain stability and performance in rough terrains. Experimental results demonstrate that DreamFLEX outperforms existing methods in both simulation and real-world scenarios, effectively managing hardware failures while maintaining robust locomotion performance.

Abstract:
This paper is about generating motion plans for high degree-of-freedom systems that account for both static and dynamic collisions along the entire body. A particular class of mathematical programs with complementarity constraints become useful in this regard. Optimization-based planners can tackle confined space trajectory planning while being cognizant of robot and (mostly static) obstacle constraints. However, handling moving obstacles is non-trivial in a real-time setting. To this end, we present the FLIQC (Fast LInear Quadratic Complementarity based) motion planner. Our reactive planner employs a novel motion model that captures the entire rigid robot as well as the obstacle geometry and ensures nonpenetration between the surfaces due to the imposed constraint. We perform thorough comparative studies with the state-of-the-art, which demonstrate improved performance. Extensive simulation and hardware experiments validate our claim of generating continuous and real-time motion plans at 1 kHz for modern collaborative robots with constant minimal parameters.

Abstract:
Achieving rapid and accurate dynamic obstacle avoidance is crucial for enhancing the survivability of unmanned aerial vehicles (UAVs) in hazardous conditions. To accomplish dynamic obstacle avoidance, sensors with high temporal resolution and efficient processing models are required. Dynamic vision sensors (DVS) fulfill the sensing requirements, while spiking neural networks (SNNs) address the processing demands. In this paper, we develop an end-to-end obstacle avoidance algorithm for UAVs using only a single monocular DVS as the sensor and further enhance accuracy and speed through our proposed mechanisms. The algorithm consists of three components: ego-motion compensation, an SNN model for movement analysis, and a force filter inspired by spiking neurons. In movement analysis, we propose the temporal potential pooling (TPP) and incremental event (EI) mechanisms to accelerate our SNN model. The real-flight experiments confirm that our algorithm achieves approximately 90% accuracy with a processing latency as low as 4ms on a GPU, surpassing state-of-the-art methods. Ablation studies show that the proposed method maintains high accuracy in movement detection while significantly reducing computational time. Our method operates in real-time, achieves high accuracy, and is feasible across a wide range of environments. Our code is available at https://github.com/AmperiaWang/oanet_s1 for reproducibility.

Affiliations: Institute of Systems Engineering and Collaborative Laboratory for Intelligent Science and Systems, Macau University of Science and Technology; College of Information Science and Technology, Beijing University of Chemical Technology; School of Astronautics, Harbin Institute of Technology; Institute for AI Industry Research (AIR), Tsinghua University; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Mechanical and Vehicular Engineering, Beijing Institute of Technology

Abstract:
The increasing demand for efficient last-mile delivery in smart logistics underscores the role of autonomous robots in enhancing operational efficiency and reducing costs. Traditional navigation methods, which depend on highprecision maps, are resource-intensive, while learning-based approaches often struggle with generalization in real-world scenarios. To address these challenges, this work proposes the Openstreetmap-enhanced oPen-air sEmantic Navigation (OPEN) system that combines foundation models with classic algorithms for scalable outdoor navigation. The system uses off-the-shelf OpenStreetMap (OSM) for flexible map representation, thereby eliminating the need for extensive pre-mapping efforts. It also employs Large Language Models (LLMs) to comprehend delivery instructions and Vision-Language Models (VLMs) for global localization, map updates, and house number recognition. To compensate the limitations of existing benchmarks that are inadequate for assessing last-mile delivery, this work introduces a new benchmark specifically designed for outdoor navigation in residential areas, reflecting the real-world challenges faced by autonomous delivery systems. Extensive experiments in simulated and real-world environments demonstrate the proposed system's efficacy in enhancing navigation efficiency and reliability. To facilitate further research, our code and benchmark are publicly available11https://ei-nav.github.io/OpenBench/.

Abstract:
In physical Human-Robot Collaboration (pHRC), accurate human intent estimation and rational human-robot role allocation are crucial for safe and efficient assistance. Existing methods that rely on short-term motion data for intention estimation lack multi-step prediction capabilities, hindering their ability to sense intent changes and adjust human-robot assignments autonomously, resulting in potential discrepancies. To address these issues, we propose a Dual Transformer-based Robot Trajectron (DTRT) featuring a hierarchical architecture, which harnesses human-guided motion and force data to rapidly capture human intent changes, enabling accurate trajectory predictions and dynamic robot behavior adjustments for effective collaboration. Specifically, human intent estimation in DTRT uses two Transformer-based Conditional Variational Autoencoders (CVAEs), incorporating robot motion data in obstacle-free case with human-guided trajectory and force for obstacle avoidance. Additionally, Differential Cooperative Game Theory (DCGT) is employed to synthesize predictions based on human-applied forces, ensuring robot behavior align with human intention. Compared to state-of-the-art (SOTA) methods, DTRT incorporates human dynamics into long-term prediction, providing an accurate understanding of intention and enabling rational role allocation, achieving robot autonomy and maneuverability. Experiments demonstrate DTRT's accurate intent estimation and superior collaboration performance.

Abstract:
We investigated the performance of existing semiand fully autonomous methods for controlling flipper-based skid-steer robots. Our study involves the reimplementation of these methods for a fair comparison, and it introduces a novel semi-autonomous control policy that provides a compelling trade-off among current state-of-the-art approaches. We also propose new metrics for assessing cognitive load and traversal quality and offer a benchmarking interface for generating Quality-Load graphs from recorded data. Our results, presented in a 2D Quality-Load space, demonstrate that the new control policy effectively bridges the gap between autonomous and manual control methods. Additionally, we reveal a surprising fact that fully manual, continuous control of all six degrees of freedom remains highly effective when performed by an experienced operator on a well-designed analog controller from a third-person view.

Abstract:
High-precision manipulation has always been a developmental goal for aerial manipulators. This paper investigates the kinematic coordinate control issue in aerial manipulators. We propose a predictive kinematic coordinate control method, which includes a learning-based modified kinematic model and a model predictive control (MPC) scheme based on weight allocation. Compared to existing methods, our proposed approach offers several attractive features. First, the kinematic model incorporates closed-loop dynamics characteristics and online residual learning. Compared to methods that do not consider closed-loop dynamics and residuals, our proposed method has improved accuracy by 59.6%. Second, a MPC scheme that considers weight allocation has been proposed, which can coordinate the motion strategies of quadcopters and manipulators. Compared to methods that do not consider weight allocation, the proposed method can meet the requirements of more tasks. The proposed approach is verified through complex trajectory tracking and moving target tracking experiments. The results validate the effectiveness of the proposed method.

Abstract:
Visuotactile sensors are a popular tactile sensing strategy due to high-fidelity estimates of local object geometry. However, existing algorithms for processing raw sensor inputs to useful intermediate signals such as contact patches struggle in high-deformation regimes. This is due to physical constraints imposed by sensor hardware and small-deformation assumptions used by mechanics-based models. In this work, we propose a fusion algorithm for proximity and visuotactile point clouds for contact patch segmentation, entirely independent from membrane mechanics. This algorithm exploits the synchronous, high spatial resolution proximity and visuotactile modalities enabled by an extremely deformable, selectively transmissive soft membrane, which uses visible light for visuotactile sensing and infrared light for proximity depth. We evaluate our contact patch algorithm in low (\mathbf1 0 %), medium (\mathbf6 0 %), and high (100 %+) strain states. We compare our method against three baselines: proximity-only, tactile-only, and a first principles mechanics model. Our approach outperforms all baselines with an average RMSE under 2.8 mm of the contact patch geometry across all strain ranges. We demonstrate our contact patch algorithm in four applications: varied stiffness membranes, torque and shear-induced wrinkling, closed loop control, and pose estimation.

Abstract:
Electrical Resistance Tomography (ERT) has emerged as a promising technology for large-area robotic skin due to its ability to reconstruct pressure distribution over extensive regions using a few sparsely distributed electrodes. Despite ERT's potential to reconstruct the external forces applied on 3D surfaces, the uneven distribution of spatial sensitivity leads to significant errors in identifying the physical quantities of contacts, inhibiting this technique from being an effective tactile sensor. To address this issue, this paper proposes a method to equalize the spatial sensitivity by modulating the conductivity of ERT sensors through topology optimization. In a simulation environment, the sensor's conductive domain was converted into a binary image and optimized to equalize spatial sensitivity and reduce disparities between low and highsensitivity areas. Additionally, we present a sensor fabrication method with a complex optimized conductive patch pattern from simulation by applying screen printing techniques. The effectiveness of the implemented spatial sensitivity equalization was validated by comparing it to a conventional ERT sensor in both simulations and real-world environments. The proposed sensitivity optimization method expands the use of ERT-based sensors for distributed tactile sensing in physical human-robot interaction scenarios.

Abstract:
Autonomous racing presents a complex environment requiring robust controllers capable of making rapid decisions under dynamic conditions. While traditional controllers based on tire models are reliable, they often demand extensive tuning or system identification. Reinforcement Learning (RL) methods offer significant potential due to their ability to learn directly from interaction, yet they typically suffer from the Sim-to-Real gap, where policies trained in simulation fail to perform effectively in the real world. In this paper, we propose RLPP, a residual RL framework that enhances a Pure Pursuit (PP) controller with an RL-based residual. This hybrid approach leverages the reliability and interpretability of PP while using RL to fine-tune the controller's performance in real-world scenarios. Extensive testing on the F1TENTH platform demonstrates that RLPP improves lap times of the baseline controllers by up to 6.37 %, closing the gap to the State-of-the-Art (SotA) methods by more than 52 % and providing reliable performance in zero-shot real-world deployment, overcoming key challenges associated with the Sim-to-Real transfer and reducing the performance gap from simulation to reality by more than 8 -fold when compared to the baseline RL controller. The RLPP framework is made available as an open-source tool, encouraging further exploration and advancement in autonomous racing research. The code is available at: www.github.com/forzaeth/rlpp.

Abstract:
Ensuring consistent processing quality is challenging in laser processes due to varying material properties and surface conditions. Although some approaches have shown promise in solving this problem via automation, they often rely on predetermined targets or are limited to simulated environments. To address these shortcomings, we propose a novel real-time reinforcement learning approach for laser process control, implemented on a Field Programmable Gate Array to achieve real-time execution. Our experimental results from laser welding tests on stainless steel samples with a range of surface roughnesses validated the method's ability to adapt autonomously, without relying on reward engineering or prior setup information. Specifically, the algorithm learned the optimal power profile for each unique surface characteristic, demonstrating significant improvements over handengineered optimal constant power strategies - up to 23% better performance on rougher surfaces and 7% on mixed surfaces. This approach represents a significant advancement in automating and optimizing laser processes, with potential applications across multiple industries.

Abstract:
As the world population is expected to reach 10 billion by 2050, our agricultural production system needs to double its productivity despite a decline of human workforce in the agricultural sector. Autonomous robotic systems are one promising pathway to increase productivity by taking over labor-intensive manual tasks like fruit picking. To be effective, such systems need to monitor and interact with plants and fruits precisely, which is challenging due to the cluttered nature of agricultural environments causing, for example, strong occlusions. Thus, being able to estimate the complete 3D shapes of objects in presence of occlusions is crucial for automating operations such as fruit harvesting. In this paper, we propose the first publicly available 3D shape completion dataset for agricultural vision systems. We provide an RGB-D dataset for estimating the 3D shape of fruits. Specifically, our dataset contains RGB-D frames of single sweet peppers in lab conditions but also in a commercial greenhouse. For each fruit, we additionally collected high-precision point clouds that we use as ground truth. For acquiring the ground truth shape, we developed a measuring process that allows us to record data of real sweet pepper plants, both in the lab and in the greenhouse with high precision, and determine the shape of the sensed fruits. We release our dataset, consisting of almost 7,000 RGB-D frames belonging to more than 100 different fruits. We provide segmented RGB-D frames, with camera intrinsics to easily obtain colored point clouds, together with the corresponding high-precision, occlusion-free point clouds obtained with a high-precision laser scanner. We additionally enable evaluation of shape completion approaches on a hidden test set through a public challenge on a benchmark server.

Abstract:
Fruit monitoring plays an important role in crop management, and rising global fruit consumption combined with labor shortages necessitates automated monitoring with robots. However, occlusions from plant foliage often hinder accurate shape and pose estimation. Therefore, we propose an active fruit shape and pose estimation method that physically manipulates occluding leaves to reveal hidden fruits. This paper introduces a framework that plans robot actions to maximize visibility and minimize leaf damage. We developed a novel scene-consistent shape completion technique to improve fruit estimation under heavy occlusion and utilize a perception-driven deformation graph model to predict leaf deformation during planning. Experiments on artificial and real sweet pepper plants demonstrate that our method enables robots to safely move leaves aside, exposing fruits for accurate shape and pose estimation, outperforming baseline methods. Project page: https://shaoxiongyao.github.io/lmap-ssc/.

Abstract:
Motion planning is a central challenge in robotics, with learning-based approaches gaining significant attention in recent years. Our work focuses on a specific aspect of these approaches: using machine-learning techniques, particularly Support Vector Machines (SVM), to evaluate whether robot configurations are collision free, an operation termed “collision detection”. Despite the growing popularity of these methods, there is a lack of theory supporting their efficiency and prediction accuracy. This is in stark contrast to the rich theoretical results of machine-learning methods in general and of SVMs in particular. Our work bridges this gap by analyzing the sample complexity of an SVM classifier for learning-based collision detection in motion planning. We bound the number of samples needed to achieve a specified accuracy at a given confidence level. This result is stated in terms relevant to robot motion-planning such as the system's clearance. Building on these theoretical results, we propose a collision-detection algorithm that can also provide statistical guarantees on the algorithm's error in classifying robot configurations as collision-free or not.

Abstract:
Navigation amongst densely packed crowds remains a challenge for mobile robots. The complexity increases further if the environment layout changes, making the prior computed global plan infeasible. In this paper, we show that it is possible to dramatically enhance crowd navigation by just improving the local planner. Our approach combines generative modelling with inference-time optimization to generate sophisticated long-horizon local plans at interactive rates. More specifically, we train a Vector Quantized Variational AutoEncoder to learn a prior over the expert trajectory distribution conditioned on the perception input. At run-time, this is used as an initialization for a sampling-based optimizer for further refinement. Our approach does not require any sophisticated prediction of dynamic obstacles and yet provides state-of-theart performance. In particular, we compare against the recent DRL-VO approach [2] and show a 40% improvement in success rate and a 6% improvement in travel time.

Abstract:
This paper introduces a novel proprioceptive state estimator for legged robots that combines model-based filters with deep neural networks. In environments where vision systems are not reliable, proprioceptive state estimators become indispensable. Traditionally, proprioceptive state estimators are based on model-based approaches, which rely solely on contact foot kinematics as measurements. In contrast, learning-based approaches have obtained new measurements, such as displacement and covariance, by leveraging real-world data in a supervised manner. In this work, we develop a state estimation framework that trains a neural measurement network (NMN) to estimate the base's linear velocity and foot contact probability, which are then employed as measurements in an invariant extended Kalman filter. Our approach relies solely on simulation data for training, as it allows us to obtain extensive data easily. We address the sim-to-real gap by adapting existing learning techniques and regularization. To validate our proposed method, we conduct hardware experiments using a quadruped robot on four types of terrain: flat, debris, soft, and slippery. In our experiments, the proposed method demonstrates significant improvements over the model-based state estimator, achieving an average reduction in Absolute Trajectory Error (ATE) by 6 1. 8 % for position and 8. 5 % for velocity.

Affiliations: Department of Control Science and Engineering, College of Electronics and Information Engineering, Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai, China; Department of Electrical and Electronic Engineering, Faculty of Engineering, The University of Hong Kong, Hong Kong, China; Robotics and Microsystems Center, School of Mechanical and Electric Engineering, Soochow University, Suzhou, Jiangsu, China; School of Oriental Pan-Vascular Devices Innovation College, University of Shanghai for Science and Technology, Shanghai, China

Abstract:
Robotic-assisted percutaneous coronary intervention (PCI) holds considerable promise for elevating precision and safety in cardiovascular procedures. Nevertheless, current systems heavily depend on human operators, resulting in variability and the potential for human error. To tackle these challenges, Sim4EndoR, an innovative reinforcement learning (RL) based simulation environment, is first introduced to bolster task-level autonomy in PCI. This platform offers a comprehensive and risk-free environment for the development, evaluation, and refinement of potential autonomous systems, enhancing data collection efficiency and minimizing the need for costly hardware trials. A notable aspect of the groundbreaking Sim4EndoR is its reward function, which takes into account the anatomical constraints of the vascular environment, utilizing the geometric characteristics of vessels to steer the learning process. By seamlessly integrating advanced physical simulations with neural network-driven policy learning, Sim4EndoR fosters efficient sim-to-real translation, paving the way for safer, more consistent robotic interventions in clinical practice, ultimately improving patient outcomes.

Abstract:
Imitation learning is a promising approach for learning robot policies with user-provided data. The way demonstrations are provided, i.e., demonstration modality, influences the quality of the data. While existing research shows that kinesthetic teaching (physically guiding the robot) is preferred by users for the intuitiveness and ease of use, the majority of existing manipulation datasets were collected through teleoperation via a VR controller or spacemouse. In this work, we investigate how different demonstration modalities impact downstream learning performance as well as user experience. Specifically, we compare low-cost demonstration modalities including kinesthetic teaching, teleoperation with a VR controller, and teleoperation with a spacemouse controller. We experiment with three table-top manipulation tasks with different motion constraints. We evaluate and compare imitation learning performance using data from different demonstration modalities, and collected subjective feedback on user experience. Our results show that kinesthetic teaching is rated the most intuitive for controlling the robot and provides cleanest data for best downstream learning performance. However, it is not preferred as the way for large-scale data collection due to the physical load. Based on such insight, we propose a simple data collection scheme that relies on a small number of kinesthetic demonstrations mixed with data collected through teleoperation to achieve the best overall learning performance while maintaining low data-collection effort.

Abstract:
Multi-agent systems in non-stationary environments face challenges due to rapidly changing dynamics, leading to quick obsolescence of experiences in the replay buffer. To address this, we propose the Adaptive Experience Replay with Attention-Based Sequence Embedding (AERAS) framework, which integrates sequence embedding with an attention mechanism to prioritize experiences based on their relevance. By assigning adaptive weights, AERAS emphasizes relevant experiences while diminishing the impact of outdated ones, enhancing efficiency and learning performance in multi-agent reinforcement learning. Evaluations on the StarCraft II Multi-Agent Challenge and Google Research Football environments show that AERAS consistently outperforms state-of-the-art methods, achieving faster convergence and higher win rates. Ablation studies confirm the essential roles of sequence embedding and attention mechanisms in boosting AERAS's robustness and adaptability, underscoring its effectiveness in managing non-stationary environments within multi-agent systems.

Abstract:
Operating high degree of freedom robots can be difficult for users of wheelchair mounted robotic manipulators. Mode switching in Cartesian space has several drawbacks such as unintuitive control reference frames, separate translation and orientation control, and limited movement capabilities that hinder performance. We propose Point and Go mode switching, which reallocates the Cartesian mode switching reference frames into a more intuitive action space comprised of new translation and rotation modes. We use a novel sweeping motion to point the gripper, which defines the new translation axis along the robot base frame's horizontal plane. This creates an intuitive ‘point and go’ translation mode that allows the user to easily perform complex, human-like movements without switching control modes. The system's rotation mode combines position control with a refined endeffector oriented frame that provides precise and consistent robot actions in various end-effector poses. We verified its effectiveness through initial experiments, followed by a three-task user study that compared our method to Cartesian mode switching and a state of the art learning method. Results show that Point and Go mode switching reduced completion times by 31%, pauses by 41%, and mode switches by 33%, while receiving significantly favorable responses in user surveys.

Abstract:
Robotic simulators play a crucial role in the development and testing of autonomous systems, particularly in the realm of Uncrewed Aerial Vehicles (UAV). However, existing simulators often lack high-level autonomy, hindering their immediate applicability to complex tasks such as autonomous navigation in unknown environments. This limitation stems from the challenge of integrating realistic physics, photorealistic rendering, and diverse sensor modalities into a single simulation environment. At the same time, the existing photorealistic UAV simulators use mostly hand-crafted environments with limited environment sizes, which prevents the testing of long-range missions. This restricts the usage of existing simulators to only low-level tasks such as control and collision avoidance. To this end, we propose the novel FlightForge UAV opensource simulator. FlightForge offers advanced rendering capabilities, diverse control modalities, and, foremost, procedural generation of environments. Moreover, the simulator is already integrated with a fully autonomous UAV system capable of long-range flights in cluttered unknown environments. The key innovation lies in novel procedural environment generation and seamless integration of high-level autonomy into the simulation environment. Experimental results demonstrate superior sensor rendering capability compared to existing simulators, and also the ability of autonomous navigation in almost infinite environments.

Abstract:
The object navigation task requires robots to understand the semantic regularities in their environments. However, existing modular object navigation frameworks rely on instance segmentation models trained at fixed camera height viewpoints, limiting generalization performance and increasing labeling costs for new height viewpoints. To tackle this issue, we propose a semi-supervised method that transfers knowledge from a source height to a target height, minimizing the need for additional labels. Our approach introduces three key innovations: i) a projection policy to enhance the teacher model's detection capabilities at the target height, ii) a dynamic weight mechanism that emphasizes high-confidence pseudo-labels to reduce overfitting, and iii) a prototype contrast transferring method to transfer knowl-edge effectively. Experiments on the Habitat- Matterport 3D (HM3D) dataset show our method outperforms state-of-the-art semi-supervised techniques, improving both segmentation accuracy and navigation performance. The code is available at: https://github.com/FreeformRobotics/TransferKnowledge.

Abstract:
With the recent rise of large language models, vision-language models, and other general foundation models, there is growing potential for multimodal, multi-task robotics that can operate in diverse environments given natural language input. One such application is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the 3D spatial reasoning and semantic understanding required. Additionally, the language used may be imperfect or misaligned with the scene, further complicating the task. To address this challenge, we curate a benchmark dataset, IRef-VLA, for Interactive Referential Vision and Language-guided Action in 3D Scenes with imperfect references. IRef-VLA is the largest real-world dataset for the referential grounding task, consisting of over 11.5 K scanned 3D rooms from existing datasets, 7.6 M heuristically generated semantic relations, and 4.7 M referential statements. Our dataset also contains semantic object and room annotations, scene graphs, navigable free space annotations, and is augmented with statements where the language has imperfections or ambiguities. We verify the generalizability of our dataset by evaluating with state-of-the-art models to obtain a performance baseline and also develop a graphsearch baseline to demonstrate the performance bound and generation of alternatives using scene-graph knowledge. With this benchmark, we aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems. The dataset and all source code is publicly released11https://github.com/HaochenZ11/IRef-VLA.

Abstract:
In medical laboratory environments, where pre-cision and safety are critical, the deployment of autonomous robots requires not only accurate object manipulation but also the ability to verify task success to comply with regulatory requirements. This paper introduces a novel imagination-enabled perception framework that integrates cognitive AI with semantic digital twins to allow medical robots to sim-ulate task outcomes, compare them with real-world results, and autonomously verify the success of their actions. Our approach addresses challenges related to handling small and transparent objects commonly found in sterility testing kits and other related consumables. By enhancing the RoboKudo perception system with parthood-based reasoning, we enable more accurate task verification through focused attention on object subparts. Experiments show that our system significantly improves performance compared to traditional object-centric methods, increasing accuracy in complex environments without the need for extensive retraining. This work demonstrates a novel concept in making robotic systems more adaptable and reliable for critical tasks in medical laboratories.

Abstract:
This paper establishes a new framework for repetitive control of uncertain robot manipulators via operator-theoretic robust stabilization. After applying the inverse dynamics approach to robot manipulators, by which the relevant nonlinear input/output behavior is converted to a linear time-invariant (LTI) equation, we take the repetitive control approach. Even though such a repetitive controller is known to achieve high performances for periodic reference inputs, it is quite difficult to derive the stability analysis for the resulting closed-loop systems in a rigorous fashion. To solve this difficulty, we construct an operator-theoretic approach to the repetitive control treatment, and show that the closed-loop systems are exponentially stable if and only if the spectrum radius of the relevant monodromy operator is less than 1. Based on the necessary and sufficient condition, we develop a guideline to take the relevant control parameters. Finally, some experiment results are given to demonstrate the overall arguments developed in this paper.

Abstract:
California blackworms constitute a recently identified animal system exhibiting unusual collective behaviors, in which dozens to thousands of worms entangle to form a “blob” capable of actions like locomotion as an aggregate. In this paper we describe a system of pneumatic soft robots inspired by the blackworms, intended for the study of collective behaviors enabled and mediated by such physical entanglement. Both the robots and worms have high aspect ratio (\gtrsim 1: 50), intertwine in complex 3D configurations, operate both in air and underwater, and can locomote both individually and as a collective. We demonstrate and characterize locomotion for both individual robots and entangled blobs, explore the tunability of entanglement strength, and compare these to the analogous versions in living worms. The robots provide a testbed for studying mechanisms underlying behaviors observed in worm blobs, as well as serving as a platform for studies of novel collective behaviors based on physical entanglement.

Abstract:
It is challenging to employ a quadruped robot for real-time mapping and positioning in a large range of scenes. The significant vibration and instability of the quadruped robot during mobility, as well as the quantity of computation required to convey a wide variety of complex landscapes, result in unsatisfactory drawing construction accuracy and inefficient real-time performance. Therefore, we propose an accurate robust spinning LiDAR SLAM (ARS-SLAM) algorithm for a quadruped robot under the large-scale scene. The tightly coupled iterative Kalman filter in FAST-LIO2 is introduced into the front end of the cartographer framework to improve the accuracy and robustness of robot pose estimation. To reduce the computational complexity of the original cartographer framework, a pose threshold optimization algorithm was introduced to effectively remove redundant information from loop detection and improve computational efficiency and real-time performance. We tested the system's performance against the most advanced point-cloud-based methods, LIO-SAM and FAST-LIO2, on a large dataset of large science parks and underground parking lots, and the results show that the proposed system achieves the same or better accuracy and real-time performance.

Abstract:
The autonomous flights of unmanned aerial vehicles (UAVs) in unknown environments have garnered significant attention. However, most existing methods only achieve safe navigation in static environments or spacious scenes with few moving obstacles. Motivated by this open problem, this paper presents a complete system for safe and autonomous UAVs flights in unknown clustered environments with multiple dynamic obstacles. To properly represent complex dynamic environments, we develop a 3D dynamic Euclidean Signed Distance Field (ESDF) mapping method that initially segments and tracks dynamic obstacles using a novel feature-based association strategy, while fusing the remaining static obstacles into ESDF map. Then, we propose a joint trajectory planning and motion control framework for safely avoiding surrounding obstacles. Specifically, the gradient-based B-spline trajectory optimization algorithm is employed to generate a collision-free static trajectory with respect to static obstacles. To avoid dynamic obstacles while adaptively tracking the static trajectory, we utilize time-adaptive model predictive control combined with Dynamic Control Barrier Function (D-CBF), which maps the collision avoidance constraints of dynamic obstacles onto the control inputs. Extensive simulated and real-world experiments confirm that our proposed method outperforms previous approaches for UAVs flights in challenging dynamic environments.

Abstract:
In this paper, we present a novel trajectory planning algorithm for cooperative manipulation with multiple quadrotors using control barrier functions (CBFs). Our approach addresses the complex dynamics of a system in which a team of quadrotors transports and manipulates a cable-suspended rigid-body payload in environments cluttered with obstacles. The proposed algorithm ensures obstacle avoidance for the entire system, including the quadrotors, cables, and the payload in all six degrees of freedom (DoF). We introduce the use of CBFs to enable safe and smooth maneuvers, effectively navigating through cluttered environments while accommodating the system's nonlinear dynamics. To simplify complex constraints, the system components are modeled as convex polytopes, and the Duality theorem is employed to reduce the computational complexity of the optimization problem. We validate the performance of our planning approach both in simulation and real-world environments using multiple quadrotors. The results demonstrate the effectiveness of the proposed approach in achieving obstacle avoidance and safe trajectory generation for cooperative transportation tasks.

Abstract:
This paper introduces a novel approach that integrates future closest point predictions into the distance constraints of a collision avoidance controller, leveraging convex hulls with closest point distance calculations. By addressing abrupt shifts in closest points, this method effectively reduces collision risks and enhances controller performance. Applied to an Image Guided Therapy robot and validated through simulations and user experiments, the framework demonstrates improved distance prediction accuracy, smoother trajectories, and safer navigation near obstacles.

Abstract:
In this work, we introduce an enhanced square-root information filter for visual-inertial odometry. This filter utilizes stochastic cloning, implemented via Gaussian elimination, to facilitate time offset calibration and feature anchor changes. By using single-precision numbers within the filter, we significantly reduce computational load and memory requirements. In addition, we employ a fast Mahalanobis distance test and block Householder triangulation to accelerate the calculations. To mitigate feature drift from frame-to-frame optical flow, we create keyframes at regular intervals and refine long-tracked features between them. We use affine optical flow to compensate for patch deformations induced by possible large spatial transformations between keyframes. An analytical approach to computing the affine transformation is proposed. Experiments conducted on real-world data show that the proposed method achieves state-of-the-art performance at a much faster speed.

Abstract:
Robotic grasping in scenes with transparent and specular objects presents great challenges for methods relying on accurate depth information. In this paper, we introduce NeuGrasp, a neural surface reconstruction method that leverages background priors for material-agnostic grasp detection. NeuGrasp integrates transformers and global prior volumes to aggregate multi-view features with spatial encoding, enabling robust surface reconstruction in narrow and sparse viewing conditions. By focusing on foreground objects through residual feature enhancement and refining spatial perception with an occupancy-prior volume, NeuGrasp excels in handling objects with transparent and specular surfaces. Extensive experiments in both simulated and real-world scenarios show that NeuGrasp outperforms state-of-the-art methods in grasping while maintaining comparable reconstruction quality. More details are available at https://neugrasp.github.io/.

Abstract:
We propose a framework for active next best view and touch selection for robotic manipulators using 3D Gaussian Splatting (3DGS). 3DGS is emerging as a useful explicit 3D scene representation for robotics, as it has the ability to represent scenes in a both photorealistic and geometrically accurate manner. However, in real-world, online robotic scenes where the number of views is limited given efficiency requirements, random view selection for 3DGS becomes impractical as views are often overlapping and redundant. We address this issue by proposing an end-to-end online training and active view selection pipeline, which enhances the performance of 3DGS in few-view robotics settings. We first elevate the performance of few-shot 3DGS with a novel semantic depth alignment method using Segment Anything Model 2 (SAM2) that we supplement with Pearson depth and surface normal loss to improve color and depth reconstruction of real-world scenes. We then extend FisherRF, a next-best-view selection method for 3DGS, to select views and touch poses based on depth uncertainty. We perform online view selection on a real robot system during live 3DGS training. We motivate our improvements to few-shot GS scenes, and extend depth-based FisherRF to them, where we demonstrate both qualitative and quantitative improvements on challenging robot scenes. For more information, please see our project page at arm.stanford.edu/next-best-sense.

Abstract:
Recent advances in robotics have underscored the critical role of colorized point clouds in enhancing environmental perception accuracy. However, conventional multisensor fusion Simultaneous Localization and Mapping (SLAM) systems typically employ all available images indiscriminately for point cloud colorization, resulting in suboptimal outcomes with blurred textures. Notably, achieving precise texture-togeometry alignment remains a challenge despite the availability of accurate pose estimation. This study introduces RISED, an advanced colorized mapping system that tackles this challenge from two perspectives: projection accuracy and distribution uniformity. For projection accuracy, we analyze the influence of camera poses on colorization and carefully select the optimal viewpoint to minimize errors. Regarding distribution uniformity, point cloud densification is applied to eliminate LiDAR scanning traces. Furthermore, a novel evaluation method is introduced to provide comprehensive assessment of colorized point clouds, filling a gap in this field. Experimental results show that our method outperforms traditional approaches in RGB-colorized mapping. Specifically, our method achieves notable improvements in projection accuracy (55.2 %), geometric accuracy (63.1 %), and surface coverage (30.8 %).

Abstract:
Robotic tasks often require multiple manipulators to enhance task efficiency and speed, but this increases complexity in terms of collaboration, collision avoidance, and the expanded state-action space. To address these challenges, we propose a multi-level approach combining Reinforcement Learning (RL) and Dynamic Movement Primitives (DMP) to generate adaptive, real-time trajectories for new tasks in dynamic environments using a demonstration library. This method ensures collision-free trajectory generation and efficient collaborative motion planning. We validate the approach through experiments in the PyBullet simulation environment with UR5e robotic manipulators. Project Website: https://sites.google.com/virginia.edu/oncoldmp/home

Affiliations: Faculty of Engineering, and Tyree Institute of Health Engineering (IHealthE) UNSW Sydney, Graduate School of Biomedical Engineering, Kensington Campus, NSW, Australia; More-Than-One Robotics Laboratory, Faculty of Sustainable Design Engineering, University of Prince Edward Island, Charlottetown, Canada; Department of Computer Science and Engineering, Advanced Robotics and Automation (ARA) Lab, University of Nevada, Reno, NV, USA; Soft Haptics Laboratory, School of Materials Science, Japan Advanced Institute of Science and Technology, Kawaguchi, Saitama, Japan

Abstract:
While wearable robots that utilize intrinsically soft materials for actuation offer enhanced safety and biological compatibility, the challenges of sensing and control significantly affect their performance. The control problem in such systems is inherently complex, and the inclusion of 'softness' introduces additional nonlinearities, hysteresis, and uncertainties. Furthermore, the effectiveness of control strategies is highly dependent on sensor selection and integration, which presents its own challenges. Most robotic systems require separate sensors for control purposes. In this study, a new sensing and control scheme are introduced for soft wearable robots, leveraging the intrinsic soft-sensing capability of fluidic filament actuators without adding computational complexity. This method enables simultaneous sensing and actuation with \mathbf9 6 % position accuracy, even under physical disturbances. This approach is demonstrated with a soft assistive device for elbow flexion/extension, achieving 70.5% tracking accuracy and a 0.09s response delay to human intention, ensuring the system provides minimal resistance when assistance is not needed, while delivering the required support when necessary.

Abstract:
Autonomous driving necessitates the ability to reason about future interactions between traffic agents and to make informed evaluations for planning. This paper introduces the Gen-Drive framework, which shifts from the traditional prediction and deterministic planning framework to a generation-then-evaluation planning paradigm. The framework employs a behavior diffusion model as a scene generator to produce diverse possible future scenarios, thereby enhancing the capability for joint interaction reasoning. To facilitate decision-making, we propose a scene evaluator (reward) model, trained with pairwise preference data collected through VLM assistance, thereby reducing human workload and enhancing scalability. Furthermore, we utilize an RL fine-tuning framework to improve the generation quality of the diffusion model, rendering it more effective for planning tasks. We conduct training and closed-loop planning tests on the nuPlan dataset, and the results demonstrate that employing such a generation-then-evaluation strategy outperforms other learning-based approaches. Additionally, the fine-tuned generative driving policy shows significant enhancements in planning performance. We further demonstrate that utilizing our learned reward model for evaluation or RL fine-tuning leads to better planning performance compared to relying on human-designed rewards. Project website: https://mczhi.github.io/GenDrive.

Abstract:
This paper presents a mixed traffic control policy designed to optimize traffic efficiency across diverse road topologies, addressing issues of congestion prevalent in urban environments. A model-free reinforcement learning (RL) approach is developed to manage large-scale traffic flow, using data collected by autonomous vehicles to influence human-driven vehicles. A real-world mixed traffic control benchmark is also released, which includes 444 scenarios from 20 countries, representing a wide geographic distribution and covering a variety of scenarios and road topologies. This benchmark serves as a foundation for future research, providing a realistic simulation environment for the development of effective policies. Comprehensive experiments demonstrate the effectiveness and adaptability of the proposed method, achieving better performance than existing traffic control methods in both intersection and roundabout scenarios. To the best of our knowledge, this is the first project to introduce a real-world complex scenarios mixed traffic control benchmark. Videos and code of our work are available at https://sites.google.com/berkeley.edu/mixedtrafficplus/home

Abstract:
Autonomous vehicles (AVs) systems are envisioned to revolutionize our life by providing safe, relaxing, and convenient ground transportation. To ensure safety, AV systems need to make timely driving decisions in response to complicated and highly dynamic real-world driving environments. We present a systematic study to understand the causes of tail latency in AV systems and their impact on safety. We empirically analyze the design of two open-source industrial AV systems, Baidu Apollo and Autoware. We explore how pipelined computation design (such as module dependency and execution patterns), traffic factors (surrounding environments of AV), and system factors (such as cache contention) impact AV systems' tail latency. Inspired by the insights, We propose a set of systematic designs that lead to performance and safety improvements of up to 1.65 × and 14 ×, respectively.

Abstract:
Optical coherence tomography (OCT) has been widely used for high-fidelity biological tissue scanning but is traditionally limited to small lateral fields of view that preclude large-area scanning. To overcome this problem, we propose an integration of an OCT sensor to a 6-DOF robot arm end-effector combined with a geometry-aware stitching model for surface and volumetric data stitching. We firstly develop a simple but efficient Robot-OCT calibration method by using a three-marker calibration pattern and implement an optimization solver. Given a pre-defined trajectory, a local planner is developed to update the sensor pose by using the OCT point cloud information in order to maintain the effective imaging depth based on the distance and orientation constraints. The system calibration method is verified through repeated experiments with the three-marker targets and the result shows an average testing error of 0.132 \pm 0.071 ~\textmm. The geometry-aware OCT stitching framework is demonstrated based on the experiments of different scanning trajectories and 3D-printed phantoms for large-area scanning. The OCT stitched point cloud is compared with the ground truth from the phantom CAD model and the result show an average surface alignment error of 0.441 \pm 0.241 ~\textmm for the path following tasks.

Abstract:
In this paper, we introduce advances in the sensor suite of an autonomous flying insect robot (FIR) weighing less than a gram. FIRs, because of their small weight and size, offer unparalleled advantages in terms of material cost and scalability. However, their size introduces considerable control challenges, notably high-speed dynamics, restricted power, and limited payload capacity. While there have been advancements in developing lightweight sensors, often drawing inspiration from biological systems, no sub-gram aircraft has been able to attain sustained hover without relying on feedback from external sensing such as a motion capture system. The lightest vehicle capable of sustained hovering-the first level of “sensor autonomy”-is the much larger 28 g Crazyflie. Previous work reported a reduction in size of that vehicle's avionics suite to 187 mg and 21 mW. Here, we report a further reduction in mass and power to only 78.4 mg and 15 mW. We replaced the laser rangefinder with a lighter and more efficient pressure sensor, and built a smaller optic flow sensor around a global-shutter imaging chip. A Kalman Filter (KF) fuses these measurements to estimate the state variables that are needed to control hover: pitch angle, translational velocity, and altitude. Our system achieved performance comparable to that of the Crazyflie's estimator while in flight, with root mean squared errors of 1.573 \textdeg, 0.186 \mathrmm / \mathrms, and 0.136 m, respectively, relative to motion capture.

Abstract:
Control Barrier Functions (CBFs) have become powerful tools for ensuring safety in nonlinear systems. How-ever, finding valid CBFs that guarantee persistent safety and feasibility remains an open challenge, especially in systems with input constraints. Traditional approaches often rely on manually tuning the parameters of the class K functions of the CBF conditions a priori. The performance of CBF-based controllers is highly sensitive to these fixed parameters, potentially leading to overly conservative behavior or safety violations. To overcome these issues, this paper introduces a learning-based optimal control framework for online adaptation of Input Constrained CBF (ICCBF) parameters in discrete-time nonlinear systems. Our method employs a probabilistic ensemble neural network to predict the performance and risk metrics, as defined in this work, for candidate parameters, accounting for both epistemic and aleatoric uncertainties. We propose a two-step verification process using Jensen-Rényi Divergence and distributionally-robust Conditional Value at Risk to identify valid parameters. This enables dynamic re-finement of ICCBF parameters based on current state and nearby environments, optimizing performance while ensuring safety within the verified parameter set. Experimental results demonstrate that our method outperforms both fixed-parameter and existing adaptive methods in robot navigation scenarios across safety and performance metrics. [Project Page]11Project page: https://www.taekyung.me/online-adaptive-cbf [Code] [Video]

Abstract:
The development of a multimodal fusion technique utilizing LiDAR-camera data has enabled precise 3D object detection for self-driving vehicles, particularly in ideal conditions with clear weather. Nevertheless, adverse weathers such as fog, snow, and rain remain a challenge for existing multimodal methods. These conditions lead to a reduced density of point clouds as a result of laser signal occlusion and attenuation. Additionally, as the distance grows, the point cloud becomes sparser, further challenging object detection tasks. To address these problems, we introduce a point reconstruction network employing equirectangular projection tailored for multimodal 3D object detection. This network incorporates a range-constrained noise filter to remove noise caused by adverse weather and an object-centric point generator designed to flexibly generate points for distant objects. Moreover, we propose a dual 2D auxiliary module to enhance image features and support the point reconstruction. Experimental evaluations conducted on adverse weather datasets demonstrate that the suggested approach surpasses current techniques. The implementation can be accessed at https://github.com/jhyoon964/oprnet.

Abstract:
Preventing cross-infection is crucial for robots designed to assist medical staff in isolation wards during outbreaks of infectious diseases like COVID-19. This paper proposes a modular robotic system with a working platform and a mobile base to prevent cross-infection during item delivery and waste transport. An alignment structure for combining the two platforms is introduced, and a marker map and barcode-based destination input system were developed to allow medical staff without specialized robotics knowledge to use the system without additional training. The effectiveness of this robot's service was evaluated through a System Usability Scale (SUS) test with twenty medical staff working in isolation wards, achieving an average score of 77.12. This indicates a high level of usability, suggesting that this robot can significantly contribute to safe and efficient hospital operations during pandemic situations.

Abstract:
This paper introduces a novel motion planning algorithm designed for a team of curvature-constrained tethered robots operating on sloped 3D terrains. Our approach addresses the critical issues of tether-terrain interaction, robot stability, and tether entanglement avoidance. The study focuses on a two-robot system, where stability is primarily dependent on tether tension, which is in turn limited by wheel traction. We propose a path-planning method that strategically utilizes terrain features (e.g., rocks) to augment tether tension through additional friction, thereby enhancing overall system stability. Our algorithm employs a modified tangent graph as the underlying structure for a hybrid A search, incorporating stability constraints throughout the planning process. The proposed method is extensively evaluated through various simulation experiments, demonstrating its effectiveness in planning safe and efficient paths.

Abstract:
Exoskeletons are being adopted as assistive devices in industries such as manufacturing, logistics, and construction, aimed at reducing musculoskeletal loads in workers. Presently, their design process assumes the user to be quasi-static, optimizing the design parameters for reduction of human joint torques followed by fine-tuning through usability studies and physical prototyping. We present a method for optimizing passive exoskeleton designs before the physical prototyping stage for muscle effort reduction in dynamic tasks such as arm reaching and walking. We employ fast MuJoCo-based simulations of human biomechanics to compute the joint torques, muscle forces and muscle activations while executing task trajectories using pre-trained reinforcement learning models from the literature. We train another set of reinforcement learning models that minimize joint torques and muscle effort rates by varying the exoskeleton's design parameters online during the task motions. Baselines for comparison include the default designs of shoulder and walking assist exoskeletons from the literature, and designs obtained through conventional optimization techniques. In terms of muscle effort rates, the RL-based designs improved upon these baselines by an average of 3.42% and 1.96% respectively in the arm reaching task, and 6.28% and 5.81% in the walking task. Our method can be adapted to evaluate exoskeletons in real-time through motion capture, and for muscle-aware online control of powered exoskeletons.

Abstract:
Visual place recognition (VPR) enables autonomous robots to identify previously visited locations, which contributes to tasks like simultaneous localization and mapping (SLAM). VPR faces challenges such as accurate image neighbor retrieval and appearance change in scenery. Event cameras, also known as dynamic vision sensors, are a new sensor modality for VPR and offer a promising solution to the challenges with their unique attributes: high temporal resolution (1MHz clock), ultra-low latency (in μs), and high dynamic range (>120dB). These attributes make event cameras less susceptible to motion blur and more robust in variable lighting conditions, making them suitable for addressing VPR challenges. However, the scarcity of event-based VPR datasets, partly due to the novelty and cost of event cameras, hampers their adoption. To fill this data gap, our paper introduces the NYC-Event-VPR dataset to the robotics and computer vision communities, featuring the Prophesee IMX636 HD event sensor (1280x720 resolution), combined with RGB camera and GPS module. It encompasses over 13 hours of geotagged event data, spanning 260 kilometers across New York City, covering diverse lighting and weather conditions, day/night scenarios, and multiple visits to various locations. Furthermore, our paper employs three frameworks to conduct generalization performance assessments, promoting innovation in event-based VPR and its integration into robotics applications.

Abstract:
Learning to manipulate objects efficiently, particularly those involving sustained contact (e.g., pushing, sliding) and articulated parts (e.g., drawers, doors), presents significant challenges. Traditional methods, such as robot-centric reinforce-ment learning (RL), imitation learning, and hybrid techniques, require massive training and often struggle to generalize across different objects and robot platforms. We propose a novel framework for learning object-centric manipulation policies in force space, decoupling the robot from the object. By directly applying forces to selected regions of the object, our method simplifies the action space, reduces unnecessary exploration, and decreases simulation overhead. This approach, trained in simulation on a small set of representative objects, captures ob-ject dynamics—such as joint configurations—allowing policies to generalize effectively to new, unseen objects. Decoupling these policies from robot-specific dynamics enables direct transfer to different robotic platforms (e.g., Kinova, Panda, URS) with-out retraining. Our evaluations demonstrate that the method significantly outperforms baselines, achieving over an order of magnitude improvement in training efficiency compared to other state-of-the-art methods. Additionally, operating in force space enhances policy transferability across diverse robot plat-forms and object types. We further showcase the applicability of our method in a real-world robotic setting. Link: https://tufts-ai-robotics-group.github.io/FLEX/

Abstract:
In the realms of autonomous driving and robotics, radar sensors are garnering growing interest. Scene understanding is crucial for the safe navigation of autonomous systems. Panoptic segmentation and tracking tasks enable the dynamic, semantic multilevel description of the environment surrounding vehicles and different instances. However, previous panoptic segmentation and tracking methods have primarily focused on LiDAR. To tackle the complex challenge of panoptic segmentation and tracking for radar data, we introduce RadarMask, an innovative method that addresses this issue for the first time within the radar domain. Our approach is end-to-end, requiring no post-processing. We also introduce simple and effective point cloud feature modules and target motion estimation modules tailored to the unique characteristics of radar points. Finally, we demonstrate the effectiveness of our algorithm on the RadarScenes dataset, achieving state-of-the-art performance in comparisons. The implementation of our method can be found at: https://github.com/yb-guo/RadarMask.

Abstract:
The precision of industrial robots is often limited by the relatively low stiffness of their joints, leading to positioning errors influenced by factors such as the mass and inertia of robotic links, external forces, and the counterbalance system (CBS). Counterbalance systems, typically consisting of hydropneumatic cylinders, are designed to reduce motor torque and assist in supporting heavier links. Traditionally, positioning errors in industrial robots have been corrected statically by determining pose-dependent stiffness values. However, recent numerical models incorporate inertial effects to improve positioning error correction, making accurate inertial parameter identification essential. These parameters are typically unknown and must be determined experimentally. While methodologies for inertial parameter estimation have been extensively studied, none have accounted for the effect of the counterbalance system in this process. To address this gap, a methodology for estimating inertial parameters was applied to a heavy industrial robot, considering the influence of the counterbalance system. A comparative analysis with and without the counterbalance system showed that its inclusion improved joint torque calculation accuracy, showing the necessity of considering it in dynamic parameter characterization methodologies.

Abstract:
This paper presents FLAF, a focal line and feature-constrained active view planning method for autonomous orientation adjustment of a rotatable active camera during mobile robot navigation. FLAF is built on a visual teach-and-repeat (VT&R) system, which enables robots to cruise various paths that fulfill many daily autonomous navigation requirements. The VT&R system integrates Visual Simultaneous Localization and Mapping (VSLAM) with trajectory following. However, tracking failures in feature-based VSLAM, particularly in textureless regions common in human-made environments, poses a significant challenge to real-world VT&R deployment. To address this, the proposed view planner is integrated into a feature-based VSLAM system, creating an active camerabased VSLAM (AC-SLAM) solution that mitigates tracking failures. Our system features a Pan-Tilt Unit (PTU)-based active camera mounted on a mobile robot. FLAF actively directs the camera toward more map points during path learning and toward more feature-identifiable map points while following the learned trajectory. Using FLAF, the AC-SLAM system constructs a complete path map during teaching and maintains stable localization during repeating. Experimental results in real scenarios show that FLAF significantly outperforms existing methods by accounting for feature identifiability, particularly the view angle of the features. During effectively dealing with low-texture regions in active view planning, considering feature identifiability enables our active VT&R system to perform well in challenging environments.

Abstract:
Information gathering focuses on designing strategies for a robot to collect data about a physical process, aiming for accurate field reconstruction. While many recent methods have been proposed to address this problem, they often assume the model of the physical process is a priori known and stationary-assumptions that rarely hold in practice. This paper presents a novel informative motion planning approach for online information gathering of a non-stationary Gaussian process. Our approach comprises two key components: an informative path planner that explores the physical field and an adaptive velocity planner that adjusts the robot's velocity profile exploiting the field's spatial variability. Additionally, we propose a path smoothing and tracking strategy to ensure continuous robot motion. Extensive simulations on a bathymetric mapping task demonstrate the effectiveness of our approach, showing superior performance in reconstructing non-stationary physical fields compared to several baseline methods.

Abstract:
Social robots have been extensively studied in recent decades, with many researchers exploring the use of modalities such as facial expressions to achieve more natural emotions in robots. Various methods have been attempted to generate and express robot emotions, including computational models that define an affect space and show dynamic emotion changes. However, the implementation of multimodal expression in previous models is ambiguous, and the generation of emotions in response to stimuli relies on heuristic methods. In this paper, we present a framework that enables robots to naturally express their emotions in a multimodal way, where the emotion can change over time based on the given stimulus values. By representing the robot's emotion as a position in an affect space of a computational emotion model, we consider the given stimuli values as driving forces that can shift the emotion position dynamically. In order to examine the feasibility of our proposed method, a mobile robot prototype was implemented that can recognize touch and express different emotions with facial expressions and movements. The experiment demonstrated that the emotion elicited by a given stimulus is contingent upon the robot's previous state, thereby imparting the impression that the robot possesses a distinctive emotion model. Furthermore, the Godspeed survey results indicated that our model was rated significantly higher than the baseline, which did not include a computational emotion model, in terms of anthropomorphism, animacy, and perceived intelligence. Notably, the unpredictability of emotion switching contributed to a perception of greater lifelikeness, which in turn enhanced the overall interaction experience.

Abstract:
Teleoperated humanoid robot systems have made substantial advancements in recent years, offering a physical avatar that harnesses human skills and decision-making while safeguarding users from hazardous environments. However, current telelocomotion interfaces often fail to accurately represent the robot's environment, limiting the user's ability to effectively navigate the robot through unstructured terrain. This paper presents an initial telelocomotion framework that integrates the ForceBot locomotion interface with the small-sized humanoid robot, HECTOR V2. The framework utilizes ForceBot to simulate walking motion and estimate the user's Center of Mass (CoM) trajectory, which serves as a tracking reference for the robot. On the robot side, a model predictive control (MPC) approach, based on a reduced-order single rigid body model, is employed to track the user's scaled trajectory. We present experimental results on ForceBot's CoM estimation and the robot's tracking performance, demonstrating the feasibility of this approach.

Abstract:
Mobile robots operating indoors must be prepared to navigate challenging scenes that contain transparent surfaces. This paper proposes a novel method for the fusion of acoustic and visual sensing modalities through implicit neural representations to enable dense reconstruction of transparent surfaces in indoor scenes. We propose a novel model that leverages generative latent optimization to learn an implicit representation of indoor scenes consisting of transparent surfaces. We demonstrate that we can query the implicit representation to enable volumetric rendering in image space or 3D geometry reconstruction (point clouds or mesh) with transparent surface prediction. We evaluate our method's effectiveness qualitatively and quantitatively on a new dataset collected using a custom, low-cost sensing platform featuring RGB-D cameras and ultrasonic sensors. Our method exhibits significant improvement over state-of-theart for transparent surface reconstruction. Website and Dataset: https://umfieldrobotics.github.io/VAIR_site/

Abstract:
Event data has recently emerged as a valuable complement to object tracking, offering dense temporal resolution and a high dynamic range. However, existing RGB-Event trackers struggle with targets exhibiting complex motion trajectories, where RGB features alone fail to provide sufficient discrimination. To address this, we propose EventTPT, an innovative RGB-Event tracking framework that leverages pivotal prompts embedded in historical trajectories for enhanced tracking. Specifically, EventTPT integrates the trajectories of multiple adjacent frames into a single event image using a time-weighted aggregation and subsequently inputs this as a visual prompt into the tracker for current frame locating. A cross-modal adaptive fusion module is further designed for object perception in scenarios with photometric inconsistency. Additionally, we introduce EventUAV, a novel and challenging RGB-Event tracking benchmark featuring objects with intricate motion dynamics and poor visibility in RGB-only modalities. Extensive experiments demonstrate that EventTPT surpasses state-of-the-art trackers on EventUAV and achieves competitive performance on other benchmarks (e.g., COESOT and VisEvent), underscoring its strong generalizability and robustness for resilient robotic vision systems. The code can be found at https://github.com/xiawenhao2022/EventTPT.

Abstract:
We tackle the challenges of decentralized multi-robot navigation in environments with nonconvex obstacles, where complete environmental knowledge is unavailable. While reactive methods like Artificial Potential Field (APF) offer simplicity and efficiency, they suffer from local minima, causing robots to become trapped due to their lack of global environmental awareness. Other existing solutions either rely on inter-robot communication, are limited to single-robot scenarios, or struggle to overcome nonconvex obstacles effectively. Our proposed methods enable collision-free navigation using only local sensor and state information without a map. By incorporating a wall-following (WF) behavior into the APF approach, our method allows robots to escape local minima, even in the presence of nonconvex and dynamic obstacles including other robots. We introduce two algorithms for switching between APF and WF: a rule-based system and an encoder network trained on expert demonstrations. Experimental results show that our approach achieves substantially higher success rates compared to state-of-the-art methods, highlighting its ability to overcome the limitations of local minima in complex environments.

Abstract:
Optical tweezers (OT) offer unparalleled capabilities for micromanipulation with submicron precision in biomedical applications. However, controlling conventional multi-trap OT to achieve cooperative manipulation of multiple complexshaped microrobots in dynamic environments poses a significant challenge. To address this, we introduce Interactive OT Gym, a reinforcement learning (RL)-based simulation platform designed for OT-driven microrobotics. Our platform supports complex physical field simulations and integrates haptic feedback interfaces, RL modules, and context-aware shared control strategies tailored for OT-driven microrobot in cooperative biological object manipulation tasks. This integration allows for an adaptive blend of manual and autonomous control, enabling seamless transitions between human input and autonomous operation. We evaluated the effectiveness of our platform using a cell manipulation task. Experimental results show that our shared control system significantly improves micromanipulation performance, reducing task completion time by approximately 67% compared to using pure human or RL control alone and achieving a 100% success rate. With its high fidelity, interactivity, low cost, and high-speed simulation capabilities, Interactive OT Gym serves as a user-friendly training and testing environment for the development of advanced interactive OT-driven micromanipulation systems and control algorithms. For more details on the project, please see our website https://sites.google.com/view/otgym

Abstract:
Tactile sensing plays a vital role in enabling robots to perform fine-grained, contact-rich tasks. However, the high dimensionality of tactile data, due to the large coverage on dexterous hands, poses significant challenges for effective tactile feature learning, especially for 3D tactile data, as there are no large standardized datasets and no strong pretrained backbones. To address these challenges, we propose a novel canonical representation that reduces the difficulty of 3D tactile feature learning and further introduces a force-based selfsupervised pretraining task to capture both local and net force features, which are crucial for dexterous manipulation. Our method achieves an average success rate of 78% across four fine-grained, contact-rich dexterous manipulation tasks in realworld experiments, demonstrating effectiveness and robustness compared to other methods. Further analysis shows that our method fully utilizes both spatial and force information from 3D tactile data to accomplish the tasks. The videos can be viewed at https://3dtacdex.github.io.

Abstract:
Deformable object manipulation (DOM) which is a common subtask in various surgical procedures represents an inevitable challenge in robot-assisted surgery (RAS) due to complex nonlinear deformation. This paper proposes a deep reinforcement learning guided adaptive control (RLAC) modelfree framework, which combines learning-based and Jacobianbased methods. To complement each other for optimized performance, we harness the sampling of deep reinforcement learning (DRL) policy explored in simulations to solve a reasonable estimation of the initial deformation Jacobian. In early control iterations, the actions suggested by the DRL agent are adopted until the estimated real-time Jacobian approximates the actual deformation model. Subsequently, the independent Jacobianbased adaptive control (AC) with sufficient initial deformation awareness begins execution to achieve precise internal feature manipulation on deformable objects. Experimental results demonstrate that our method enables more efficient positioning and exhibits near-optimal positioning paths. RLAC with robust sim-to-real performance provides a feasible approach for the complex autonomous DOM in the real world.

Abstract:
We present BehAV, a novel approach for autonomous robot navigation in outdoor scenes guided by human instructions and leveraging Vision Language Models (VLMs). Our method interprets human commands using a Large Language Model (LLM), and categorizes the instructions into navigation and behavioral guidelines. Navigation guidelines consist of directional commands (e.g., “move forward until“) and associated landmarks (e.g., “the building with blue windows”), while behavioral guidelines encompass regulatory actions (e.g., “stay on“) and their corresponding objects (e.g., “pavements“). We use VLMs for their zero-shot scene understanding capabilities to estimate landmark locations from RGB images for robot navigation. Further, we introduce a novel scene representation that utilizes VLMs to ground behavioral rules into a behavioral cost map. This cost map encodes the presence of behavioral objects within the scene and assigns costs based on their regulatory actions. The behavioral cost map is integrated with a LiDAR-based occupancy map for navigation. To navigate outdoor scenes while adhering to the instructed behaviors, we present an unconstrained Model Predictive Control (MPC)based planner that prioritizes both reaching landmarks and following behavioral guidelines. We evaluate the performance of BehAV on a quadruped robot across diverse real-world scenarios, demonstrating a 22.49 % improvement in alignment with human-teleoperated actions, as measured by Fréchet distance, and achieving a 40 % higher navigation success rate compared to state-of-the-art methods.

Affiliations: Digital and High-Frequency Electronics, Institute of Mechanics and Electronics, University of Southern Denmark, Odense, Denmark; Robotics and Mechatronics, Faculty of Electrical Engineering, Mathematics & Computer Science, University of Twente, Enschede, Netherlands; CNRS, Univ Rennes, Inria Campus de Beaulieu, Rennes, Cedex, France; Department of Computer, Control and Management Engineering, Sapienza University of Rome, Rome, Italy

Abstract:
This paper proposes a method to robustify model predictive path integral (MPPI) control by directly taking into account the effects of parameter uncertainty into the controller formulation. Leveraging the recent notion of closed-loop state sensitivity, the proposed MPPI can consider the state sensitivity against parameter mismatch as a part of the system state, and consequently exploit this additional information to address the challenge of model mismatch in sampling-based model predictive control. Using an obstacle avoidance scenario, we demonstrate the use of our approach to control an aerial robot. We present an embedded implementation of our method, utilizing parallelization of computations on a GPU. Finally, we show the increased robustness of our approach over a standard MPPI controller through hardware-in-the-loop simulations and validate its embedded real-time properties.

Abstract:
3D lane detection aims to identify lane categories and trends in 3D space, which is a vital and challenging task in autonomous driving. Existing methods introduce various priors to guide 3D lane prediction, which generally consist of a series of reference points for context aggregation. However, due to the misalignment between these reference points and the lanes, it is difficult to obtain complete and discriminative context for complex instances. In this paper, we are devoted to introducing 3D priors adaptive to lane appearances, which serve as references to aggregate the lane context. Specifically, we propose a projection-consistent reference generation strategy to keep the projected 3D reference points geometrically consistent with the corresponding lanes in images. In addition, a segmentation-lifting denoising strategy is designed to improve the ability of the model to map the lane segmentation into 3D space. To leverage more lane-related information, we propose a decoupled lane-context aggregation module by considering the perspectives of individual geometries and integrated layout, namely intra-lane and inter-lane context. Extensive experiments on the OpenLane dataset show that our approach outperforms previous methods and achieves the state-of-the-art performance. The code will be made publicly available.

Abstract:
Despite the growing interest in Behavioural Cloning for robots, few existing research has explicitly explored the impact of user interfaces on the effectiveness of expert demonstrations. We investigate the importance of user interface design in Behavioural Cloning, highlighting the critical role that interfaces play in conveying human demonstrations and robotics capabilities. This study compares the effectiveness of first and third-person perspective interfaces for robot shoe-lacing, a highly dexterous, bi-manual manipulation task that involves deformable objects and requires high precision. Our study highlights the importance of considering the impact of interface design on expert demonstration quality in Behavioural Cloning applications. By providing a first-person perspective, we observed significant differences in demonstration execution time and consistency compared to the third-person perspective. These findings suggest that the choice of interface can influence the quality of expert demonstrations, which in turn affects the performance of learning algorithms.

Abstract:
In real-world field operations, aerial grasping systems face significant challenges in dynamic environments due to strong winds, shifting surfaces, and the need to handle heavy loads. Particularly when dealing with heavy objects, the powerful propellers of the drone can inadvertently blow the target object away as it approaches, making the task even more difficult. To address these challenges, we introduce SPI- BOT, a novel drone-tethered mobile gripper system designed for robust and stable autonomous target retrieval. SPIBOT operates via a tether, much like a spider, allowing the drone to maintain a safe distance from the target. To ensure both stable mobility and secure grasping capabilities, SPIBOT is equipped with six legs and sensors to estimate the robot's and mission's states. It is designed with a reduced volume and weight compared to other hexapod robots, allowing it to be easily stowed under the drone and reeled in as needed. Designed for the 2024 MBZIRC Maritime Grand Challenge, SPIBOT is built to retrieve a 1kg target object in the highly dynamic conditions of the moving deck of a ship. This system integrates a real-time action selection algorithm that dynamically adjusts the robot's actions based on proximity to the mission goal and environmental conditions, enabling rapid and robust mission execution. Experimental results across various terrains, including a pontoon on a lake, a grass field, and rubber mats on coastal sand, demonstrate SPIBOT's ability to efficiently and reliably retrieve targets. SPIBOT swiftly converges on the target and completes its mission, even when dealing with irregular initial states and noisy information introduced by the drone.

Abstract:
Drawing inspiration from Maslow's “hierarchy of needs”, this paper develops a real-time control synthesis framework for robots to address hierarchical, complex objectives, recognizing that their behaviors are inherently driven by underlying needs. Each need is encoded by the zero-superlevel set of a control barrier function (CBF), which can be time-varying, and all the needs at the same level in a hierarchy are composed into a single one through Boolean compositions of the corresponding CBFs. The effectiveness of the proposed framework is demonstrated through a hypothetical interstellar exploration mission using laboratory robots, and novel results on nonsmooth CBF and time-varying CBF are derived.

Abstract:
Cooperative manipulation tasks impose various structure-, task-, and robot-specific constraints on mobile manip-ulators. However, current methods struggle to model and solve these myriad constraints simultaneously. We propose a twofold solution: first, we model constraints as a family of manifolds amenable to simultaneous solving. Second, we introduce the constrained nonlinear Kaczmarz (cNKZ) projection technique to produce constraint-satisfying solutions. Experiments show that cNKZ dramatically outperforms baseline approaches, which cannot find solutions at all. We integrate cNKZ with a sampling-based motion planning algorithm to generate complex, coordi-nated motions for 3–6 mobile manipulators (18–36 DoF), with cNKZ solving up to 80 nonlinear constraints simultaneously and achieving up to a 92% success rate in cluttered environments. We also demonstrate our approach on hardware using three Thrtlebot3 Waffle Pi robots with OpenMANIPULATOR-X arms.

Abstract:
Perception systems in robotic vehicles are crucial for safe and efficient operation, providing key state estimates necessary for planning and control. However, these systems are increasingly vulnerable to perception-based attacks, such as odometry spoofing, position spoofing, obstacle hiding, and object misclassification, which can lead to catastrophic failures. In this paper, we propose a novel approach to detect perception-based attacks by modeling inconsistencies between the physical and estimated states of the robot. Our approach offers a unified methodology for detecting different types of attacks with high accuracy and minimal computational overhead. We validate our method through extensive simulations and real-world scenarios, achieving a 99.5% success rate in detecting attacks, while maintaining a low latency (within 100ms).

Abstract:
Recently, many humanoid robots have been increasingly deployed in various facilities, including hospitals and assisted living environments, where they are often remotely controlled by human operators. Their kinematic redundancy enhances reachability and manipulability, enabling them to navigate complex, cluttered environments and perform a wide range of tasks. However, this redundancy also presents significant control challenges, particularly in coordinating the movements of the robot's macro-micro structure (torso and arms). Therefore, we propose various human-robot collaborative (HRC) methods for coordinating the torso and arm of remotely controlled mobile humanoid robots, aiming to balance autonomy and human input to enhance system efficiency and task execution. The proposed methods include human-initiated approaches, where users manually control torso movements, and robot-initiated approaches, which autonomously coordinate torso and arm based on factors such as reachability, task goal, or inferred human intent. We conducted a user study with \mathbfN \boldsymbol= \mathbf1 7 participants to compare the proposed approaches in terms of task performance, manipulability, and energy efficiency, and analyzed which methods were preferred by participants.

Abstract:
Grasp planning must consider an object's local geometry (at the finger contacts), for the range of applicable wrenches under friction, and its global geometry, for force closure and grasp quality. Most everyday objects have curved surfaces unamenable to a pure combinatorial approach but treatable with tools from differential geometry. Our idea is to “discretize” such a surface in a top-down fashion into elementary patches (e-patches), each consisting of points that would yield close enough wrenches. Preprocessing based on Gaussian curvature decomposes the surface into strictly convex, strictly concave, ruled, and saddle patches. The Gauss map guides the subdivision of any patch with a large variation in the contact force direction, with the aid of a Platonic solid. The principal component analysis (PCA) further subdivides any patch that has a large variation in torque. The final structure is called a patch tree, which stores e-patches at its leaves, and force or torque ranges at its internal nodes. Grasp synthesis and optimization operates on the patch tree with a stack to efficiently prune away non-promising finger placements. Simulation and experiment with a Shadow Hand have been conducted over everyday items. The patch tree exhibits different levels of surface granularity. It has a good promise for efficient planning of finger gaits to carry out grasping and tool manipulation.

Abstract:
Recent advancements in 3D Gaussian Splatting have significantly improved the efficiency and quality of dense semantic SLAM. However, previous methods are generally constrained by limited-category pre-trained classifiers and implicit semantic representation, which hinder their performance in open-set scenarios and restrict 3D object-level scene understanding. To address these issues, we propose OpenGS-SLAM, an innovative framework that utilizes 3D Gaussian representation to perform dense semantic SLAM in open-set environments. Our system integrates explicit semantic labels derived from 2D foundational models into the 3D Gaussian framework, facilitating robust 3D object-level scene understanding. We introduce Gaussian Voting Splatting to enable fast 2D label map rendering and scene updating. Additionally, we propose a Confidence-based 2D Label Consensus method to ensure consistent labeling across multiple views. Furthermore, we employ a Segmentation Counter Pruning strategy to improve the accuracy of semantic scene representation. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method in scene understanding, tracking, and mapping, achieving 10× faster semantic rendering and 2× lower storage costs compared to existing methods. Project page: https://young-bit.github.io/opengs-github.github.io/.

Abstract:
Robots in dynamic environments need fast, accurate models of how objects move in their environments to support agile planning. In sports such as ping pong, analytical models often struggle to accurately predict ball trajectories with spins due to complex aerodynamics, elastic behaviors, and the challenges of modeling sliding and rolling friction. On the other hand, despite the promise of data-driven methods, machine learning struggles to make accurate, consistent predictions without precise input. In this paper, we propose an end-to-end learning framework that can jointly train a dynamics model and a factor graph estimator. Our approach leverages a Gram-Schmidt (GS) process to extract roto-translational invariant representations to improve the model performance, which can further reduce the validation error compared to data augmentation method. Additionally, we propose a network architecture that enhances nonlinearity by using self-multiplicative bypasses in the layer connections. By leveraging these novel methods, our proposed approach predicts the ball's position with an RMSE of 37.2 mm at the apex after the first bounce, and 71.5 mm after the second bounce.

Abstract:
Most prior motion prediction endeavors in autonomous driving have inadequately encoded future scenarios, leading to predictions that may fail to accurately capture the diverse movements of agents (e.g., vehicles or pedestrians). To address this, we propose FutureNet, which explicitly integrates initially predicted trajectories into the future scenario and further encodes these future contexts to enhance subsequent forecasting. Additionally, most previous motion forecasting works have focused on predicting independent futures for each agent. However, safe and smooth autonomous driving requires accurately predicting the diverse future behaviors of numerous surrounding agents jointly in complex dynamic environments. Given that all agents occupy certain potential travel spaces and possess lane driving priority, we propose Lane Occupancy Field (LOF), a new representation with lane semantics for motion forecasting in autonomous driving. LOF can simultaneously capture the joint probability distribution of all road participants' future spatial-temporal positions. Due to the high compatibility between lane occupancy field prediction and trajectory prediction, we propose a novel network for joint prediction of these two tasks. Our approach ranks 1st on two large-scale motion forecasting benchmarks: Argoverse 1 and Argoverse 2, while it is also the champion method of the CVPR 2024 Argoverse 2 motion forecasting challenge.

Abstract:
Snake robots enable mobility through extreme terrains and confined environments in terrestrial and space applications. However, robust perception and localization for snake robots remain an open challenge due to the proximity of the sensor payload to the ground coupled with a limited field of view. To address this issue, we propose Blind-motion with Intermittently Scheduled Scans (BLISS) which combines proprioception-only mobility with intermittent scans to be resilient against both localization failures and collision risks. BLISS is formulated as an integrated task and motion planning (TAMP) problem that leads to a chance-constrained hybrid partially observable Markov decision process (CC-HPOMDP), known to be computationally intractable due to the curse of history. Our novelty lies in reformulating CC-HPOMDP as a tractable, convex mixed integer linear program. This allows us to solve BLISS-TAMP significantly faster and jointly derive optimal task-motion plans. Simulations and hardware experiments on the EELS snake robot show our method achieves over an order of magnitude computational improvement compared to state-of-the-art POMDP planners and >50 % better navigation time optimality versus classical two-stage planners.

Abstract:
Robotic planning and execution in open-world environments is a complex problem due to the vast state spaces and high variability of task embodiment. Recent advances in perception algorithms, combined with Large Language Models (LLMs) for planning, offer promising solutions to these challenges, as the common sense reasoning capabilities of LLMs provide a strong heuristic for efficiently searching the action space. However, prior work fails to address the possibility of hallucinations from LLMs, which results in failures to execute the planned actions largely due to logical fallacies at high-or low-levels. To contend with automation failure due to such hallucinations, we introduce ConceptAgent, a natural language-driven robotic platform designed for task execution in unstructured environments. With a focus on scalability and reliability of LLM-based planning in complex state and action spaces, we present innovations designed to limit these shortcomings, including 1) Predicate Grounding to prevent and recover from infeasible actions, and 2) an embodied version of LLM-guided Monte Carlo Tree Search with self reflection. ConceptAgent combines these planning enhancements with dynamic language aligned 3d scene graphs, and large multi-modal pretrained models to perceive, localize, and interact with its environment, enabling reliable task completion. In simulation experiments, ConceptAgent achieved a 19% task completion rate across three room layouts and 30 easy level embodied tasks outperforming other state-of-the-art LLM-driven reasoning baselines that scored 10.26% and 8.11% on the same benchmark. Additionally, ablation studies on moderate to hard embodied tasks revealed a 20% increase in task completion from the baseline agent to the fully enhanced ConceptAgent, highlighting the individual and combined contributions of Predicate Grounding and LLM-guided Tree Search to enable more robust automation in complex state and action spaces. Additionally, in real-world mobile manipulation trials, conducted in randomized, low-clutter environments, a ConceptAgent-driven Spot robot achieved a 40% task completion rate, demonstrating the performance of our perception system in real-world scenarios.

Abstract:
Object reorientation is a critical task for robotic grippers, especially when manipulating objects within constrained environments. The task poses significant challenges for motion planning due to the high-dimensional output actions with the complex input information, including unknown object properties and nonlinear contact forces. Traditional approaches simplify the problem by reducing degrees of freedom, limiting contact forms, or acquiring environment/object information in advance—significantly compromising adaptability. To address these challenges, we deconstruct the complex output actions into three fundamental types based on tactile sensing: task-oriented actions, constraint-oriented actions, and coordinating actions. These actions are then optimized online using gradient optimization to enhance adaptability. Key contributions include simplifying contact state perception, decomposing complex gripper actions, and enabling online action optimization for handling unknown objects or environmental constraints. Experimental results demonstrate that the proposed method is effective across a range of everyday objects, regardless of environmental contact. Additionally, the method exhibits robust performance even in the presence of unknown contacts and nonlinear external disturbances.

Abstract:
We introduce a semidefinite relaxation for optimal control of linear systems with time scaling. These problems are inherently nonconvex, since the system dynamics involves bilinear products between the discretization time step and the system state and controls. The proposed relaxation is closely related to the standard second-order semidefinite relaxation for quadratic constraints, but we carefully select a subset of the possible bilinear terms and apply a change of variables to achieve empirically tight relaxations while keeping the computational load light. We further extend our method to handle piecewise-affine (PWA) systems by formulating the PWA optimal-control problem as a shortest-path problem in a graph of convex sets (GCS). In this GCS, different paths represent different mode sequences for the PWA system, and the convex sets model the relaxed dynamics within each mode. By combining a tight convex relaxation of the GCS problem with our semidefinite relaxation with time scaling, we can solve PWA optimal-control problems through a single semidefinite program.

Abstract:
This study presents an innovative approach to optimal gait control for a soft quadruped robot enabled by four compressible tendon-driven soft actuators. Soft quadruped robots, compared to their rigid counterparts, are widely recognized for offering enhanced safety, lower weight, and simpler fabrication and control mechanisms. However, their highly deformable structure introduces nonlinear dynamics, making precise gait locomotion control complex. To solve this problem, we propose a novel model-based reinforcement learning (MBRL) method. The study employs a multi-stage approach, including state space restriction, data-driven surrogate model training, and MBRL development. Compared to benchmark methods, the proposed approach significantly improves the efficiency and performance of gait control policies. The developed policy is both robust and adaptable to the robot's deformable morphology. The study concludes by highlighting the practical applicability of these findings in real-world scenarios.

Abstract:
Robot-assisted catheterization has garnered a good attention for its potentials in treating cardiovascular diseases. However, advancing surgeon-robot collaboration still requires further research, particularly on task-specific automation. For instance, automated tool segmentation can assist surgeons in visualizing and tracking endovascular tools during procedures. While learning-based models have demonstrated state-of-the-art segmentation performances, generating ground-truth labels for fully-supervised methods is laborintensive, time consuming, and costly. In this study, we developed a weakly-supervised learning method that is based on multi-lateral pseudo labeling for tool segmentation in cardiovascular angiogram datasets. The method utilizes a modified U-Net architecture featuring one encoder and multiple laterally branched decoders. The decoders generate diverse pseudo labels under different perturbations to augment the available partial annotation for model training. A mixed loss function with shared consistency was adapted for this purpose. The weakly-supervised model was trained end-to-end and validated using partially annotated angiogram data from three cardiovascular catheterization procedures. Validation results show that the weakly-supervised model could perform closer to fully-supervised models. Furthermore, the proposed multi-lateral approach outperforms three well known weakly-supervised learning methods, offering the highest segmentation performance across the three angiogram datasets. Numerous ablation studies confirmed the model's consistent performance under different settings. Finally, the model was applied for tool segmentation in a robot-assisted catheterization experiments. The model enhanced visualization with high connectivity indices for guidewire and catheter, and a mean segmentation time of 35.26±11.29 ms per frame. This study provides a fast, stable, and less expensive method for segmentation and visualization of endovascular tools in robot-assisted cardiac catheterization.

Abstract:
The cross-modal 3D retrieval task aims to achieve mutual matching between text descriptions and 3D shapes. This has the potential to enhance the interaction between natural language and the 3D environment, especially within the realms of robotics and embodied artificial intelligence (AI) applications. However, the scarcity and expensiveness of 3D data constrain the performance of existing cross-modal 3D retrieval methods. These methods heavily rely on features derived from the limited number of 3D shapes, resulting in poor generalization ability across diverse scenarios. To address this challenge, we introduce SCA3D, a novel 3D shape and caption online data augmentation method for cross-modal 3D retrieval. Our approach uses the LLaVA model to create a component library, captioning each segmented part of every 3D shape within the dataset. Notably, it facilitates the generation of extensive new 3D-text pairs containing new semantic features. We employ both inter and intra distances to align various components into a new 3D shape, ensuring that the components do not overlap and are closely fitted. Further, text templates are utilized to process the captions of each component and generate new text descriptions. Besides, we use unimodal encoders to extract embeddings for 3D shapes and texts based on the enriched dataset. We then calculate fine-grained cross-modal similarity using Earth Mover's Distance (EMD) and enhance cross-modal matching with contrastive learning, enabling bidirectional retrieval between texts and 3D shapes. Extensive experiments show our SCA3D outperforms previous works on the Text2Shape dataset, raising the Shape-to-Text RR@1 score from 20.03 to 27.22 and the Text-to-Shape RR@1 score from 13.12 to 16.67. Codes can be found in https://github.com/3DAgentWorld/SCA3D.

Abstract:
Robust and accurate perception and prediction of the driving scenarios are crucial for autonomous driving vehicles (ADV). State-of-the-art ADV frameworks have evolved from conventional modular design to an end-to-end (E2E) pipeline that enables joint feature learning and optimization. However, the evaluation of uncertainties in the intermediate features propagated between perception and prediction units is missing in current E2E pipelines. Consequently, adverse and extreme environment factors may incur highly untrustworthy features that ultimately result in degraded perception and prediction. In this work, we propose a novel uncertainty-aware E2E visual perception and prediction framework that utilized Bird's Eye View (BEV) representations. A feature distribution estimation network is introduced to explicitly quantify the uncertainties in the intermediate BEV features extracted from the images. To better exploit temporal information and generate more robust features for scene prediction, an uncertainty-aware transformer is designed to utilize the guidance of the quantified feature uncertainty via the attention mechanism. In addition, an evidential decoder generates accurate future instance segmentations along with the associated uncertainties. Comprehensive experiments conducted on real-world dataset validate the superiority of our proposed framework over conventional pipelines. Codes are available at: https://github.com/Huang121381/UAPnP.

Abstract:
In autonomous driving, The perception capabilities of the ego-vehicle can be improved with roadside sensors, which can provide a holistic view of the environment. However, existing monocular detection methods designed for vehicle cameras are not suitable for roadside cameras due to viewpoint domain gaps. To bridge this gap and Improve ROAdside Monocular 3D object detection, we propose IROAM, a semantic-geometry decoupled contrastive learning framework, which takes vehicle-side and roadside data as input simultaneously. IROAM has two significant modules. In-Domain Query Interaction module utilizes a transformer to learn content and depth information for each domain and outputs object queries. Cross-Domain Query Enhancement To learn better feature representations from two domains, Cross-Domain Query Enhancement decouples queries into semantic and geometry parts and only the former is used for contrastive learning. Experiments demonstrate the effectiveness of IROAM in improving roadside detector's performance. The results validate that IROAM has the capabilities to learn cross-domain information.

Abstract:
Performing acrobatic maneuvers involving long aerial phases, such as precise dives or multiple backflips from significant heights, remains an open challenge in legged robot autonomy. Such aggressive motions often require accurate state predictions over long horizons with multiple contacts and extended flight phases. Most existing trajectory optimization (TO) methods rely on Euler or Runge-Kutta integration, which can accumulate significant prediction errors over long planning horizons. In this work, we propose a novel whole-body TO method using variational integration (VI) and full-body nonlinear dynamics for long-flight aggressive maneuvers. Compared to traditional Euler-based TO, our approach using VI preserves energy and momentum properties of the continuous-time system and reduces error between predicted and executed trajectories by factors of between 2 - 10 while achieving similar planning time. We successfully demonstrate long-flight triple backflips on a quadruped A1 robot model and backflips on a bipedal HECTOR robot model for various heights and distances, achieving landing angle errors of only a few degrees. In contrast, TO with Euler integration fails to achieve accurate landings in equivalent circumstances, e.g., with landing angle errors greater than 90° for triple backflips. We provide an open-source implementation of our VI -discretized TO to support further research on accurate dynamic maneuvers for multi-rigid-body robot systems with contact: https://github.com/DRCL-USC/VI_discretized_TO

Abstract:
Eye tracking is a key technology for gaze-based interactions in Extended Reality (XR), but traditional frame-based systems struggle to meet XR's demands for high accuracy, low latency, and power efficiency. Event cameras offer a promising alternative due to their high temporal resolution and low power consumption. In this paper, we present FACET (Fast and Accurate Event-based Eye Tracking), an end-to-end neural network that directly outputs pupil ellipse parameters from event data, optimized for real-time XR applications. The ellipse output can be directly used in subsequent ellipse-based pupil trackers. We enhance the EV-Eye dataset by expanding annotated data and converting original mask labels to ellipse-based annotations to train the model. Besides, a novel trigonometric loss is adopted to address angle discontinuities and a fast causal event volume event representation method is put forward. On the enhanced EV-Eye test set, FACET achieves an average pupil center error of \mathbf0. 2 0 pixels and an inference time of 0.53 ms, reducing pixel error and inference time by 1.6 × and 1.8 × compared to the prior art, EV-Eye, with 4.4 × and 11.7 × less parameters and arithmetic operations. The code is available at https://github.com/DeanJY/FACET.

Abstract:
Heavy-duty glass installation is a high-risk, precision-critical task in modern construction, traditionally performed through labor-intensive and error-prone manual methods. This paper presents a novel robotic framework that leverages diffusion-based self-supervised imitation learning from imperfect visual servoing demonstrations to achieve safe and precise glass installation. Specifically, our approach employs noisy and suboptimal demonstration data obtained via visual servoing to train a Denoising Diffusion Probabilistic Model (DDPM). This model iteratively refines installation trajectories, transforming them into smooth, precise, and collisionfree movements. Extensive experiments demonstrate that our method significantly surpasses conventional visual servoing and standard imitation learning baselines in terms of success rate, precision, and installation efficiency, while markedly improving operational safety. Our results establish a new benchmark for automating complex, high-risk tasks in construction robotics.

Abstract:
Billions of vascular access procedures are performed annually worldwide, serving as a crucial first step in various clinical diagnostic and therapeutic procedures. For pediatric or elderly individuals, whose vessels are small in size (typically 2 to 3 mm in diameter for adults and <1 mm in children), vascular access can be highly challenging. This study presents an image-guided robotic system aimed at enhancing the accuracy of difficult vascular access procedures. The system integrates a 6-DoF (Degrees of Freedom) robotic arm with a 3-DoF end-effector, ensuring precise navigation and needle insertion. Multi-modal imaging and sensing technologies have been utilized to endow the medical robot with precision and safety, while ultrasound (US) imaging guidance is specifically evaluated in this study. To evaluate in vivo vascular access in submillimeter vessels, we conducted ultrasound-guided robotic blood drawing on the tail veins (with a diameter of 0.7 ± 0.2 mm) of 40 rats. The results demonstrate that the system achieved a first-attempt success rate of 95%. The high first-attempt success rate in intravenous vascular access, even with small blood vessels, demonstrates the system's effectiveness in performing these procedures. This capability reduces the risk of failed attempts, minimizes patient discomfort, and enhances clinical efficiency.

Abstract:
Intubation is a critical medical procedure for securing airway patency in patients, but the inconsistent skill levels among medical practitioners necessitate the advancement of better robotic solutions. While orotracheal intubation robots have been widely developed, nasotracheal intubation remains essential in specific clinical scenarios. However, nasotracheal intubation robots are still underdeveloped and lack buffer protection mechanisms to ensure safety. This study presents a novel variable-stiffness nasotracheal intubation robot (NIR) with passive buffering. The proposed NIR is a modular platform capable of performing the main steps of nasotracheal intubation, validated through mannequin studies via teleoperation. We proposed a variable-stiffness fiberoptic bronchoscope (FOB) control module for the FOB distal end control, and validated its dual functionality in experiments: low-stiffness mode provides passive buffering during nasal cavity navigation, with a frontal peak force of 2.8 N and a lateral peak force of 0.12 N; high-stiffness mode enhances load-bearing capacity for near-glottis navigation, with a frontal bearing force of 4.9 N and a lateral bearing force of 0.42 N. Additionally, a compact (\mathbf74 × \mathbf64× \mathbf53 mm, 150 g) FOB feeding module with passive failure protection was designed to limit the max frontal impact force to 2.3 N.

Abstract:
With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at http://kuda-dynamics.github.io.

Abstract:
Embodied perception is essential for intelligent vehicles and robots in interactive environmental understanding. However, these advancements primarily focus on vision, with limited attention given to using 3D modeling sensors, restricting a comprehensive understanding of objects in response to prompts containing qualitative and quantitative queries. Recently, as a promising automotive sensor with affordable cost, 4D millimeter-wave radars provide denser point clouds than conventional radars and perceive both semantic and physical characteristics of objects, thereby enhancing the reliability of perception systems. To foster the development of natural language-driven context understanding in radar scenes for 3D visual grounding, we construct the first dataset, Talk2Radar, which bridges these two modalities for 3D Referring Expression Comprehension (REC). Talk2Radar contains 8,682 referring prompt samples with 20, 558 referred objects. Moreover, we propose a novel model, T-RadarNet, for 3D REC on point clouds, achieving State-Of-The-Art (SOTA) performance on the Talk2Radar dataset compared to counterparts. Deformable-FPN and Gated Graph Fusion are meticulously designed for efficient point cloud feature modeling and cross-modal fusion between radar and text features, respectively. Comprehensive experiments provide deep insights into radar-based 3D REC. We release our project at https://github.com/GuanRunwei/Talk2Radar.

Abstract:
Autonomous underwater navigation faces significant challenges due to the complexity of the environment, limited localization methods, and poor visibility. This paper investigates the performance of various reinforcement learning (RL) algorithms-Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO), Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), and Advantage Actor-Critic (A2C)-to improve navigation capabilities of low-cost underwater robots equipped with multi-modal sensors. Advanced depth estimation models such as MiDaS and Depth Anything, combined with domain randomization techniques, are employed to enhance the system's robustness and generalization across varying underwater conditions. The proposed approach integrates real-time sensor data and historical actions to enable 3D maneuvering in simulated environments, leading to significant improvements in sensor fusion, depth perception, and obstacle avoidance. Simulation results demonstrate that the combination of RL techniques with sensor fusion considerably improves mapless autonomous underwater exploration, providing a robust solution for navigating unstructured aquatic environments. The complete implementation is available in an open-source repository, https://github.com/eather0056/BlueROV_Nav_DRL.

Abstract:
Controlling AUVs can be challenging because of the effect of complex non-linear hydrodynamic forces acting on the robot, which are significant in water and cannot be ignored. The problem is exacerbated for small AUVs for which the dynamics can change significantly with payload changes and deployments under different hydrodynamic conditions. The common approach to AUV control is a combination of passive stabilization with added buoyancy on top and weights on the bottom, and a PID controller tuned for simple and smooth motion primitives. However, the approach comes at the cost of sluggish controls and often the need to re-tune controllers with configuration changes. In this paper, we propose a fast (trainable in minutes), reinforcement learning-based approach for full 6 degree of freedom (DOF) control of a thruster-driven AUVs, taking 6-DOF command-conditioned inputs direct to thruster outputs. We present a new, highly parallelized simulator for underwater vehicle dynamics. We demonstrate this approach through zero-shot sim-to-real (with no tuning) transfer onto a real AUV that produces comparable results to hand-tuned PID controllers. Furthermore, we show that domain randomization on the simulator produces policies that are robust to small variations in vehicle's physical parameters.

Abstract:
Multi-Robot Systems (MRS) are increasingly de-ployed for hazardous tasks in urban environments. Among many tasks, search and rescue remains challenging as it deals with exploration in an unknown indoor constrained environ-ment. For example, without global knowledge of the map of a building floor, it is not advantageous to choose one path over another at a corridor junction. Also, if the assigned frontiers are far from the robot, backtracking along a corridor will cost more than moving forward. Since exploration along corridors is similar to solving a maze, this paper examines classical maze-solving algorithms that are known to be computationally fast and lightweight, such as the Right Hand Rule (RHR), Random Mouse (RM), and more. The authors have identified two gaps that need to be addressed before these algorithms can be applied to physical MRS. Firstly, these algorithms are not designed for the cooperation of multiple agents in exploration. Secondly, they are often applied to only a low-fidelity simulation environment, which requires some work to make these algorithms transferable to work in the commonly used occupancy grid map environment. In this paper, the authors introduced RACE, a fast and lightweight collective urban exploration and search algorithm based on a modified and condensed version of the Ant Colony Optimization (ACO) algorithm. The proposed solution is successfully verified in a low-fidelity simulation, evaluated against other exploration and search algorithms like RHR and RM. An innovative approach of RACE Simulation to Physical implementation is presented and a physical system evaluation is performed to evaluate RACE against a Rapidly-Exploring Random Tree algorithm. Finally, the proposed solution is further verified with a physical experiment, in which a quadrupedal robot is assigned to explore part of a floor of SUTD, spanning approximately (55m × 40m). RACE also showed potential in handling challenging closed-loop and dead-end environments.

Abstract:
Learning to predict spatiotemporal (ST) environmental processes from a sparse set of samples collected autonomously is a difficult task from both a sampling perspective (collecting the best sparse samples) and from a learning perspective (predicting the next timestep). In this work, we focus on investigating the sample collection process via multirobot informative path planning. We present an approach for incorporating multi-robot informative path planning into a spatiotemporal adaptive sampling framework while considering path length constraints for sampling location selection. We also incorporate informative path planning to determine the best path to collect samples along while en route to collecting the desired sample. We achieve this in a decentralized manner by decoupling the process into two stages: the first stage uses our spatiotemporal mixture of Gaussian Processes (STMGP) model to determine the most informative sampling location via a mutual information lower bound heuristic and the second stage plans an informative path to collect the desired sample and other additional informative samples via submodular function optimization. Moreover, we effectively leverage peer-to-peer communication to enable coordination. Simulation results on real-world spatiotemporal data are provided to validate the effectiveness of our proposed approach.

Abstract:
In this paper we explore if human mental models of objects, even when flawed, can be integrated with a collaborative robot's decision making framework to allow it to make smarter choices under partial observability for different object-related tasks such as assembly and troubleshooting. We demonstrate how (1) these informative causal models can be extracted from humans through crowdsourcing, (2) object assembly and troubleshooting can be formulated as Partially Observable Markov Decision Processes (POMDPs) and (3) our extracted causal models can be incorporated into those models in the form of approximate priors. Finally, (4) we use systematic experimentation in simulation to demonstrate the success of this approach, with 2 X average improvement in reward observed for object assembly tasks, and 1.4 X average improvement in reward observed for troubleshooting tasks.

Abstract:
In-hand manipulation and grasping are fundamental yet often separately addressed tasks in robotics. For deriving in-hand manipulation policies, reinforcement learning has recently shown great success. However, the derived controllers are not yet useful in real-world scenarios because they often require a human operator to place the objects in suitable initial (grasping) states. Finding stable grasps that also promote the desired in-hand manipulation goal is an open problem. In this work, we propose a method for bridging this gap by leveraging the critic network of a reinforcement learning agent trained for in-hand manipulation to score and select initial grasps. Our experiments show that this method significantly increases the success rate of in-hand manipulation without requiring additional training. We also present an implementation of a full grasp manipulation pipeline on a real-world system, enabling autonomous grasping and reorientation even of un-wieldy objects. Website: aidx-lab. org/manipulation/icra25

Abstract:
Large Language Models (LLMs) have demonstrated remarkable ability in long-horizon Task and Motion Planning (TAMP) by translating clear and straightforward natural language problems into formal specifications such as the Planning Domain Definition Language (PDDL). However, real-world problems are often ambiguous and involve many complex constraints. In this paper, we introduce Constraints as Specifications through LLMs (CaStL), a framework that identifies constraints such as goal conditions, action ordering, and action blocking from natural language in multiple stages. CaStL translates these constraints into PDDL and Python scripts, which are then solved using an custom PDDL solver. Tested across three PDDL domains, CaStL significantly improves constraint handling and planning success rates from natural language specification in complex scenarios.

Abstract:
Advances in robot learning have enabled robots to generate skills for a variety of tasks. Yet, robot learning is typically sample inefficient, struggles to learn from data sources exhibiting varied behaviors, and does not naturally incorporate constraints. These properties are critical for fast, agile tasks such as playing table tennis. Modern techniques for learning from demonstration improve sample efficiency and scale to diverse data, but are rarely evaluated on agile tasks. In the case of reinforcement learning, achieving good performance requires training on high-fidelity simulators. To overcome these limitations, we develop a novel diffusion modeling approach that is offline, constraint-guided, and expressive of diverse agile behaviors. The key to our approach is a kinematic constraint gradient guidance (KCGG) technique that computes gradients through both the forward kinematics of the robot arm and the diffusion model to direct the sampling process. KCGG minimizes the cost of violating constraints while simultaneously keeping the sampled trajectory in-distribution of the training data. We demonstrate the effectiveness of our approach for time-critical robotic tasks by evaluating KCGG in two challenging domains: simulated air hockey and real table tennis. In simulated air hockey, we achieved a 25.4% increase in block rate, while in table tennis, we achieved a 17.3% increase in success rate compared to imitation learning baselines.

Abstract:
Autonomous in-pipe inspection robots can automatically navigate through complex pipeline networks and detect potential risks from corrosion and defects, demonstrating great potential for replacing costly manual inspections. However, there is no publicly available simulation environment where researchers can validate their in-pipe navigation algorithms as far as we know, and the navigation algorithms on constrained 3D pipe surface which is the critical software component are less discussed. Firstly, this paper proposes an open-source In-Pipe Navigation Development Environment. It contains various pipeline models, a magnetic wheel climbing robot model realized by the adhesion plugin, and baseline algorithms for navigation tasks. Secondly, a novel effective path planning method is introduced. Instead of planning based on surface structures, the proposed method plans based on pipeline axis and maps it into local path using the Frenet-Serret formula, thereby generating smooth, feasible, and efficient paths. Finally, we conduct both qualitative and quantitative experiments in the proposed simulation and real-world environments. The results show the usability of the development environment, also robustness and efficiency of the proposed planning method.

Abstract:
This paper introduces a cable-length-based extended Kalman filter (L-EKF) framework to estimate the end-effector pose of a cable-driven parallel robot (CDPR). The L-EKF fuses end-effector accelerometer and rate gyroscope measurements with cable-length measurements. The main contribution compared to prior CDPR pose estimation EKF methods is that the L-EKF framework does not require an iterative forward kinematics algorithm to be solved each time step, reducing the computation time of the EKF. Moreover, the L-EKF is amenable to the inclusion of colored measurement noise, which provides a more realistic quantification of the kinematic uncertainty present in the cable-length measurements. Experimental results demonstrate that the L-EKF is computationally more efficient than previous forward-kinematics-based EKF methods, as well as the moderate improvement in pose estimation provided by the colored noise model.

Abstract:
Multi-robot clusters are increasingly deployed in indoor environments, where effective communication and 3D perception are critical for coordinated operations. Monocular cameras, known for their lightweight design, cost-effectiveness, and versatility, present a promising solution for these tasks. However, relying solely on monocular cameras for comprehensive perception and communication presents significant challenges. To address this, we introduce MonoLDP, a novel system that leverages monocular cameras for depth estimation, mutual pose estimation, and visible light communication in indoor environments, providing an integrated framework to overcome these limitations. MonoLDP features a two-stage network: (1) a depth estimation module that infers depth from monocular images, and (2) a depth-guided 3D object recognition network for agent-relative localization and pose estimation. We created a custom dataset to validate the accuracy of MonoLDP. On our indoor dataset, MonoLDP outperforms the baseline by 43.39% in 3D detection and 42.39% in bird's-eye view detection, with an average localization error of 0.104 m and an orientation error of 1.66 degrees. Moreover, the depth estimation network demonstrates excellent performance on the NYU v2 dataset. Additionally, the system achieves a communication rate of 1.2 Kbps with a bit error rate below 10-2 at a distance of up to 4 m using LED arrays. Our code will be released at https://github.com/RavenLiang1005/MonoLDP.git.

Abstract:
This paper presents a novel approach for realtime 3D navigation in large-scale complex environments by introducing a hierarchical 3D visibility graph (V-graph) and an efficient path search method. The proposed algorithm addresses the computational challenges of V-graph construction and shortest path search on the graph simultaneously. By introducing hierarchical 3D V-graph construction with heuristic visibility update, the 3D V-graph is constructed in O\left(K \cdot n^2 \log n\right) time, which guarantees real-time performance. The proposed iterative divide-and-conquer path search method can achieve near-optimal path solutions within the constraints of realtime operations. The algorithm ensures efficient 3D V-graph construction and path search. Extensive simulated and realworld environments validated that our algorithm reduces the travel time by 42%, achieves up to 24.8% higher trajectory efficiency, and runs faster than most benchmarks by orders of magnitude in complex environments. The code and developed simulator have been open-sourced to facilitate future research.

Abstract:
Recently, 4D millimetre-wave radar exhibits more stable perception ability than LiDAR and camera under adverse conditions (e.g. rain and fog). However, low-quality radar points hinder its application, especially the odometry task that requires a dense and accurate matching. To fully explore the potential of 4D radar, we introduce a learning-based odometry framework, enabling robust ego-motion estimation from finite and uncertain geometry information. First, for sparse radar points, we propose a local completion to supplement missing structures and provide denser guideline for aligning two frames. Then, a context-aware association with a hierarchical structure flexibly matches points of different scales aided by feature similarity, and improves local matching consistency through correlation balancing. Finally, we present a window-based optimizer that uses historical priors to establish a coupling state estimation and correct errors of inter-frame matching. The superiority of our algorithm is confirmed on View-of-Delft dataset, achieving around a 50% performance improvement over previous approaches and delivering accuracy on par with LiDAR odometry. The code will be released at https://github.com/NEU-REAL/CAO-RONet.

Abstract:
Autonomous exploration in large environments often leads to inefficient long backtracking, as distant targets are prioritized over closer ones. In this work, a hierarchical planning method is proposed, which employs region partitioning to systematically address the aforementioned issue. The space is dynamically partitioned at a coarse resolution, and as exploration progresses, regions with sufficient known areas are further subdivided to locate unknown areas more precisely. A utility function considering unknown area size, travel distance and sequence similarity is used, and the simulated annealing algorithm generates a subregion sequence for global guidance. Within each subregion, a linear acceleration model helps select target points. This method reduces computational load and minimizes long-distance backtracking, enabling more efficient high-frequency planning. Extensive simulations and real-world tests show that our method significantly improves exploration efficiency compared to existing vision-based techniques.

Abstract:
We present a strategy for te problem of robotic button mushroom harvesting (Agaricus Bisporus) that involves a Real2Sim2Real pipeline with dynamic scene reconstruction and a Model Predictive Path Integral (MPPI) control & planning architecture for generating optimal uprooting motion primitives based on a physics engine simulation framework. Given the complex, non-linear, anisotropic material properties of the mushrooms in combination with the multiple failure-mode modalities involved, we design a simulation framework around the PyBullet rigid-body-physics engine by utilizing first-order approximations of the equivalent continuum mechanics models. By exploiting the computational efficiency of the aforementioned simulation framework, we directly apply the MPPI control framework to generate offline optimal mushroom uprooting motion primitives, defining a set of cost objectives for an optimal and within-constraint harvesting plan. We show that with this planning strategy, the “root-bending” action emerges autonomously for the case of a single mushroom as an optimal uprooting maneuver, which corresponds well to empirical knowledge obtained by expert pickers. A video demonstration of the proposed architecture can be found in https://youtu.be/k38ePBsBego.

Abstract:
Due to labor shortages in specialty crop industries, a need for robotic automation to increase agricultural efficiency and productivity has arisen. Previous manipulation systems harvest well in uncluttered and structured environments. High tunnel environments are more compact and cluttered in nature, requiring a rethinking of the large form factor systems and grippers. We propose a novel co-designed framework incorporating a global detection camera and a local eye-in-hand camera that demonstrates precise localization of small fruits via closed-loop visual feedback and reliable error handling. Field experiments in high tunnels show that our system can reach 85.0% of cherry tomato fruit in 10.98s on average.

Abstract:
Recent advancements in robot tool use have unlocked their usage for novel tasks, yet the predominant focus is on rigid-body tools, while the investigation of soft-body tools and their dynamic interaction with rigid bodies remains unexplored. This paper takes a pioneering step towards dynamic one-shot soft tool use for manipulating rigid objects, a challenging problem posed by complex interactions and unobservable physical properties. To address these problems, we propose the Implicit Physics-aware (IPA) policy, designed to facilitate effective soft tool use across various environmental configurations. The IPA policy conducts system identification to implicitly identify physics information and predict goal-conditioned, one-shot actions accordingly. We validate our approach through a challenging task, i.e., transporting rigid objects using soft tools such as ropes to distant target positions in a single attempt under unknown environment physics parameters. Our experimental results indicate the effectiveness of our method in efficiently identifying physical properties, accurately predicting actions, and smoothly generalizing to real-world environments. The related video is available at: https://youtu.be/4hPrUDTc4Rg?si=WUZrT2vjLMt8qRWA

Abstract:
Aerial-terrestrial amphibious robots excel in search and rescue tasks in unstructured terrains but face challenges in autonomous navigation indoors. Traditional full-mapping methods can degrade global path planning performance, especially when semi-static obstacles shift, leading to suboptimal paths. We propose a method for constructing building structure grid maps that are unaffected by semistatic obstacles. Our approach includes a building structure recognition algorithm based on an octree structure to differentiate between occupied and free grid cells. Experimental results demonstrate that coverage path planning on building structure grid maps produces superior global paths compared to traditional grid maps, offering a more streamlined and robust solution for autonomous navigation of aerial-terrestrial amphibious robots in indoor environments.

Abstract:
Unmanned aerial vehicles equipped with modern vision algorithms are crucial for missions such as reconstruction and target acquisition. However, when deployed in the field, undulating terrain can cause significant fluctuations in image scale and degrade the performance of vision algorithms. Instead of developing specialized image processing schemes with limited adaptability, this paper presents a novel 3D visual coverage algorithm that is compatible with existing generic vision algorithms and maintains a uniform image scale for ground targets. In detail, photogrammetric constraints are initially introduced to generate aerial waypoints, and then the negative effects of valley clustering are addressed. Elastic Photogrammetric Constraints (EPC) are further proposed to eliminate valley clustering effects induced by saddle terrain. The experimental results demonstrate that EPC reduces the traversal path length by up to 37.38 % compared to the previous work, but with a minor trade-off in scale variations.

Abstract:
The scarcity of diverse and high-quality data impedes the quest to build a generalist robotic system. Current robotics data collection efforts face many challenges: the need for physical robotic hardware, setting up the environment, frequent resets, and the fatigue for data collectors operating real robots. We introduce DART, a teleoperation platform designed for crowdsourcing that reimagines robotic data collection by leveraging cloud-based simulation and augmented reality (AR) to address many limitations of prior data collection efforts. User studies show that DART enables higher data collection throughput and lower physical fatigue than real-world teleoperation. We also demonstrate that policies trained using DART-collected datasets successfully transfer to reality and are robust to unseen visual disturbances. All data collected through DART is automatically stored in a cloud-hosted database, DexHub, paving the path for an ever-growing data hub for robot learning. https://dexhub.ai/project

Abstract:
The role of robot following is crucial for effective human-robot collaboration. Traditional methods often rely on maintaining a significant distance between the robot and the human, which limits interaction and responsiveness. In contrast, close-proximity front-following facilitates immediate engagement, enhancing user experience and improving human-robot interaction. Nonetheless, it presents challenges in accurately interpreting human walking intentions due to a restricted observational field. In our paper, we introduce an innovative Depth-Temporal Attention Network that takes lower-limb depth images and robot motor signals as input, to accurately predict human walking intentions. This network leverages a depth attention module to capture essential spatial features and integrates a temporal attention mechanism to analyze movement dynamics. To enhance generalization, we use a domain adversarial module that focuses on shared features across diverse walking data, ensuring consistent performance across users. Experimental results demonstrate that our approach achieves an impressive average intention prediction accuracy of 91.09%, significantly surpassing baseline models by 12.59% to 23.66%. Additionally, an ablation study reveals that the depth-attention module substantially improves the model's understanding of depth features, resulting in an 11.44% increase in accuracy. With this high prediction accuracy, smooth front-following is achieved at close-proximity.

Abstract:
D Gaussian Splatting has shown remarkable capabilities in novel view rendering tasks and exhibits significant potential for multi-view optimization. However, the original 3D Gaussian Splatting lacks color representation for inputs in lowlight environments. Simply using enhanced images as inputs would lead to issues with multi-view consistency, and current single-view enhancement systems rely on pre-trained data, lacking scene generalization. These problems limit the application of 3D Gaussian Splatting in low-light conditions in the field of robotics, including high-fidelity modeling and feature matching. To address these challenges, we propose an unsupervised multiview stereoscopic system based on Gaussian Splatting, called Low-Light Gaussian Splatting (LLGS). This system aims to enhance images in low-light environments while reconstructing the scene. Our method introduces a decomposable Gaussian representation called M-Color, which separately characterizes color information for targeted enhancement. Furthermore, we propose an unsupervised optimization method with zeroknowledge priors, using direction-based enhancement to ensure multi-view consistency. Experiments conducted on real-world datasets demonstrate that our system outperforms state-of-theart methods in both low-light enhancement and 3D Gaussian Splatting.

Abstract:
Realistic 3D object and scene reconstruction is pivotal in advancing fields such as world model simulation and embodied intelligence. In this paper, we introduce Hash-GS, a storage-efficient method for large-scale scene reconstruction using anchor-based 3D Gaussian Splatting (3DGS). The vanilla 3DGS struggles with high memory demands due to the large number of primitives, especially in complex or extensive scenes. Hash-GS addresses these challenges with a compact representation by leveraging high-dimensional features to parameterize primitive properties, stored in compact hash tables, which reduces memory usage while preserving rendering quality. It also incorporates adaptive anchor management to efficiently control the number of anchors and neural Gaussians. Additionally, we introduce an analytic 3D smoothing filter to mitigate aliasing and support Level-of-Detail for optimized rendering across varying intrinsic parameters. Experimental results on several datasets demonstrate that Hash-GS improves storage efficiency while maintaining competitive rendering performance, especially in large-scale scenes.

Abstract:
We present a low-cost method to automate tension calibration for tendon-driven continuum robots (TDCRs), particularly those lacking tension sensing. Our method utilizes Hall effect sensors to localize the robot's tip with respect to the one-dimensional trajectory it follows under individual tendon actuation. We propose two workflows for robots with and without a static model, making the method generalizable to other tendon-driven soft robots. We demonstrate our method's ability to repeatably tension the tendons through associated tendon displacements. The calibration approach's measured repeatability (\pm 0.03 ~\textmm) is also benchmarked against manual calibration on a TDCR prototype, and its accuracy in achieving target tensions is assessed ((0.06 \pm 0.20) \mathrmN). We further investigate how tension calibration impacts open-loop tracking accuracy, confirming the effectiveness of our method to enhance motion consistency in open-loop control and teleoperation.

Abstract:
Map-based LiDAR localization, while widely used in autonomous systems, faces significant challenges in degraded environments due to the lack of distinct geometric features. This paper introduces SuperLoc, a robust LiDAR localization package that addresses key limitations in existing methods. SuperLoc features a novel predictive alignment risk assessment technique, enabling early detection and mitigation of potential failures before optimization. This approach significantly improves performance in challenging scenarios such as corridors, tunnels, and caves. Unlike existing degeneracy mitigation algorithms that rely on post-optimization analysis and heuristic thresholds, SuperLoc evaluates the localizability of raw sensor measurements. Experimental results demonstrate significant performance improvements over state-of-the-art methods across various degraded environments. Our approach achieves a 54% increase in accuracy and exhibits better robustness. To facilitate further research, we release our implementation along with datasets from eight challenging scenarios.

Abstract:
Real-world applications of path planning must contend with complicated constraint and objective functions imposed by the surrounding operational and regulatory environment. Traditional methods such as PRM and RRT have asymptotic guarantees, but often struggle in practice with complex blackbox objective/constraint functions, especially in compute-limited situations. Continuous Belief Tree Search (CBTS) addresses these limitations by maintaining local estimates of the objective function in order to sample new nodes from continuous space, often giving high-quality solutions more quickly. However, CBTS requires careful tuning of a control duration parameter, which introduces a tradeoff between compute time and path cost/feasibility. In environments with complex costs and constraints, there may be no single control duration that gives good paths in short compute time. This paper proposes Trust Region CBTS (TR-CBTS), an extension of CBTS with an adaptive control duration parameter inspired by trust region methods. TR-CBTS adjusts control duration based on information from recently sampled candidate nodes, allowing longer control duration where possible to speed up compute time, and shortening control duration when precise navigation in environments with complex, unknown constraint and objective functions. We show TR-CBTS outperforms existing comparable planners for a realistic robotic path planning application in autonomous ship routing.

Abstract:
The paper presents a method to stabilize dynamic gait for a legged robot with embodied compliance. Our approach introduces a unified description for rigid and compliant bodies to approximate their deformation and a formulation for deformable multibody systems. We develop the centroidal composite predictive deformed inertia (CCPDI) tensor of a deformable multibody system and show how to integrate it with the standard-of-practice model predictive controller (MPC). Simulation shows that the resultant control framework can stabilize trot stepping on a quadrupedal robot with both rigid and compliant spines under the same MPC configurations. Compared to standard MPC, the developed CCPDI-enabled MPC distributes the ground reactive forces closer to the heuristics for body balance, and it is thus more likely to stabilize the gaits of the compliant robot. A parametric study shows that our method preserves some level of robustness within a suitable envelope of key parameter values.

Abstract:
In this paper we explore kinematics ranging from anguilliform to thunniform achieved in a self-sensing multi-degree-of-freedom soft robotic fish and analyze the effect of them on the swimming. First, we examine the characteristics of the bending actuators of the robotic fish. Then, we express the kinematics of the fish as a propagating wave parameterized by three bending amplitudes and a wavelength, which are determined by the flow rates and phase shift of the pumps. We capture various motion patterns generated by different actuator inputs and directly measure the thrust generated by each pattern. We observe that the robotic swimmer can reproduce two different modes of propulsion, that are embodied by two distinct morphological patterns in nature: anguilliform and thunniform. When neither of modes are activated, propulsion is zero or even negative. Finally, we estimate the stationary swimming speed by towing the undulating fish, which satisfies the slip condition (with the speed of the body wave matching the swimming velocity). The analysis of a wide range of kinematic patterns in this study, including two extreme cases of anguilliform and thunniform modes, will provide insights for comprehensive understanding the mechanics of efficient swimming.

Abstract:
In various robotic applications, understanding accurate object poses for robots is essential for high-precision tasks such as factory assembly or daily insertions. Tactile sensing, which compensates for visual information, offers rich texture-based or force-based data for object pose estimation. However, previous methods for pose estimation typically over-look dynamic situations, such as slippage of grasped objects or movement of contacted objects during interactions with the environment, thus increasing the complexity of pose estimation. To address these challenges, we propose an efficient method that utilizes visual and tactile sensing to estimate object poses through particle filtering. We leverage visual information to track the pose of the contacted object in real-time and estimate the pose changes of the grasped object using displacement data obtained from tactile sensors. Our experimental evaluation on 13 objects with diverse geometric shapes demonstrated the ability to estimate high-precision poses, which revealed the robot's powerful ability to cope with dynamic scenes for compelled motion of objects, proving our framework's adaptability in practical scenarios with uncertainty.

Abstract:
The capability to efficiently search for objects in complex environments is fundamental for many real-world robot applications. Recent advances in open-vocabulary vision models have resulted in semantically-informed object navigation methods that allow a robot to search for an arbitrary object without prior training. However, these zero-shot methods have so far treated the environment as unknown for each consecutive query. In this paper we introduce a new benchmark for zero-shot multi-object navigation, allowing the robot to leverage information gathered from previous searches to more efficiently find new objects. To address this problem we build a reusable open-vocabulary feature map tailored for real-time object search. We further propose a probabilistic-semantic map update that mitigates common sources of errors in semantic feature extraction and leverage this semantic uncertainty for informed multi-object exploration. We evaluate our method on a set of object navigation tasks in both simulation as well as with a real robot, running in real-time on a Jetson Orin AGX. We demonstrate that it outperforms existing state-of-the-art approaches both on single and multi-object navigation tasks. Additional videos, code and the multi-object navigation benchmark will be available on https://finnbsch.github.io/OneMap.

Abstract:
Robot decision-making in partially observable, real-time, dynamic, and multi-agent environments remains a difficult and unsolved challenge. Model-free reinforcement learning (RL) is a promising approach to learning decisionmaking in such domains, however, end-to-end RL in complex environments is often intractable. To address this challenge in the RoboCup Standard Platform League (SPL) domain, we developed a novel architecture integrating RL within a classical robotics stack, while employing a multi-fidelity sim2real approach and decomposing behavior into learned sub-behaviors with heuristic selection. Our architecture led to victory in the 2024 RoboCup SPL Challenge Shield Division. In this work, we fully describe our system's architecture and empirically analyze key design decisions that contributed to its success. Our approach demonstrates how RL-based behaviors can be integrated into complete robot behavior architectures.

Abstract:
Tracking controllers enable robotic systems to accurately follow planned reference trajectories. In particular, reinforcement learning (RL) has shown promise in the synthesis of controllers for systems with complex dynamics and modest online compute budgets. However, the poor sample efficiency of RL and the challenges of reward design make training slow and sometimes unstable, especially for high-dimensional systems. In this work, we leverage the inherent Lie group symmetries of robotic systems with a floating base to mitigate these challenges when learning tracking controllers. We model a general tracking problem as a Markov decision process (MDP) that captures the evolution of both the physical and reference states. Next, we prove that symmetry in the underlying dynamics and running costs leads to an MDP homomorphism, a mapping that allows a policy trained on a lower-dimensional “quotient” MDP to be lifted to an optimal tracking controller for the original system. We compare this symmetry-informed approach to an unstructured baseline, using Proximal Policy Optimization (PPO) to learn tracking controllers for three systems: the Particle (a forced point mass), the Astrobee (a fully-actuated space robot), and the Quadrotor (an underactuated system). Results show that a symmetry-aware approach both accelerates training and reduces tracking error at convergence.

Abstract:
Adversarial attacks exploit vulnerabilities in a model's decision boundaries through small, carefully crafted perturbations that lead to significant mispredictions. In 3D vision, the high dimensionality and sparsity of data greatly expand the attack surface, making 3D vision particularly vulnerable for safety-critical robotics. To enhance 3D vision's adversarial robustness, we propose a training objective that simultaneously minimizes prediction loss and mutual information (MI) under adversarial perturbations to contain the upper bound of misprediction errors. This approach simplifies handling adversarial examples compared to conventional methods, which require explicit searching and training on adversarial samples. However, minimizing prediction loss conflicts with minimizing MI, leading to reduced robustness and catastrophic forgetting. To address this, we integrate curriculum advisors in the training setup that gradually introduce adversarial objectives to balance training and prevent models from being overwhelmed by difficult cases early in the process. The advisors also enhance robustness by encouraging training on diverse MI examples through entropy regularizers. We evaluated our method on ModelNet40 and KITTI using PointNet, DGCNN, SECOND, and PointTransformers, achieving 2-5% accuracy gains on ModelNet40 and a 5-10% mAP improvement in object detection. Our code is publicly available at https://github.com/nstrndrbi/Mine-N-Learn.

Abstract:
We present a robotic table tennis platform that achieves a variety of hit styles and ball-spins with high precision, power, and consistency. This is enabled by a custom lightweight, high-torque, low rotor inertia, five degree-of-freedom arm capable of high acceleration. To generate swing trajectories, we formulate an optimal control problem (OCP) that constrains the state of the paddle at the time of the strike. The terminal position is given by a predicted ball trajectory, and the terminal orientation and velocity of the paddle are chosen to match various possible styles of hits: loops (topspin), drives (flat), and chops (backspin). Finally, we construct a fixed-horizon model predictive controller (MPC) around this OCP to allow the hardware to quickly react to changes in the predicted ball trajectory. We validate on hardware that the system is capable of hitting balls with an average exit velocity of 11 \mathrmm / \mathrms at an 88% success rate across the three swing types.

Abstract:
Generative AI systems have shown impressive capabilities in creating text, code, and images. Inspired by the importance of research in industrial Design for Assembly, we introduce a novel problem: Generative Design-for-RobotAssembly (GDfRA). The task is to generate an assembly based on a natural language prompt (e.g., “giraffe”) and an image of available physical components, such as 3D-printed blocks. The output is an assembly, a spatial arrangement of these components, accompanied by instructions for a robot to build it. The output geometry must 1) resemble the requested object and 2) be reliably assembled by a 6 DoF robot arm with a suction gripper. We then present Blox-Net, a GDfRA system that combines generative vision language models with well-established methods in computer vision, simulation, perturbation analysis, motion planning, and physical robot experimentation to solve a class of GDfRA problems without human supervision. Blox-Net achieved a Top-1 accuracy of \mathbf6 3. 5 % in the semantic accuracy of its designed assemblies. Six designs, after Blox-Net's automated pertubation redesign, were reliably assembled by a robot, achieving near-perfect success across \mathbf1 0 consecutive assembly iterations with human intervention only during reset prior to assembly. The entire pipeline from the textual word to reliable physical assembly is performed without human intervention. Project Page: https://bloxnet.org/

Abstract:
Recently, camera-radar fusion-based 3D object detection methods in bird's eye view (BEV) have gained attention due to the complementary characteristics and cost-effectiveness of these sensors. Previous approaches using forward projection struggle with sparse BEV feature generation, while those employing backward projection overlook depth ambiguity, leading to false positives. In this paper, to address the aforementioned limitations, we propose a novel camera-radar fusion-based 3D object detection and segmentation model named CRAB (Camera-Radar fusion for reducing depth Ambiguity in Backward projection-based view transformation), using a backward projection that leverages radar to mitigate depth ambiguity. During the view transformation, CRAB aggregates perspective view image context features into BEV queries. It improves depth distinction among queries along the same ray by combining the dense but unreliable depth distribution from images with the sparse yet precise depth information from radar occupancy. We further introduce spatial cross-attention with a feature map containing radar context information to enhance the comprehension of the 3D scene. When evaluated on the nuScenes open dataset, our proposed approach achieves a state-of-the-art performance among backward projection-based camera-radar fusion methods with 62.4% NDS and 54.0% mAP in 3D object detection.

Abstract:
Recently, camera-based Bird's-Eye View (BEV) representation has gained significant traction in 3D object detection. However, training high-performance BEV 3D detectors typically requires a large number of annotated samples, which can be costly. Traditional semi-supervised methods for BEV 3D object detection face challenges including loss of rich depth information, inconsistent object representations across spaces, and unreliable pseudo label generation, leading to decreased accuracy and performance. Addressing this challenge, we pioneer the introduction of a semi-supervised BEV 3D object detection framework. Our approach leverages a small set of labeled data alongside a larger set of unlabeled data, significantly reducing annotation costs while maintaining robust detection performance. Firstly, we propose a depth-based self-refinement module to generate high-quality and stable pseudo labels, which can effectively regulate training with noisy labels. Secondly, we designed a denoising labels regression module that integrates denoising for both labeled and unlabeled data. Thirdly, in order to alleviate object inconsistency, we propose a consistent object-guided alignment method to ensure the consistency of objects in multi-spaces. Finally, our method can be easily plugged into various BEV 3D detection networks. Extensive experiments show that the proposed method achieves a new state-of-the-art compared to various camera-based 3D detectors tested on multiple public autonomous driving datasets.

Abstract:
Ensuring the sterility of medical equipment, particularly endoscopes used in environments teeming with diverse pathogens and drug-resistant bacteria, is crucial for safe medical procedures. However, the complexity of endoscope reprocessing, which involves numerous dexterous manual manipulations, poses significant challenges. Achieving certification for sterilization requires precise, repetitive execution with strict tolerances. In this study, we propose a framework that automates the handling and storage of endoscopes right after the sterilization process and employs compliant collaborative robots to address these dexterous manipulation challenges. In the first stage, we identified the key manipulation skills involved in the process through observations and feedback from medical personnel. In the second stage, we proposed a system that employs a high-level action planner to orchestrate the removal and storage of endoscopes, integrating two collaborative robots and a linear unit. Through real-time force measurements, compliant control, task knowledge, and safety protocols, we establish a system that ensures the safety of both medical equipment and personnel in proximity. In our first experiment, we conducted 50 trials with a 100 % reliability rate. Each trial had an execution time of 102 seconds, with a variance of 1.2 seconds. In our second experiment, we performed 10 trials with a human obstructing the transfer path, facing away from the robot. In all cases, the system successfully and promptly detected the collision. This work pioneers the automation of medical reprocessing in sterile environments using tactile robots and addresses the associated challenges.

Abstract:
Legged locomotion is not just about mobility; it also encompasses crucial objectives such as energy efficiency, safety, and user experience, which are vital for real-world applications. However, key factors such as battery power consumption and stepping noise are often inaccurately modeled or missing in common simulators, leaving these aspects poorly optimized or unaddressed by current sim-to-real methods. Hand-designed proxies, such as mechanical power and foot contact forces, have been used to address these challenges but are often problem-specific and inaccurate. In this paper, we propose a data-driven framework for fine-tuning locomotion policies, targeting these hard-to-simulate objectives. Our framework leverages real-world data to model these objectives and incorporates the learned model into simulation for policy improvement. We demonstrate the effectiveness of our framework on power saving for quadruped locomotion, achieving a significant 24–28% net reduction in total power consumption from the battery pack at various speeds. In essence, our approach offers a versatile solution for optimizing hard-to-simulate objectives in quadruped locomotion, providing an easy-to-adapt paradigm for continual improving with real-world knowledge. Project page https://hard-to-sim.github.io/.

Abstract:
Multi-view image-based BEV (Bird's Eye View) 3D perception is gaining attention as an alternative to highcost LiDAR systems and has achieved notable success. However, there is a significant safety concern for future image-based BEV autonomous driving in low-light conditions (such as nighttime) while the limited research on BEV detectors for these scenes. In this paper, we attempt to enhance low-light BEV perception with illumination-guided feature fusion. We propose Retinex-BEVFormer, which uses illumination information generated by the Retinex theory to enhance the model's robustness to varying lighting conditions and improve detection performance in low-light scenes. Additionally, to address the illumination estimation discontinuity from multi-view images that can adversely affect detection, we propose the MVB-Retinex module, which balances illumination estimation by leveraging overlapping regions between adjacent images. Notably, our proposed method is a plug-and-play module that can be applied to any image-based BEV detector method and does not require any additional ground truth supervision. We conduct extensive experiments on the nuScenes dataset, validating our algorithm in nighttime and daytime scenes. Compared to the baseline, our algorithm achieves a 2.9% increase in mAP on the validation set with minimal computational cost, especially showing a 3.6% improvement in the nighttime scene. The experiments demonstrate that our Retinex-BEVFormer effectively improves detection performance under low light conditions and enhances performance under normal illumination, indicating increased robustness of the BEV detector.

Abstract:
Neural Signed Distance Fields (SDFs) provide a differentiable environment representation to readily obtain collision checks and well-defined gradients for robot navigation tasks. However, updating neural SDFs as the scene evolves entails re-training, which is tedious, time consuming, and inefficient, making it unsuitable for robot navigation with limited field-of-view in dynamic environments. Towards this objective, we propose a compositional framework of neural SDFs to solve robot navigation in indoor environments using only an onboard RGB-D sensor. Our framework embodies a dual mode procedure for trajectory optimization, with different modes using complementary methods of modeling collision costs and collision avoidance gradients. The primary stage queries the robot body's SDF, swept along the route to goal, at the obstacle point cloud, enabling swift local optimization of trajectories. The secondary stage infers the visible scene's SDF by aligning and composing the SDF representations of its constituents, providing better informed costs and gradients for trajectory optimization. The dual mode procedure combines the best of both stages, achieving a success rate of 98%, 14.4% higher than baseline with comparable amortized plan time on iGibson 2.0. We also demonstrate its effectiveness in adapting to realworld indoor scenarios. The video demonstrations and code are available at the https://stalhabukhari.github.io/icra25-sdf-dyn-nav.

Abstract:
In robotic task planning, symbolic planners using rule-based representations like PDDL are effective but struggle with long-sequential tasks in complicated environments due to exponentially increasing search space. Meanwhile, LLM-based approaches, which are grounded in artificial neural networks, offer faster inference and commonsense reasoning but suffer from lower success rates. To address the limitations of the current symbolic (slow speed) or LLM-based approaches (low accuracy), we propose a novel neuro-symbolic task planner that decomposes complex tasks into subgoals using LLM and carries out task planning for each subgoal using either symbolic or MCTS-based LLM planners, depending on the subgoal complexity. This decomposition reduces planning time and improves success rates by narrowing the search space and enabling LLMs to focus on more manageable tasks. Our method significantly reduces planning time while maintaining high success rates across three task planning domains, as well as real-world and simulated robotics environments. More details are available at http://graphics.ewha.ac.kr/LLMTAMP/.

Abstract:
This paper addresses a significant challenge in achieving collaborative tasks; how can a robot or multiple robots, endowed with a library of pre-learned primitive movements, generate multiple simultaneous coordinated robotic movements, adapting and optimizing those in the library, to complete one collaborative task? This work can thus be seen as a follow-up to the work with a motion presented as dynamic movement primitive (DMP) that now considers collaborative tasks and the existence of multiple robots/manipulators. Specifically, we start with a simple task using one DMP and extend it to accommodate the coordinated execution of multiple DMPs in robots with multiple manipulators or-alternatively-multiple robots with a single manipulator. We investigate mechanisms to jointly optimize multiple DMPs to perform one task in a coordinated fashion. The joint trajectory is built from initial DMPs learned for a single manipulator, and its optimization must comply with task-specific constraints. We illustrate the application of our approach both in a simulated environment and in a simulated and real Baxter robot.

Abstract:
In limbed robotics, end-effectors must serve dual functions, such as both feet for locomotion and grippers for grasping, which presents design challenges. This paper introduces a multi-modal end-effector capable of transitioning between flat and line foot configurations while providing grasping capabilities. MAGPIE integrates eight-axis force sensing using proposed mechanisms with Hall effect sensors, enabling both contact and tactile force measurements. We present a computational design framework for our sensing mechanism that accounts for noise and interference, allowing for desired sensitivity and force ranges and generating ideal inverse models. The hardware implementation of MAGPIE is validated through experiments, demonstrating its capability as a foot and verifying the performance of the sensing mechanisms, ideal models, and gated network-based models.

Abstract:
Integrating sensors into soft links with complex geometries without compromising their flexibility, precision, or structural integrity remains one of the main challenges in soft robotics. This article presents the design, fabrication, and electromechanical evaluation of a 3D-printed flexible strain sensor tailored for monitoring and controlling these links. By combining Fused Filament Fabrication (FFF) and Direct Ink Writing (DIW) technologies, we manufactured a sensor composed of a thermoplastic polyurethane (TPU) substrate and a pattern of silver (Ag) nanoparticles ink, ensuring high flexibility and conductivity. We performed electromechanical tests to assess the sensor's performance, including three-point bending tests, cyclic loading to evaluate its durability, and angular deflection measurements to confirm its precision in detecting bending angles. The sensor demonstrated efficient piezoresistive behavior within a defined working range between 3% and 8% of flexure strain with a Gauge Factor (GF) of 0.24 and stable repeatability. We also tested its integration into a soft link, showing that the sensor maintains flexibility and accuracy during deformation.

Abstract:
The development of autonomous driving has boosted the research on autonomous racing. However, existing local trajectory planning methods have difficulty planning trajectories with optimal velocity profiles at racetracks with sharp corners, thus weakening the performance of autonomous racing. To address this problem, we propose a local trajectory planning method that integrates Velocity Prediction based on Model Predictive Contouring Control (VPMPCC). The optimal parameters of VPMPCC are learned through Bayesian Optimization (BO) based on a proposed novel Objective Function adapted to Racing (OFR). Specifically, VPMPCC achieves velocity prediction by encoding the racetrack as a reference velocity profile and incorporating it into the optimization problem. This method optimizes the velocity profile of local trajectories, especially at corners with significant curvature. The proposed OFR balances racing performance with vehicle safety, ensuring safe and efficient BO training. In the simulation, the number of training iterations for OFR-based BO is reduced by 42.86 % compared to the state-of-the-art method. The optimal simulation-trained parameters are then applied to a real-world F1TENTH vehicle without retraining. During prolonged racing on a custom-built racetrack featuring significant sharp corners, the mean projected velocity of VPMPCC reaches \mathbf9 3. 1 8 % of the vehicle's handling limits. The released code is available at https://github.com/zhouhengli/VPMPCC.

Abstract:
Recent innovations in autonomous drones have facilitated time-optimal flight in single-drone configurations, and enhanced maneuverability in multi-drone systems by applying optimal control and learning-based methods. However, few studies have achieved time-optimal motion planning for multi-drone systems, particularly during highly agile maneuvers or in dynamic scenarios. This paper presents a decentralized policy network using multi-agent reinforcement learning for time-optimal multi-drone flight. To strike a balance between flight efficiency and collision avoidance, we introduce a soft collision-free mechanism inspired by optimization-based methods. By customizing PPO in a centralized training, decentralized execution (CTDE) fashion, we unlock higher efficiency and stability in training while ensuring lightweight implementation. Extensive simulations show that, despite slight performance tradeoffs compared to single-drone systems, our multi-drone approach maintains near-time-optimal performance with a low collision rate. Real-world experiments validate our method, with two quadrotors using the same network as in simulation achieving a maximum speed of 13.65 m/s and a maximum body rate of 13.4 rad/s in a 5.5 m × 5.5 m × 2.0 m space across various tracks, relying entirely on onboard computation [video33https://youtu.be/KACuFMtGGpo][code44https://github.com/KafuuChikai/Dashing-for-the-Golden-Snitch-Multi-Drone-RL].

Abstract:
This work demonstrates universal dynamic perching capabilities for quadrotors of various sizes and on surfaces with different orientations. By employing a non-dimensionalization framework and deep reinforcement learning, we systematically assessed how robot size and surface orientation affect landing capabilities. We hypothesized that maintaining geometric proportions across different robot scales ensures consistent perching behavior, which was validated in both simulation and experimental tests. Additionally, we investigated the effects of joint stiffness and damping in the landing gear on perching behaviors and performance. While joint stiffness had minimal impact, joint damping ratios influenced landing success under vertical approaching conditions. The study also identified a critical velocity threshold necessary for successful perching, determined by the robot's maneuverability and leg geometry. Overall, this research advances robotic perching capabilities, offering insights into the role of mechanical design and scaling effects, and lays the groundwork for future drone autonomy and operational efficiency in unstructured environments.

Abstract:
Safe autonomous navigation in a priori unknown environments is an essential skill for mobile robots to reliably and adaptively perform diverse tasks (e.g., delivery, inspection, and interaction) in unstructured cluttered environments. Hybrid metric-topological maps, constructed as a pose graph of local submaps, offer a computationally efficient world representation for adaptive mapping, planning, and control at the regional level. In this paper, we consider a pose graph of locally sensed star-convex scan regions as a metric-topological map, with star convexity enabling simple yet effective local navigation strategies. We design a new family of safe local scan navigation policies and present a perception-driven feedback motion planning method through the sequential composition of local scan navigation policies, enabling provably correct and safe robot navigation over the union of local scan regions. We introduce a new concept of bridging and frontier scans for automated key scan selection and exploration for integrated mapping and navigation in unknown environments. We demonstrate the effectiveness of our key-scan-based navigation and mapping framework using a mobile robot equipped with a 360° laser range scanner in 2D cluttered environments through numerical ROS-Gazebo simulations and real hardware experiments.

Abstract:
Robot swarms have the potential to help groups of people with social tasks, given their ability to scale to large numbers of robots and users. Developing multi-human-swarm interaction is therefore crucial to support multiple people interacting with the swarm simultaneously - which is an area that is scarcely researched, unlike single-human, single-robot or single-human, multi-robot interaction. Moreover, most robots are still confined to laboratory settings. In this paper, we present our work with MOSAIX, a swarm of robot Tiles, that facilitated ideation at a science museum. 63 robots were used as a swarm of smart sticky notes, collecting input from the public and aggregating it based on themes, providing an evolving visualization tool that engaged visitors and fostered their participation. Our contribution lies in creating a large-scale (63 robots and 294 attendees) public event, with a completely decentralized swarm system in real-life settings. We also discuss learnings we obtained that might help future researchers create multi-human-swarm interaction with the public.

Abstract:
Hand-eye calibration is an important task in vision-guided robotic systems and is crucial for determining the transformation matrix between the camera coordinate system and the robot end-effector. Existing methods, for multi-view robotic systems, usually rely on accurate geometric models or manual assistance, generalize poorly, and can be very complicated and inefficient. Therefore, in this study, we propose PlaneHEC, a generalized hand-eye calibration method that does not require complex models and can be accomplished using only depth cameras, which achieves the optimal and fastest calibration results using arbitrary planar surfaces like walls and tables. PlaneHEC introduces hand-eye calibration equations based on planar constraints, which makes it strongly interpretable and generalizable. PlaneHEC also uses a comprehensive solution that starts with a closed-form solution and improves it with iterative optimization, which greatly improves accuracy. We comprehensively evaluated the performance of PlaneHEC in both simulated and real-world environments and compared the results with other point-cloud-based calibration methods, proving its superiority. Our approach achieves universal and fast calibration with an innovative design of computational models, providing a strong contribution to the development of multi-agent systems and embodied intelligence.

Abstract:
Supernumerary Robotic Limbs (SRLs) can assist human motions by providing extra degrees of freedom (DoFs) and body support. The extra DoFs lead to larger design space in structure and control policies, which is complex and time-consuming with the traditional manual design process. In this pilot study, we proposed a novel morphology-controller co-optimization framework to automatically generate and optimize the SRL structure based on the locomotion task input. There are two layers, with the inner layer optimizing the controller to achieve human-robot synchronization, and the outer layer optimizing the morphology parameters for performance enhancement. We validated the proposed framework through simulations using SRLs in a load-bearing locomotion task. The results demonstrate that the controller optimization can automatically generate realistic gait patterns and stable human-robot synchronization, while the SRLs significantly improve the user's load-bearing capability. Additionally, the co-optimization process reduces both the manufacturing cost of the SRL and the torque on the joints. This approach shows potential for exhaustive exploration of the design space and acceleration of the design process. Future works will be done in a more realistic SRL generative design model and achieve Sim2Real for practical uses.

Abstract:
Our team developed a riding ballbot (called PURE) that is dynamically stable, omnidirectional, and driven by lean-to-steer control. A hands-free admittance control scheme (HACS) was previously integrated to allow riders with different torso functions to control the robot's movements via torso leaning and twisting. Such an interface requires motor coordination skills and could result in collisions with obstacles due to low proficiency. Hence, a shared controller (SC) that limits the speed of PURE could be helpful to ensure the safety of riders. However, the self-balancing dynamics of PURE could result in a weak control authority of its motion, in which the torso motion of the rider could easily disturb the tracking of the command speed dictated by the shared controller. Thus, we proposed an interactive hands-free admittance control scheme (iHACS), with the following features to improve the speed-tracking performance of PURE: control gain personalization module and interaction compensation module. Human riding tests of simple tasks, idle-keeping and speed-limiting, were conducted to compare the performance of HACS and iHACS. Two manual wheelchair users and two able-bodied individuals participated in this study. They were instructed to use “adver-sarial” torso motions that would tax the PURE's ability to keep the ballbot idling or below a set speed, i.e., competing objectives between rider and robot. In the idle-keeping tasks, iHACS demonstrated minimal translational motion and low command speed tracking RMSE, even with significant torso lean angles. During the speed-limiting task, where the commanded speed was saturated at 0.5 m/s, the system achieved an average maximum speed of 1.1 m/s with iHACS, compared with that of over 1.9 m/s with HACS. These results suggest that iHACS can enhance PURE's control authority over the rider, which enables PURE to provide physical interactions back to the rider and results in a collaborative rider-robot synergy.

Abstract:
To address the challenges associated with shape sensing of continuum manipulators (CMs) using Fiber Bragg Grating (FBG) optical fibers, we present a unique shape sensing assembly utilizing solely a single Optical Frequency Domain Reflectometry (OFDR) fiber attached to a flat nitinol wire (NiTi). Integrating this easy-to-manufacture unique sensor with a long and soft CM with 170 mm length, we performed different experiments to evaluate its C -, J -, and S-shape reconstruction ability. Results demonstrate phenomenal shape reconstruction accuracy for the performed C-shape (<3.14 ~\textmm tip error, < 2.54 mm shape error), J-shape (<1.91 ~\textmm tip error, <1.11 \mathbfm m shape error), and S-shape (<\mathbf1. 7 4 ~ m m tip error, <\mathbf1. 4 0 ~ m m shape error) experiments.

Abstract:
Partial hand amputations significantly affect the physical and psychosocial well-being of individuals, yet intuitive control of externally powered prostheses remains an open challenge. To address this gap, we developed a force-controlled prosthetic finger activated by electromyography (EMG) signals. The prototype, constructed around a wrist brace, functions as a supernumerary finger placed near the index, allowing for early-stage evaluation on unimpaired subjects. A neural network-based model was then implemented to estimate fingertip forces from EMG inputs, allowing for online adjustment of the prosthetic finger grip strength. The force estimation model was validated through experiments with ten participants, demonstrating its effectiveness in predicting forces. Additionally, online trials with four users wearing the prosthesis exhibited precise control over the device. Our findings highlight the potential of using EMG-based force estimation to enhance the functionality of prosthetic fingers.

Abstract:
Sit-to-stand and stand-to-sit motions are important in daily activities. However, elderly individuals often find these motions difficult to perform with declining lower limb strength, which causes a considerable reduction to their quality of life. In this study, a sensing method for controlling robotic assistive devices was proposed. This method utilizes electromyographic measurements and a deep neural network to predict motion initiation, and it estimates the timing of triggering assistive devices. Experimental results indicate that four muscle synergy patterns are required to represent the sit-to-stand and stand-to-sit motions together, with two of them being shared between both movements. Subsequently, a long short-term memory network was designed to forecast these two motions, and the result indicates that the prediction accuracy reached 92.95% ± 0.83% with forecasting time of 300 ms.

Abstract:
Many countries are developing an Urban Air Mobility (UAM) capability defining an Uncrewed Aircraft Systems (UAS) Traffic Management (UTM) architecture to allow safe UAS services in urban environments (e.g., delivery, inspection, air taxis, etc.). The main considerations are air worthiness, operator certification, air traffic management, C2 Link, detect and avoid (DAA), safety management, and security. In addition, if thousands of simultaneous UAS flights are to be achieved, it is not possible for them to be controlled individually by human operators. This makes it necessary to have a rigorous and safe automation methodology to handle such a number of flights. A lane-based airspace structure has been proposed which reduces the complexity of strategic deconfliction by providing UAS agents with a set of pre-defined airway corridors called lanes [1], [2]. This yields collateral benefits including UAS information privacy, robust contingency handling exploiting the lane structure, as well as improved observability and control of the air space. A robust set of UTM parameters and policies must be determined based on the performance characteristics of the deployed UAS platforms, and a methodology which constitutes a first step toward this end is proposed and demonstrated here. In order to realize this approach, a set of initial experiments have been performed to determine the constraints imposed by the UTM on UAS platform capabilities and vice versa. Initial implementation parameters and policies are defined. The major contribution here is a methodology to calibrate UTM safety parameters (e.g., headway, platform speed) in terms of specific platform models' operational characteristics. That is, UTM parameters are a function of platform and not some arbitrarily imposed values. Safety uncertainty is then characterized by the calibration method.

Abstract:
Large Language Models (LLMs) have demonstrated remarkable abilities in various language tasks, making them promising candidates for decision-making in robotics. Inspired by Hierarchical Reinforcement Learning (HRL), we propose Retrieval-Augmented Hierarchical in-context reinforcement Learning (RAHL), a novel framework that decomposes complex tasks into sub-tasks using an LLM-based high-level policy, in which a complex task is decomposed into sub-tasks by a high-level policy on-the-fly. The sub-tasks, defined by goals, are assigned to the low-level policy to complete. To improve the agent's performance in multi-episode execution, we propose Hindsight Modular Reflection (HMR), where, instead of reflecting on the full trajectory, we let the agent reflect on shorter sub-trajectories using intermediate goals to improve reflection efficiency. We evaluated the decision-making ability of the proposed RAHL in three benchmark environments, ALFWorld, Webshop, and HotpotQA, where the results show that RAHL can achieve an improvement in the performance of, respectively, 9%, 42%, and 10% in 5 execution episodes compared to state-of-the-art baselines. We also implemented RAHL on the Boston Dynamics SPOT robot, which is shown to effectively scan the environment, find entrances, and navigate to new rooms controlled by the LLM policy.

Abstract:
Robot-assisted bite acquisition involves picking up food items with varying shapes, compliance, sizes, and textures. Fully autonomous strategies may not generalize efficiently across this diversity. We propose leveraging feedback from the care recipient when encountering novel food items. However, frequent queries impose a workload on the user. We formulate human-in-the-loop bite acquisition within a contextual bandit framework and introduce LINUCB-QG, a method that selectively asks for help using a predictive model of querying work-load based on query types and timings. This model is trained on data collected in an online study involving 14 participants with mobility limitations, 3 occupational therapists simulating physical limitations, and 89 participants without limitations. We demonstrate that our method better balances task performance and querying workload compared to autonomous and always-querying baselines and adjusts its querying behavior to account for higher workload in users with mobility limitations. We validate this through experiments in a simulated food dataset and a user study with 19 participants, including one with severe mobility limitations. Please check out our project website at: emprise.cs.comell.edu/hilbiteacquisition/.

Abstract:
Optimal path planning requires finding a series of feasible states from the starting point to the goal to optimize objectives. Popular path planning algorithms, such as Effort Informed Trees (EIT), employ effort heuristics to guide the search. Effective heuristics are accurate and computationally efficient, but achieving both can be challenging due to their conflicting nature. This paper proposes Direction Informed Trees (DIT), a sampling-based planner that focuses on optimizing the search direction for each edge, resulting in goal bias during exploration. We define edges as generalized vectors and integrate similarity indexes to establish a directional filter that selects the nearest neighbors and estimates direction costs. The estimated direction cost heuristics are utilized in edge evaluation. This strategy allows the exploration to share directional information efficiently. DIT convergence faster than existing single-query, sampling-based planners on tested problems in \mathbbR^4 to \mathbbR^16 and has been demonstrated in real-world environments with various planning tasks. A video showcasing our experimental results is available at: https://youtu.be/2SX6QT2NOek.

Abstract:
Recovering the metric 3D shape from a single image is particularly relevant for robotics and embodied in-telligence applications, where accurate spatial understanding is crucial for navigation and interaction with environments. Usu-ally, the mainstream approaches achieve it through monocular depth estimation. However, without camera intrinsics, the 3D metric shape can not be recovered from depth alone. In this study, we theoretically demonstrate that depth serves as a 3D prior constraint for estimating camera intrinsics and uncover the reciprocal relations between these two elements. Motivated by this, we propose a collaborative learning framework for jointly estimating depth and camera intrinsics, named CoL3D, to learn metric 3D shapes from single images. Specifically, CoL3D adopts a unified network and performs collaborative optimization at three levels: depth, camera intrinsics, and 3D point clouds. For camera intrinsics, we design a canonical incidence field mechanism as a prior that enables the model to learn the residual incident field for enhanced calibration. Additionally, we incorporate a shape similarity measurement loss in the point cloud space, which improves the quality of 3D shapes essential for robotic applications. As a result, when training and testing on a single dataset with in-domain settings, CoL3D delivers outstanding performance in both depth estimation and camera calibration across several indoor and outdoor benchmark datasets, which leads to remarkable 3D shape quality for the perception capabilities of robots.

Abstract:
Dog guides offer an effective mobility solution for blind or visually impaired (BVI) individuals, but conventional dog guides have limitations including the need for care, potential distractions, societal prejudice, high costs, and limited availability. To address these challenges, we seek to develop a robot dog guide capable of performing the tasks of a conventional dog guide, enhanced with additional features. In this work, we focus on design research to identify functional and aesthetic design concepts to implement into a quadrupedal robot. The aesthetic design remains relevant even for BVI users due to their sensitivity toward societal perceptions and the need for smooth integration into society. We collected data through interviews and surveys to answer specific design questions pertaining to the appearance, texture, features, and method of controlling and communicating with the robot. Our study identified essential and preferred features for a future robot dog guide, which are supported by relevant statistics aligning with each suggestion. These findings will inform the future development of user-centered designs to effectively meet the needs of BVI individuals.

Abstract:
Recent advancements in Ultra-Wideband (UWB) radar technology have enabled contactless, non-line-of-sight vital sign monitoring, making it a valuable tool for healthcare. However, UWB radar's ability to capture sensitive physiological data, even through walls, raises significant privacy concerns, particularly in human-robot interactions and autonomous systems that rely on radar for sensing human presence and physiological functions. In this paper, we present Anti-Sensing, a novel defense mechanism designed to prevent unauthorized radarbased sensing. Our approach introduces physically realizable perturbations, such as oscillatory motion from wearable devices, to disrupt radar sensing by mimicking natural cardiac motion, thereby misleading heart rate (HR) estimations. We develop a gradient-based algorithm to optimize the frequency and spatial amplitude of these oscillations for maximal disruption while ensuring physiological plausibility. Through both simulations and real-world experiments with radar data and neural networkbased HR sensing models, we demonstrate the effectiveness of Anti-Sensing in significantly degrading model accuracy, offering a practical solution for privacy preservation.

Abstract:
This study presents an Actor-Critic Cooperative Compensated Model Predictive Controller (\textAC^3 \textMPC) designed to address unknown system dynamics. To avoid the difficulty of modeling highly complex dynamics and ensuring real-time control feasibility and performance, this work uses deep reinforcement learning with a model predictive controller in a cooperative framework to handle unknown dynamics. The model-based controller takes on the primary role as both controllers are provided with predictive information about the other. This improves tracking performance and retention of inherent robustness of the model predictive controller. We evaluate this framework for off-road autonomous driving on unknown deformable terrains that represent sandy deformable soil, sandy and rocky soil, and cohesive clay-like deformable soil. Our findings demonstrate that our controller statistically outperforms standalone model-based and learning-based controllers by upto 29.2% and 10.2%. This framework generalized well over varied and previously unseen terrain characteristics to track longitudinal reference speeds with lower errors. Furthermore, this required significantly less training data compared to purely learning-based controller, while delivering better performance even when under-trained.

Abstract:
Navigating large-scale outdoor environments requires complex reasoning in terms of geometric structures, environmental semantics, and terrain characteristics, which are typically captured by onboard sensors such as LiDAR and cameras. While current mobile robots can navigate such environments using pre-defined, high-precision maps based on hand-crafted rules catered for the specific environment, they lack commonsense reasoning capabilities, especially the traversability analysis, that most humans possess when navigating unknown outdoor spaces. To address this gap, we introduce the Global Navigation Dataset (GND), a large-scale dataset that integrates multi-modal sensory data, including 3D LiDAR point clouds and RGB and 360° images, as well as multi-category traversability maps (pedestrian walkways, vehicle roadways, stairs, off-road terrain, and obstacles) from ten university campuses. These environments encompass a variety of parks, urban settings, elevation changes, and campus layouts of different scales. The dataset covers approximately 2.7 ~\textkm^2 and includes at least 350 buildings in total. We also present a set of novel applications of GND to showcase its utility to enable global robot navigation, such as map-based global navigation, mapless navigation, and global place recognition. GND's website can be found at https://cs.gmu.edu/xiao/Research/GND/.

Abstract:
Despite significant advancements in large language models (LLMs) that enhance robot agents' understanding and execution of natural language (NL) commands, ensuring the agents adhere to user-specified constraints remains challenging, particularly for complex commands and long-horizon tasks. To address this challenge, we present three key insights, equivalence voting, constrained decoding, and domain-specific fine-tuning, which significantly enhance LLM planners' capability in handling complex tasks. Equivalence voting ensures consistency by generating and sampling multiple Linear Temporal Logic (LTL) formulas from NL commands, grouping equivalent LTL formulas, and selecting the majority group of formulas as the final LTL formula. Constrained decoding then uses the generated LTL formula to enforce the autoregressive inference of plans, ensuring the generated plans conform to the LTL. Domain-specific fine-tuning customizes LLMs to produce safe and efficient plans within specific task domains. Our approach, Safe Efficient LLM Planner (SELP), combines these insights to create LLM planners to generate plans adhering to user commands with high confidence. We demonstrate the effective-ness and generalizability of SELP across different robot agents and tasks, including drone navigation and robot manipulation. For drone navigation tasks, SELP outperforms state-of-the-art planners by 10.8% in safety rate (i.e., finishing tasks conforming to NL commands) and by 19.8% in plan efficiency. For robot manipulation tasks, SELP achieves 20.4% improvement in safety rate. Our datasets for evaluating NL-to-LTL and robot task planning will be released in github.com/lt-asset/selp.

Abstract:
We present closed-form expressions for marginalizing and conditioning Gaussians onto linear manifolds, and demonstrate how to apply these expressions to smooth non-linear manifolds through linearization. Although marginalization and conditioning onto axis-aligned manifolds are well-established procedures, doing so onto non-axis-aligned manifolds is not as well understood. We demonstrate the utility of our expressions through three applications: 1) approximation of the projected normal distribution, where the quality of our linearized approximation increases as problem non-linearity decreases; 2) covariance extraction in Koopman SLAM, where our covariances are shown to be consistent on a real-world dataset; and 3) covariance extraction in constrained GTSAM, where our covariances are shown to be consistent in simulation.

Abstract:
Several robotic frameworks have been recently developed to assist ophthalmic surgeons in performing complex vitreoretinal procedures such as subretinal injection of advanced therapeutics. These surgical robots show promising capabilities; however, most of them have to limit their working volume to achieve maximum accuracy. Moreover, the visible area seen through the surgical microscope is limited and solely depends on the eye posture. If the eye posture, trocar position, and robot configuration are not correctly arranged, the instrument may not reach the target position, and the preparation will have to be redone. Therefore, this paper proposes the optimization framework of the eye tilting and the robot positioning to reach various target areas for different patients. Our method was validated with an adjustable phantom eye model, and the error of this workflow was 0.13 ± 1.65 deg (rotational joint around Y axis), -1.40 ± 1.13 deg (around X axis), and 1.80 ± 1.51 mm (depth, Z). The potential error sources are also analyzed in the discussion section.

Abstract:
This paper addresses the optimization of human-robot collaborative work-cells before their physical deployment. Most of the times, such environments are designed based on the experience of the system integrators, often leading to sub-optimal solutions. Accurate simulators of the robotic cell, accounting for the presence of the human as well, are available today and can be used in the pre-deployment. We propose an iterative optimization scheme where a digital model of the work-cell is updated based on a genetic algorithm. The methodology focuses on the layout optimization and task allocation, encoding both the problems simultaneously in the design variables handled by the genetic algorithm, while the task scheduling problem depends on the result of the upper-level one. The final solution balances conflicting objectives in the fitness function and is validated to show the impact of the objectives with respect to a baseline, which represents possible initial choices selected based on the human judgment.

Abstract:
For pneumatic artificial muscles, it is always considered the more maximum contraction ratio the better. While for human joint assisting applications, PAMs with configurable maximum contraction rate are more suitable because of advantageous safety and adaptability. A PAM based on planar-to-specific-wave body shape morph is proposed in this work. Shape-morphing-based braided artificial muscles (SBAMs) have uniqueness of initial elasticity and maximum contraction ration programmability, which meet the favors of human joint assisting applications. The basic structure and working mechanism of contraction in SBAMs will be explained, and their mathematical model will also be established. According to the experimental results, a SBAM prototype generates a force more than 140 times its weight under an easily accessible pressure of 150 kPa. A mannequin wearing the SBAM enables actively flexes its elbow over 120 °.

Abstract:
Soft robots are known for their ability to perform tasks with great adaptability, enabled by their distributed, non-uniform stiffness and actuation. Bending is the most fundamental motion for soft robot design, but creating robust, and easy-to-fabricate soft bending joint with tunable properties remains an active problem of research. In this work, we demonstrate an inflatable actuation module for soft robots with a defined bending plane enabled by forced partial wrinkling. This lowers the structural stiffness in the bending direction, with the final stiffness easily designed by the ratio of wrinkled and unwrinkled regions. We show the stiffness properties of the actuation module through a first-principle model validated by experimental characterization, and demonstrate the module's ability to maintain the kinematic constraint over a large range of loading conditions. We illustrate how these properties give the potential for complex actuation in a soft continuum robot and for decoupling actuation force and efficiency from load capacity. The module provides a novel method for embedding intelligent actuation into soft pneumatic robots.

Affiliations: Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Hong Kong; Shenzhen International Graduate School, Tsinghua University, China; Department of Medicine and Therapeutics and Institute of Digestive Disease, The Chinese University of Hong Kong, Hong Kong; Department of Mechanical and Automation Engineering, T Stone Robotics Institute, Shun Hing Institute of Advanced Engineering, Multi-Scale Medical Robotics Center, and Institute of Medical Intelligence and XR, The Chinese University of Hong Kong, Hong Kong

Abstract:
Flexible endoscope motion tracking and analysis in mechanical simulators have proven useful for endoscopy training. Common motion tracking methods based on electromagnetic tracker are however limited by their high cost and material susceptibility. In this work, the motion-guided dual-camera vision tracker is proposed to provide robust and accurate tracking of the endoscope tip's 3D position. The tracker addresses several unique challenges of tracking flexible endoscope tip inside a dynamic, life-sized mechanical simulator. To address the appearance variation and keep dualcamera tracking consistency, the cross-camera mutual template strategy (CMT) is proposed by introducing dynamic transient mutual templates. To alleviate large occlusion and light-induced distortion, the Mamba-based motion-guided prediction head (MMH) is presented to aggregate historical motion with visual tracking. The proposed tracker achieves superior performance against state-of-the-art vision trackers, achieving 42% and 72% improvements against the second-best method in average error and maximum error. Further motion analysis involving novice and expert endoscopists also shows that the tip 3D motion provided by the proposed tracker enables more reliable motion analysis and more substantial differentiation between different expertise levels, compared with other trackers. Project page: https://github.com/PieceZhang/MotionDCTrack

Abstract:
The skill to drift a car-i.e., operate in a state of controlled oversteer like professional drivers-could give future autonomous cars maximum flexibility when they need to retain control in adverse conditions or avoid collisions. We investigate real-time drifting strategies that put the car where needed while bypassing expensive trajectory optimization. To this end, we design a reinforcement learning agent that builds on the concept of tire energy absorption to autonomously drift through changing and complex waypoint configurations while safely staying within track bounds. We achieve zero-shot deployment on the car by training the agent in a simulation environment built on top of a neural stochastic differential equation vehicle model learned from pre-collected driving data. Experiments on a Toyota GR Supra and Lexus LC 500 show that the agent is capable of drifting smoothly through varying waypoint configurations with tracking error as low as 10 cm while stably pushing the vehicles to sideslip angles of up to 63°.

Abstract:
An end-to-end (E2E) visuomotor policy is typically treated as a unified whole, but recent approaches using out-of-domain (OOD) data to pretrain the visual encoder have cleanly separated the visual encoder from the network, with the remainder referred to as the policy. We propose Visual Alignment Testing, an experimental framework designed to evaluate the validity of this functional separation. Our results indicate that in E2E-trained models, visual encoders actively contribute to decision-making resulting from motor data supervision, contradicting the assumed functional separation. In contrast, OOD-pretrained models, where encoders lack this capability, experience an average performance drop of 42% in our benchmark results, compared to the state-of-the-art performance achieved by E2E policies. We believe this initial exploration of visual encoders' role can provide a first step towards guiding future pretraining methods to address their decision-making ability, such as developing task-conditioned or context-aware encoders.

Abstract:
Cross-view geo-localization plays a critical role in Unmanned Aerial Vehicle (UAV) localization and navigation. However, significant challenges arise from the drastic viewpoint differences and appearance variations between images. Existing methods predominantly rely on semantic features from RGB images, often neglecting the importance of spatial structural information in capturing viewpoint-invariant features. To address this issue, we incorporate geometric structural information from normal images and introduce a Joint perception network to integrate RGB and Normal images (JRN-Geo). Our approach utilizes a dual-branch feature extraction framework, leveraging a Difference-Aware Fusion Module (DAFM) and Joint-Constrained Interaction Aggregation (JCIA) strategy to enable deep fusion and joint-constrained semantic and structural information representation. Furthermore, we propose a 3D geographic augmentation technique to generate potential viewpoint variation samples, enhancing the network's ability to learn viewpoint-invariant features. Extensive experiments on the University-1652 and SUES-200 datasets validate the robustness of our method against complex viewpoint variations, achieving state-of-the-art performance.

Abstract:
Robot assisted endoluminal intervention is an emerging technique for both benign and malignant luminal lesions. With vision-based navigation, when combined with pre-operative imaging data as priors, it is possible to recover position and pose of the endoscope without the need of additional sensors. In practice, however, aligning pre-operative and intra-operative domains is complicated by significant texture differences. Although methods such as style transfer can be used to address this issue, they require large datasets from both source and target domains with prolonged training times. This paper proposes an efficient domain transfer method based on stylized Gaussian splatting, only requiring a few of real images (10 images) with very fast training time. Specifically, the transfer process includes two phases. In the first phase, the 3D models reconstructed from CT scans are represented as differential Gaussian point clouds. In the second phase, only color appearance related parameters are optimized to transfer the style and preserve the visual content. A novel structure consistency loss is applied to latent features and depth levels to enhance the stability of the transferred images. Detailed validation was performed to demonstrate the performance advantages of the proposed method compared to that of the current state-of-the-art, highlighting the potential for intra-operative surgical navigation.

Abstract:
Autonomous Underwater Vehicles (AUVs) encounter significant challenges in confined spaces like ports and testing tanks, where vehicle-environment interactions, such as wave reflections and unsteady flows, introduce complex, time-varying disturbances. Model-based state estimation methods can struggle to handle these dynamics, leading to localization errors. To address this, we propose a data-driven velocity estimation approach using Inertial Measurement Units (IMUs) and a Gated Recurrent Unit (GRU) neural network, capturing temporal dependencies and rejecting external disturbances. This velocity estimator is then integrated into a sensor fusion framework using an asynchronous Kalman filter to improve localization by fusing on-board and off-board sensor information. Experimental validation on miniature AUVs demonstrates the effectiveness of the proposed method in enhancing accuracy for velocity and position estimation in environments with significant disturbances due to interactions between the vehicle and the environment.

Abstract:
This paper introduces a novel fan-out model that improves accuracy over previous models. The commonly used models rely on neglect time, the time an agent operates independently, which confounds both human and robot abilities. The proposed model separates neglect time into two functionally distinct concepts: the time a robot can operate self-sufficiently, and the time a human estimates the robot can do so. Previous research indicates fan-out is often overestimated. This work explains why robot ability provides an upper bound to fan-out, but that actual achieved fan-out is influenced by both the human and robot abilities. We conduct a study to validate this new model and show improved performance over the two most common fan-out models. The results show that both previous models overestimate as predicted. Using the new fan-out model, we show that as the difference between human estimation and robot abilities grows, the actual fan-out will fall further from the upper bound potential fan-out. By including assessments of both the robotic and human elements, the new model provides a more nuanced understanding of the dynamics at play and the factors involved in scaling Human Multi-Robot Teams.

Abstract:
Task allocation in multi-human multi-robot (MHMR) teams presents significant challenges due to the inherent heterogeneity of team members, the dynamics of task execution, and the information uncertainty of operational states. Existing approaches often fail to address these challenges simultaneously, resulting in suboptimal performance. To tackle this, we propose ATA-HRL, an adaptive task allocation framework using hierarchical reinforcement learning (HRL), which incorporates initial task allocation (ITA) that leverages team heterogeneity and conditional task reallocation in response to dynamic operational states. Additionally, we introduce an auxiliary state representation learning task to manage information uncertainty and enhance task execution. Through an extensive case study in large-scale environmental monitoring tasks, we demonstrate the benefits of our approach. More details are available on our website: https://sites.google.com/view/ata-hrl.

Abstract:
Multi-agent path planning is a critical challenge in robotics, requiring agents to navigate complex environments while avoiding collisions and optimizing travel efficiency. This work addresses the limitations of existing approaches by combining Gaussian belief propagation with path integration and introducing a novel tracking factor to ensure strict adherence to global paths. The proposed method is tested with two different global path-planning approaches: rapidly exploring random trees and a structured planner, which leverages predefined lane structures to improve coordination. A simulation environment was developed to validate the proposed method across diverse scenarios, each posing unique challenges in navigation and communication. Simulation results demonstrate that the tracking factor reduces path deviation by 28% in single-agent and 16% in multi-agent scenarios, highlighting its effectiveness in improving multi-agent coordination, especially when combined with structured global planning.

Abstract:
Exploring planetary bodies with lower gravity, such as the moon and Mars, allows legged robots to utilize jumping as an efficient form of locomotion thus giving them a valuable advantage over traditional rovers for exploration. Motivated by this fact, this paper presents the design, simulation, and learning-based “in-flight” attitude control of Olympus, a jumping legged robot tailored to the gravity of Mars. First, the design requirements are outlined followed by detailing how simulation enabled optimizing the robot's design - from its legs to the overall configuration - towards high vertical jumping, forward jumping distance, and in-flight attitude reorientation. Subsequently, the reinforcement learning policy used to track desired in-flight attitude maneuvers is presented. Successfully crossing the sim2real gap, extensive experimental studies of attitude reorientation tests are demonstrated.

Abstract:
Large Language Models (LLMs) excel at generating contextually relevant text but lack logical reasoning abilities. They rely on statistical patterns rather than logical inference, making them unreliable for structured decision-making. Integrating LLMs with task planning can address this limitation by combining their natural language understanding with the precise, goal-oriented reasoning of planners. This paper introduces ViPlan, a hybrid system that leverages Vision Language Models (VLMs) to extract high-level semantic information from visual and textual inputs while integrating classical planners for logical reasoning. ViPlan utilizes VLMs to generate syntactically correct and semantically meaningful PDDL problem files from images and natural language instructions, which are then processed by a task planner to generate an executable plan. The entire process is embedded within a behavior tree framework, enhancing efficiency, reactivity, replanning, modularity, and flexibility. The generation and planning capabilities of ViPlan are empirically evaluated with simulated and real-world experiments.

Abstract:
Curriculum learning is a training mechanism in reinforcement learning (RL) that facilitates the achievement of complex policies by progressively increasing the task difficulty during training. However, designing effective curricula for a specific task often requires extensive domain knowledge and human intervention, which limits its applicability across various domains. Our core idea is that large language models (LLMs), with their extensive training on diverse language data and ability to encapsulate world knowledge, present significant potential for efficiently breaking down tasks and decomposing skills across various robotics environments. Additionally, the demonstrated success of LLMs in translating natural language into executable code for RL agents strengthens their role in generating task curricula. In this work, we propose CurricuLLM, which leverages the high-level planning and programming capabilities of LLMs for curriculum design, thereby enhancing the efficient learning of complex target tasks. CurricuLLM consists of: (Step 1) Generating a sequence of subtasks that aid target task learning in natural language form, (Step 2) Translating natural language description of subtasks in executable task code, including the reward code and goal distribution code, and (Step 3) Evaluating trained policies based on trajectory rollout and subtask description. We evaluate Cur-ricuLLM in various robotics simulation environments, ranging from manipulation, navigation, and locomotion, to show that CurricuLLM can aid learning complex robot control tasks. In addition, we validate humanoid locomotion policy learned through CurricuLLM in the real-world. Project website is https://iconlab.negarmehr.com/CurricuLLM/

Abstract:
Most existing event-based camera pose relocalization (CPR) learning methods implicitly encode environmental information into network parameters to achieve end-to-end mapping from event stream to pose. However, these end-to-end CPR methods fail to utilize prior environmental information effectively. As the scale of the environment increases, the difficulty of this mapping relationship grows significantly, reducing the robustness of the end-to-end methods across different scenarios. To address the above issues, this paper proposes the first coarse-to-fine event-based CPR framework, which achieves a new paradigm from end-to-end pose regression network to a hierarchical approach. In the coarse localization stage, we effectively encode similarity features by incorporating the fine-grained temporal information, achieving accurate retrieval of nearby event stream. In the pose refinement stage, we present an Event Spatio-temporal Pose Refinement Network (ESPR-Net) based on the Recurrent Convolutional Neural Networks (RCNN) architecture, which is capable of learning more nu-anced spatio-temporal features to achieve accurate regression of the relative pose. Finally, we conducted a comprehensive comparison on the IJRR and M3ED dataset, achieving state-of-the-art (SOTA) performance on both. Notably, our method attains a significant 83 % performance improvement on the outdoor M3ED dataset.

Abstract:
Augmented Reality (AR) offers potential for enhancing human-robot collaboration by enabling intuitive interaction and real-time feedback. A crucial aspect of AR-robot integration is accurate spatial registration to align virtual content with the physical robotic workspace. This paper systematically investigates the effects of different tracking techniques and registration parameters on AR-to-robot registration accuracy, focusing on paired-point methods. We evaluate four marker detection algorithms - ARToolkit, Vuforia, ArUco, and retroreflective tracking - analyzing the influence of viewing distance, angle, marker size, point distance, distribution, and quantity. Our results show that ARToolkit provides the highest registration accuracy. While larger markers and positioning registration point centroids close to target locations consistently improved accuracy, other factors such as point distance and quantity were highly dependent on the tracking techniques used. Additionally, we propose an effective refinement method using point cloud registration, significantly improving accuracy by integrating data from points recorded between registration locations. These findings offer practical guidelines for enhancing AR-robot registration, with future work needed to assess the transferability to other AR devices and robots.

Abstract:
Direct teaching in collaborative manipulators, an essential method for intuitive trajectory control, faces significant challenges due to friction in robot joints. To address this, we present a novel friction compensation framework to improve direct teaching methods for robots. Our approach focuses on mitigating friction in the joints most susceptible to frictional effects, ensuring smoother and more precise motion. The proposed framework uses deep neural networks (DNN) to model the complex friction behavior. This approach circumvents the difficulties associated with traditional friction compensation model selection. We develop specific data input preprocessing algorithms that optimize friction estimation when paired with standard encoders commonly used in collaborative robots. In addition, our custom loss function is specifically designed to improve DNN training in these low-velocity regions. To evaluate the effectiveness of our framework, we conduct comprehensive ablation studies assessing the impact of two critical components: the preprocessing algorithms and the custom loss function. These studies provide insight into the contributions of each element to overall performance. Experimental validation using two 6-DoF collaborative robots demonstrates the practical applicability and effectiveness of our approach.

Abstract:
Reliable 6D pose estimation is crucial for robotic tasks but presents significant challenges in environments with occlusion. Recent approaches tend to directly predict pose parameters of object with deep neural networks, lacking the modeling ability of non-adjacent and complex relationships of surface points in occluded scenarios. To solve this problem, we propose a novel occlusion-aware 6D pose estimation framework, which uses depth-guided graph neural network (GNN) to model potential relationships from RGBD input. Two semantic information, which are mask and binary code of object, are adaptively fused to extract 2D-3D correspondence related features in an effective manner. Both enhanced graph features and fused semantic information contribute to the performance improvement of pose estimation with occlusion. Extensive experiments indicate that our approach outperforms comparative methods by 1.2% and 1.9% on LMO and YCBV datasets (up to 30% for certain objects) and its validity is also verified under real-world pose estimation test.

Abstract:
This study presents a comprehensive experimental analysis of Twisted String Actuators (TSA), focused on enhancing contraction modelling accuracy and establishing a baseline for TSA tension and impedance control efficacy. A novel TSA string radius function is introduced, computing effective radii for multi-strand bundles based on axial actuator tension. The proposed model was validated in physical experiments, resulting in a reduction of maximal errors between measured and simulated actuator contraction trajectories from up to 60 % in established models to around 10% in our work. Additionally, the tension-dependent radius modification effectively reduced errors between the estimated and the measured bundle tension by an order of magnitude, marking an essential step towards TSA control independent of bundle tension measurements. TSA tension control was assessed based on four metrics: accu-racy, precision, impact stability, and bandwidth, following ISO 9283:1998 standards. The quality of tension control was found to be dependent on bundle tension, twisting angle and strand quantity, whereas impact stability was maintained in all config-urations. Joint impedance control with TSA was evaluated for perturbation stability and position control bandwidth, where the latter was enhanced with increasing joint stiffness. The presented analysis informs designers about the capabilities of TSAs in different configurations, and their respective suitability for desired applications.

Abstract:
Fisheye cameras offer robots the ability to capture human movements across a wider field of view (FOV) than standard pinhole cameras, making them particularly useful for applications in human-robot interaction and automotive contexts. However, accurately detecting human poses in fisheye images is challenging due to the curved distortions inherent to fisheye optics. While various methods for undistorting fisheye images have been proposed, their effectiveness and limitations for poses that cover a wide FOV has not been systematically evaluated in the context of absolute human pose estimation from monocular fisheye images. To address this gap, we evaluate the impact of pinhole, equidistant and double sphere camera models, as well as cylindrical projection methods, on 3D human pose estimation accuracy. We find that in close-up scenarios, pinhole projection is inadequate, and the optimal projection method varies with the FOV covered by the human pose. The usage of advanced fisheye models like the double sphere model significantly enhances 3D human pose estimation accuracy. We propose a heuristic for selecting the appropriate projection model based on the detection bounding box to enhance prediction quality. Additionally, we introduce and evaluate on our novel FISHnCHIPS dataset, which features 3D human skeleton annotations in fisheye images, including images from unconventional angles, such as extreme close-ups, ground-mounted cameras, and wide-FOV poses, available at: https://www.vision.rwth-aachen.de/fishnchips.

Abstract:
In high-speed quadcopter racing, finding a single controller that works well across different platforms remains challenging. This work presents the first neural network controller for drone racing that generalizes across physically distinct quadcopters. We demonstrate that a single network, trained with domain randomization, can robustly control various types of quadcopters. The network relies solely on the current state to directly compute motor commands. The effectiveness of this generalized controller is validated through real-world tests on two substantially different crafts (3-inch and 5-inch race quadcopters). We further compare the performance of this generalized controller with controllers specifically trained for the 3-inch and 5-inch drone, using their identified model parameters with varying levels of domain randomization (0%, 10%, 20%, 30%). While the generalized controller shows slightly slower speeds compared to the fine-tuned models, it excels in adaptability across different platforms. Our results show that no randomization fails sim-to-real transfer while increasing randomization improves robustness but reduces speed. Despite this trade-off, our findings highlight the potential of domain randomization for generalizing controllers, paving the way for universal AI controllers that can adapt to any platform.

Abstract:
Endoscope tracking is commonly utilized to provide surgeons with in-body camera poses and visual fields during invasive procedures. The fundamental aspect of endoscopic navigation lies in precisely and continuously tracing the position and orientation of the endoscope within monocular endoscopic video sequences in a preoperative data space. This work proposes a new spatially constrained and deeply learned bilateral structural intensity-depth 2D-3D registration framework for autonomously navigating a flexible endoscope. Concretely, a novel bilateral structural intensity-depth similarity function is defined to tackle the deficiency of using image intensity, while a cross-domain monocular depth estimation model trained on virtual image data is used to accurately predict real image dense depth. Additionally, a spatial constraint is introduced to precisely reinitialize an optimizer to reduce accumulative tracking errors. We validate our method on clinical data, with the experimental results showing that our method significantly outperforms current vision-based navigation methods. Particularly, the average of position and orientation errors were reduced from (4.59mm, 9.22°) to (1.65mm, 4.67°).

Abstract:
This paper introduces Model Q-II, an enhanced underactuated robotic hand designed to improve dexterous manipulation through expanded grasping modes and manipulation primitives. The Model Q-II incorporates tripod and enhanced power grasping modes, achieving increased versatility without adding additional actuators. The design employs passive mechanisms, such as lateral contact walls and a finger-locking system, to facilitate seamless transitions between modes, enabling precise pinch-to-tripod and pinch-to-power gating. These enhancements allow the hand to perform complex in-hand manipulations, including multi-directional object positioning. Theoretical analysis, simulations, and experimental evaluations validate the hand's performance, demonstrating improved grasping force, range, and manipulation capabilities. The results highlight Model Q-II's ability to handle various tasks, offering a robust, cost-effective solution for applications requiring both precise and powerful grasping.

Abstract:
Modern orchards are planted in structured rows with distinct panel divisions to improve management. Accurate and efficient joint segmentation of point cloud from Panel to Tree and Branch (P2TB) is essential for robotic operations. However, most current segmentation methods focus on single-instance segmentation and depend on a sequence of deep networks to perform joint tasks. This strategy hinders the use of hierarchical information embedded in the data, leading to both error accumulation and increased costs for annotation and computation, which limits its scalability for real-world applications. In this study, we proposed a novel approach that incorporated a Real2Sim L-TreeGen for training data generation and a joint model (J-P2TB) designed for the P2TB task. The J-P2TB model, trained on the generated simulation dataset, was used for joint segmentation of real-world panel point clouds via zero-shot learning. Compared to representative methods, our model outperformed them in most segmentation metrics while using 40% fewer learnable parameters. This Sim2Real result highlighted the efficacy of L-TreeGen in model training and the performance of J-P2TB for joint segmentation, demonstrating its strong accuracy, efficiency, and generalizability for real-world applications. These improvements would not only greatly benefit the development of robots for automated orchard operations but also advance digital twin technology, enabling the facilitation of field robotics across various domains.

Abstract:
Effective pollination is a key challenge for indoor farming, since bees struggle to navigate without the sun. While a variety of robotic system solutions have been proposed, it remains difficult to autonomously check that a flower has been sufficiently pollinated to produce high-quality fruit, which is especially critical for self-pollinating crops such as strawberries. To this end, this work proposes a novel robotic system for indoor farming. The proposed hardware combines a 7 -degree-of-freedom (DOF) manipulator arm with a custom end-effector, comprised of an endoscope camera, a 2-DOF microscope subsystem, and a custom vibrating pollination tool; this is paired with algorithms to detect and estimate the pose of strawberry flowers, navigate to each flower, pollinate using the tool, and inspect with the microscope. The key novelty is vibrating the flower from below while simultaneously inspecting with a microscope from above. Each subsystem is validated via extensive experiments.

Abstract:
Autonomous exploration of unknown space is an essential component for the deployment of mobile robots in the real world. Safe navigation is crucial for all robotics applications and requires accurate and consistent maps of the robot's surroundings. To achieve full autonomy and allow deployment in a wide variety of environments, the robot must rely on onboard state estimation which is prone to drift over time. We propose a Micro Aerial Vehicle (MAV) exploration framework based on local submaps to allow retaining global consistency by applying loop-closure corrections to the relative submap poses. To enable large-scale exploration we efficiently compute global, environment-wide frontiers from the local submap frontiers and use a sampling-based next-best-view exploration planner. Our method seamlessly supports using either a LiDAR sensor or a depth camera, making it suitable for different kinds of MAV platforms. We perform comparative evaluations in simulation against a state-of-the-art submap-based exploration framework to showcase the efficiency and reconstruction quality of our approach. Finally, we demonstrate the applicability of our method to real-world MAVs, one equipped with a LiDAR and the other with a depth camera.

Abstract:
We address the challenge of safe control in decentralized multi-agent robotic settings, where agents use uncertain black-box models to predict other agents' trajectories. We use the recently proposed conformal decision theory to adapt the restrictiveness of control barrier functions-based safety constraints based on observed prediction errors. We use these constraints to synthesize controllers that balance between the objectives of safety and task accomplishment, despite the prediction errors. We provide an upper bound on the average over time of the value of a monotonic function of the difference between the safety constraint based on the predicted trajectories and the constraint based on the ground truth ones. We validate our theory through experimental results showing the performance of our controllers when navigating a robot in the multi-agent scenes in the Stanford Drone Dataset.

Abstract:
Underwater docking enhances the operational capabilities of Autonomous Underwater Vehicles (AUVs) by facilitating energy and data transfer. Optical beacons serve as the primary guidance method for AUVs to localize and track docking stations. This paper presents VIP-Dock, a novel optical beacon tracking algorithm for robust underwater docking of AUVs. VIP-Dock addresses the challenge of maintaining accurate beacon tracking under visual interference by integrating visual, inertial, and pressure perception. Employing an unscented Kalman filter framework, the VIP-Dock algorithm provides continuous optimal estimation of beacon positions. Experimental results demonstrated VIP-Dock's real-time tracking performance in actual docking scenarios and its ability to maintain accuracy during visual input failure. Implementation in a simulation platform for an underwater vertical shuttle showed significant improvement, increasing docking success rates from 62% to 84% across 100 trials under simulated current disturbances.

Abstract:
Microspine grippers are small spines commonly found on insect legs that reinforce surface interaction by engaging with asperities to increase shear force and traction. An array of such microspines, when integrated into the limbs or undercarriage of a robot, can provide the ability to maneuver uneven terrains, traverse inclines, and even climb walls. Meanwhile, the conformability and adaptability of soft robots makes them ideal candidates for applications involving traversal of complex, unstructured terrains. However, there remains a real-life realization gap for soft locomotors pertaining to their transition from controlled lab environment to the field that can be bridged by improving grip stability through effective integration of microspines. In this research, a passive, compliant microspine stacked array design is proposed to enhance the locomotion capabilities of mobile soft robots. A microspine array integration method effectively addresses the stiffness mismatch between soft, compliant, and rigid components. Additionally, a reduction in complexity results from actuation of the surface-conformable soft limb using a single actuator. The two-row, stacked microspine array configuration offers improved gripping capabilities on steep and irregular surfaces. This design is incorporated into three different robot configurations - the baseline without microspines and two others with different combinations of microspine arrays. Field experiments are conducted on surfaces of varying surface roughness and non-uniformity - concrete, brick, compact sand, and tree roots. Experimental results demonstrate that the inclusion of microspine arrays increases planar displacement an average of 10 times. The improved grip stability, repeatability, and, terrain traversability is reflected by a decrease in the relative standard deviation of the locomotion gaits.

Abstract:
Size reconfigurable robots have been introduced for coverage applications to improve performance. The size reconfiguration ability allows a robot to access narrow areas in a smaller size while covering open spaces in a larger size, improving productivity. This paper proposes a novel CPP method consisting of an Overlapping Reduction Criterion (ORC) and a Reconfiguration Reduction Criterion (RRC) for a size-reconfigurable robot to improve performance in dynamic workspaces. A Glasius Bio-inspired Neural Network (GBNN) is adapted to guide the robot toward unvisited cells considering neural activity variation. The size variation is managed by utilizing a collection of grid maps generated for various size configurations of the robot. The RRC and ORC penalize the movements requiring size reconfigurations or creating isolated unvisited regions in the decision-making process of next movement selection yielding to reduce reconfigurations and overlapping. According to the results, the proposed CPP method surpasses state of the art in terms of performance indexes reconfiguration count, overlapping, path distance, and coverage time by significant margins.

Abstract:
Freeform modular self-reconfigurable robot (MSRR) systems overcome traditional docking limitations, enabling rapid and continuous connections between modules in any direction. Recent advancements in freeform MSRR technology have significantly enhanced connectivity and mobility. However, limitations in connector strength and operational efficiency in existing designs restrict performance. This paper proposes a rigid freeform connector and a rigid magnetic track design to improve the connection and motion performance of the SnailBot. Each SnailBot is equipped with a multi-channel rope-driven gripper, a metal spherical shell with densely distributed circular holes on the back, and a rigid chain design conforming to the spherical surface. This combination allows each SnailBot to move precisely along the surface of a peer, facilitated by the ferromagnetic spherical shell and magnetic track. The integration of the gripper and spherical shell hole array provides robust inter-module connections in any position and orientation. The effectiveness of these designs has been validated through a series of experiments and analyses, demonstrating improved connection and motion performance in the SnailBot dual-mode connector system and expanding its potential applications and functional capabilities.

Abstract:
The increasing complexity of tasks performed by hybrid aerial robotic systems, such as tail-sitters, demands a more integrated approach to their design. Traditional sequential design methods fall short because they separate the control system design from the conceptual design, limiting the poten-tial for discovering coupled solutions. This disjointed process constrains the design space, making it difficult to optimize both the control performance and system dynamics simultaneously. In response to this limitation, there has been growing interest in mission-specific dynamic design procedures, which aim to address specific operational challenges by integrating control and design early in the development process. The multi-disciplinary approach of control co-design (CCD) expands the design space by solving control and system design problems concurrently. The recently introduced DAIMYO framework demonstrated that combining multi-fidelity modelling with a nested CCD approach can tackle the situ-to-reel gap. However, DAIMYO's reliance on Bayesian optimization to account for the computational cost increase of a nested formulation limits its scalability. To address these issues, we propose KUGE, a simultaneous CCD strategy that reduces computational complexity and overcomes dimensionality restrictions through a combined effort of stochastic optimization and Gaussian processes. We validate the effectiveness of KUGE by applying it to the dynamic design of a tail-sitter, showing that it is competitive with the DAIMYO architecture while offering greater computational efficiency.

Abstract:
We address the problem of coordination and control of Connected and Automated Vehicles (CAVs) in the presence of imperfect observations in mixed traffic environment. A commonly used approach is learning-based decision-making, such as reinforcement learning (RL). However, most existing safe RL methods suffer from two limitations: (i) they assume accurate state information, and (ii) safety is generally defined over the expectation of the trajectories. It remains challenging to design optimal coordination between multi-agents while ensuring hard safety constraints under system state uncertainties (e.g., those that arise from noisy sensor measurements, communication, or state estimation methods) at every time step. We propose a safety guaranteed hierarchical coordination and control scheme called Safe-RMM to address the challenge. Specifically, the high-level coordination policy of CAVs in mixed traffic environment is trained by the Robust Multi-Agent Proximal Policy Optimization (RMAPPO) method. Though trained without uncertainty, our method leverages a worst-case Q network to ensure the model's robust performances when state uncertainties are present during testing. The low-level controller is implemented using model predictive control (MPC) with robust Control Barrier Functions (CBFs) to guarantee safety through their forward invariance property. We compare our method with baselines in different road networks in the CARLA simulator. Results show that our method provides the best evaluated safety and efficiency in challenging mixed traffic environments with uncertainties.

Abstract:
Path planning with strong environmental adaptability plays a crucial role in robotic navigation in unstructured outdoor environments, especially in the case of low-quality location and map information. The path planning ability of a robot depends on the identification of the traversability of global and local ground areas. In real-world scenarios, the complexity of outdoor open environments makes it difficult for robots to identify the traversability of ground areas that lack a clearly defined structure. Moreover, most existing methods have rarely analyzed the integration of local and global traversability identifications in unstructured outdoor scenarios. To address this problem, we propose a novel method, Dual-BEV Nav, first introducing Bird's Eye View (BEV) representations into local planning to generate high-quality traversable paths. Then, these paths are projected into the global traversability probability map generated by the global BEV planning model to obtain the optimal path. By integrating the traversability from both local and global BEV, we establish a dual-layer BEV heuristic planning paradigm, enabling long-distance navigation in unstructured outdoor environments. We test our approach through both public dataset evaluations and real-world robot deployments, yielding promising results. Compared to baselines, the Dual-BEV Nav improved temporal distance prediction accuracy by up to 18.26%. In the real-world deployment, under conditions significantly different from the training set and with notable occlusions in the global BEV, the Dual-BEV Nav successfully achieved a 65-meter-long outdoor navigation. Further analysis demonstrates that the local BEV representation significantly enhances the rationality of the planning, while the global BEV probability map ensures the robustness of the overall planning.

Abstract:
Previous studies on robotic manipulation are based on a limited understanding of the underlying 3D motion constraints and affordances. To address these challenges, we propose a comprehensive paradigm, termed UniAff, that integrates 3D object-centric manipulation and task understanding in a unified formulation. Specifically, we constructed a dataset labeled with manipulation-related key attributes, comprising 900 articulated objects from 19 categories and 600 tools from 12 categories. Furthermore, we leverage MLLMs to infer object-centric representations for manipulation tasks, including affordance recognition and reasoning about 3D motion constraints. Comprehensive experiments in both simulation and real-world settings indicate that UniAff significantly improves the generalization of robotic manipulation for tools and articulated objects. We hope that UniAff will serve as a general baseline for unified robotic manipulation tasks in the future. Images, videos, dataset and code are published on the project website at:https://sites.google.com/view/uni-aff/home.

Abstract:
Predictive maintenance is a key aspect for the safety of critical infrastructure such as bridges, dams, and tunnels, where a failure can lead to catastrophic outcomes in terms of human lives and costs. The surge in Artificial Intelligence-driven visual robotic inspection methods necessitates high-quality datasets containing diverse defect classes with several instances on different conditions (e.g., material, illumination). In this context, we introduce a Controllable Object Inpainting Generative Adversarial Network (COIGAN) to synthetically generate realistic images that augment defect datasets. The effectiveness of the model is quantitatively validated by a Fréchet Inception Distance, which measures the similarity between the generated and training samples. To further evaluate the impact of COIGAN-generated images, a segmentation task was conducted, utilizing key performance metrics such as segmentation accuracy, mAP, mIoU, and F1 score, demonstrating that the synthetic images integrate seamlessly and produce results comparable to real defect images. Subsequently, COIGAN generability was successfully used for the segmentation of a defect-free dataset by inpainting defects. The results showcase COIGAN's ability to learn defect patterns and apply them in new contexts, preserving the original features of the base image and allowing the creation of new datasets with a desired multi-class distribution. Specifically, in the context of predictive maintenance, COIGAN enriches datasets, enabling deep learning models to more effectively identify potential infrastructure anomalies. Project page: https://bit.ly/4bzxwqf.

Abstract:
Controlled microrobots in fluidic environments hold promise for precise drug delivery and cell manipulation, opening new ways for personalized healthcare. However, coordinating magnetic microrobot swarms presents significant challenges due to the complexity of the associated actuation mechanisms. While existing methods to achieve motion differentiation in collections of microrobots rely on design variations among them, the work reported here applies to homogeneous collectives and enables them to be steered as a whole or in fragments, by means of a common externally generated force field. This paper contributes to an emerging set of methods that enable swarm control through manipulation of these force fields. This paper in particular exploits the nature of force field equilibria in a quadrupole workspace configuration as a means of steering the swarm while maintaining its cohesion. The approach also enables splitting the swarm in two subgroups in order to direct each simultaneously to a different location.

Abstract:
We present an end-to-end framework for planning tight assembly operations, where the input is a set of digital models, and the output is a full execution plan for a physical robotic arm, including the trajectory placement and the grasping. The framework builds on our earlier results on tight assembly plan-ning for free-flying objects and includes the following novel components: (i) the framework itself together with physical demon-strations, (ii) trajectory placement based on novel dynamic path-wise IK and (iii) post processing of the free-flying paths to relax the tightness and smooth the path. The framework provides guarantees as to the quality of the outcome trajectory. For each component we provide the algorithmic details and a full open-source software package for reproducing the process. Lastly, we demonstrate the framework with tight and challenging assembly problems (as well as puzzles, which are planned to be hard to assemble), using a UR5e robotic arm in the real world and in simulation. See the figure at the top for a physical UR5e assembling the alpha-z puzzle (known to be considerably more complicated to assemble than the celebrated alpha puzzle). Full video clips of all the assembly demonstrations together with our open source software are available at our project page: https://tau-cgl.github.io/Full-Cycle-Assembly-Operation/

Abstract:
This paper presents a wearable soft sensing band with stretchable sensors to monitorcle activity by estimating muscle volume changes. Unlike conventional surface electromyography (sEMG) sensing techniques, which require excessive pressure or adhesive electrodes, the proposed sensing method allows muscle volume variations to be detected simply by placing the device on the skin without additional pressure or adhesives. The band was evaluated in isometric-static and isometric-varying torque estimation tasks, demonstrating superior accuracy to sEMG, with a relative torque to maximum torque estimation error of less than 11.5%. In isometric-varying conditions, relative torque was estimated with an average error of 10.1% at frequencies of 0.1 Hz, 0.2 Hz and 0.5 Hz. Furthermore, the band achieved a classification accuracy of 92.9% in recognizing ten distinct hand gestures, highlighting its capability to differentiate between multiple muscle activations. The lightweight and flexible design addresses limitations of sEMG, such as signal noise, skin irritation, and complex calibration. Experimental results validate the potential of the proposed sensing method for applications in muscle activity monitoring across healthcare, rehabilitation, and sports, and it also offers potential for use in robot teaching for reference motion generation.

Abstract:
Recent advancements in large language models (LLMs) have significantly enhanced machines' ability to understand and follow human instructions. In many tasks, LLMs have demonstrated performance that rivals human-level common sense. However, directly applying LLMs to domain-specific use cases, such as robotic pick-and-place, remains a challenge. Tasks that are intuitive for humans, who rely on prior knowledge and skills, become complex for robots. Industrial robotic applications like pick-and-place require a high degree of accuracy, often exceeding 90 %. In response to these challenges in domain-specific applications, we propose IntelliRMS, a novel system-oriented architecture for instruction-following robotic manipulation. The IntelliRMS synergizes the linguistic and open-vocabulary visual capabilities of foundational models to arrive at an accurate, robust and scalable system. Further, we demonstrate the effectiveness of IntelliRMS in a real-world industrial Bin-picking scenario within the retail sector, validating its performance with a comprehensive dataset.

Abstract:
Point cloud recognition models are known to be vulnerable to adversarial attacks. The state-of-the-art defense solutions either focus on partial features of the point cloud, limiting their effectiveness, or rely heavily on known adversarial examples, reducing their generalizability, while others, like point cloud reconstruction, will degrade the classifier's accuracy on clean examples. To address this, we introduce SynerGuard, a novel robust point cloud classification framework mitigating adversarial attacks by considering comprehensive geometric and topological attributes of the point cloud, without relying on known adversarial examples while attaining classification accuracies on clean examples. We comprehensively test SynerGuard against seven attack types from three leading adversarial attack approaches on two widely used datasets, ModelNet40 and ShapeNetPart. The results demonstrate SynERGUARD's superiority against existing defenses in mitigating adversarial attacks, as well as managing clean examples.

Abstract:
Palm-sized autonomous nano-drones, i.e., sub-50 g in weight, recently entered the drone racing scenario, where they are tasked to avoid obstacles and navigate as fast as possible through gates. However, in contrast with their bigger counterparts, i.e., kg-scale drones, nano-drones expose three orders of magnitude less onboard memory and compute power, demanding more efficient and lightweight vision-based pipelines to win the race. This work presents a map-free vision-based (using only a monocular camera) autonomous nano-drone that combines a real-time deep learning gate detection front-end with a classic yet elegant and effective visual servoing control back-end, only relying on onboard resources. Starting from two state-of-the-art tiny deep learning models, we adapt them for our specific task, and after a mixed simulator-real-world training, we integrate and deploy them aboard our nano-drone. Our best-performing pipeline costs of only 24 M multiply-accumulate operations per frame, resulting in a closed-loop control performance of 30 Hz, while achieving a gate detection root mean square error of 1.4 pixels, on our ~20 k real-world image dataset. In-field experiments highlight the capability of our nano-drone to successfully navigate through 15 gates in 4 min, never crashing and covering a total travel distance of ~ 100 m, with a peak flight speed of 1.9 m/s. Finally, to stress the generalization capability of our system, we also test it in a never-seen-before environment, where it navigates through gates for more than 4 min.

Abstract:
Compared to conventional decomposition methods that use ellipses or polygons to represent free space, starshaped representation can better capture the natural distribution of sensor data, thereby exploiting a larger portion of traversable space. This paper introduces a novel motion planning and control framework for navigating robots in unknown and cluttered environments using a dynamically constructed starshaped roadmap. Our approach generates a starshaped representation of the surrounding free space from real-time sensor data using piece-wise polynomials. Additionally, an incremental roadmap maintaining the connectivity information is constructed, and a searching algorithm efficiently selects short-term goals on this roadmap. Importantly, this framework addresses dead-end situations with a graph updating mechanism. To ensure safe and efficient movement within the starshaped roadmap, we propose a reactive controller based on Dynamic System Modulation (DSM). This controller facilitates smooth motion within starshaped regions and their intersections, avoiding conservative and short-sighted behaviors and allowing the system to handle intricate obstacle configurations in unknown and cluttered environments. Comprehensive evaluations in both simulations and real-world experiments show that the proposed method achieves higher success rates and reduced travel times compared to other methods. It effectively manages intricate obstacle configurations, avoiding conservative and myopic behaviors. The source code will be released on website11Available at: github.com/kkkkkaiai/starshaped_roadmap.

Abstract:
Steerable needles are novel medical devices capa-ble of following curved paths through tissue, enabling them to avoid anatomical obstacles and steer to hard-to-reach sites in tissue, including targets in the lung for lung cancer diagnosis. Steerable needles are typically deployed into tissue from an insertion surface, and selecting the insertion site is critical for procedure success as it determines which paths the needle can take to its target. Prior motion planners for steerable needles typically only plan from a specific start pose to the target. We introduce a new resolution-optimal steerable needle motion planner that efficiently finds plans from an insertion surface to a target position, handling additional degrees of freedom at both the start and the target. Our algorithm systematically builds a search tree consisting of needle motion primitives backward from the target towards the insertion surface, which allows it to provide an optimality guarantee up to the resolution of the primitives. The algorithm finds higher-quality plans faster than prior state-of-the-art motion planners, as demonstrated in anatomical scenario simulations in the lung.

Abstract:
To address complex mission tasks, multirotors benefit from in-flight reconfiguration that enhances their morphological adaptability. This paper presents the Center-Driven Scissor Extendable Airframe (CDSEA), a novel one-degree-of-freedom (DOF) morphing airframe designed to replace traditional fixed-size airframes. The CDSEA allows a quadrotor to achieve significant morphological changes during flight, with rotors deploying radially from a central point. This capability facilitates substantial variations in footprint radius and ensures smooth transitions. The paper details the mechanical design, as well as kinematic and dynamic analyses, and discusses the actuator selection strategy for the CDSEA. Experimental results with a prototype demonstrate that the CDSEA achieves a footprint-radius deformation ratio of 2.5 and a morphing time of 0.3 seconds, surpassing existing solutions. Additionally, the design improves obstacle avoidance and wind resistance. These results underscore the CDSEA's potential as an advanced solution for enhancing UAV adaptive navigation performance in complex environments.

Abstract:
Robot navigation methods allow mobile robots to operate in applications such as warehouses or hospitals. While the environment in which the robot operates imposes requirements on its navigation behavior, most existing methods do not allow the end-user to configure the robot's behavior and priorities, possibly leading to undesirable behavior (e.g., fast driving in a hospital). We propose a novel approach to adapt robot motion behavior based on natural language instructions provided by the end-user. Our zero-shot method uses an existing Visual Language Model to interpret a user text query or an image of the environment. This information is used to generate the cost function and reconfigure the parameters of a Model Predictive Controller, translating the user's instruction to the robot's motion behavior. This allows our method to safely and effectively navigate in dynamic and challenging environments. We extensively evaluate our method's individual components and demonstrate the effectiveness of our method on a ground robot in simulation and real-world experiments, and across a variety of environments and user specifications.

Abstract:
Distributed multi-robot coordination is critical to achieving reliable robotic missions that exploit the collective capability of swarm robots. In particular, the consensus and formation control problems have been extensively studied, resulting in distributed controllers that enable robots to rely only on information from themselves and their immediate neighbors. However, these algorithms are usually designed for specific objectives (e.g., cooperative object transportation, environmental coverage, etc.), requiring the controllers to be re-designed for domain variations. Therefore, we propose a new parametric framework inspired by gravitational fields that allow simultaneous coordination of robots at multiple levels, enabling generalization and domain adaptation. Our approach is built on top of a connectivity-preserving formation controller, with need-based and task-based ad hoc coordination at private, local, and global layers of a swarm robot team. We demonstrate the remarkable potential of our framework through extensive simulations and real-world swarm robot experiments in three representative multi-robot tasks involving tight coordination: 1) robot-initiated rendezvous at different coordination layers, 2) coordinated boundary tracking and coverage of environmental processes, and 3) accommodating task executions and motion control while satisfying the coordination laws.

Abstract:
As robots are prone to make errors that undermine trust, effective trust repair strategies are essential in effective human-robot collaboration. Our lab study evaluates three trust repair strategies -apology, denial, and compensation- following two types of trust violations: competence-based and integritybased. Consistent with prior research, integrity-based violations reduced moral trust more, while competence-based violations impacted performance trust. Denial caused greater discomfort than apology or compensation across both violation types. Dispositional trust influenced repair strategies effectiveness, particularly in willingness to engage and re-engage. Notably, individuals with high dispositional trust were more receptive to apologies. These findings underscore the need to consider individual trust differences, suggesting robots should assess human trust disposition to effectively foster continued collaboration.

Abstract:
Accurately estimating the 6-DoF pose of objects is a fundamental challenge in computer vision and robotics. While category-level pose estimation based on RGBD data has achieved good performance in recent years, estimating poses solely from RGB images remains a significant challenge. Existing RGB-based category-level methods primarily focus on recovering object point clouds from RGB images, and pose prediction is not performed end-to-end by a network. This paper presents a Category-level and Instance-level Pose Estimation Network (CIPE), which models pose estimation as a set prediction problem and enables direct pose regression from RGB images. To further enhance the network's ability to learn object poses, first, a novel learnable rotation representation that redefines rotation learning within Euclidean space is introduced to facilitate rotation regression. Additionally, we propose a prior-query fusion strategy that utilizes a pre-trained point cloud feature extraction network to integrate categorical object features with bounding boxes, thereby improving the incorporation of category information. Experimental results demonstrate that CIPE significantly outperforms existing RGB-based methods on both category-level and instance-level datasets. The code is available at https://github.com/jialeren/CIPE.

Abstract:
In this paper, we introduce a kinodynamic model predictive control (MPC) framework that exploits unidirectional parallel springs (UPS) to improve the energy efficiency of dynamic legged robots. The proposed method employs a hierarchical control structure, where the solution of MPC with simplified dynamic models is used to warm-start the kinody-namic MPC, which accounts for nonlinear centroidal dynamics and kinematic constraints. The proposed approach enables energy efficient dynamic hopping on legged robots by using UPS to reduce peak motor torques and energy consumption during stance phases. Simulation results demonstrated a 38.8% reduction in the cost of transport (CoT) for a monoped robot equipped with UPS during high-speed hopping. Additionally, preliminary hardware experiments show a 14.8% reduction in energy consumption.

Abstract:
Robust perception of the environment is a critical challenge for robots, especially those that use mobile platforms or humanoid forms to perform manipulation tasks. Active vision, leveraging strategic camera movements and adaptive imaging parameters, holds great potential for addressing critical challenges such as achieving high accuracy in precise manipulation, ensuring low latency for rapid responsiveness, and overcoming occlusions and illumination variations in dynamic environments. This paper introduces a novel, cost-effective, and easily deployable active vision system designed to enhance visual perception for robotic applications. Integrated with a novel hybrid software setup, the system utilizes ArUco markers to achieve high-accuracy, low-latency performance, boasting sub-millimeter and sub-degree accuracy at 200 Hz with a latency of less than 15 ms. Additionally, a new measurement and evaluation procedure is presented, offering benchmarking for marker-based object detection systems that for the first time includes rotation measurements as well. The benchmarking results for the proposed system indicate that achieving the desired performance levels necessitates specialized active vision measurement strategies. For instance, to ensure high positional accuracy, the system needs precise object centering, while high rotational accuracy requires accounting for lateral or rotational offsets.

Abstract:
Accurate 3D pose estimation of grasped objects is an important prerequisite for robots to perform assembly or in-hand manipulation tasks, but object occlusion by the robot's own hand greatly increases the difficulty of this perceptual task. Here, we propose that combining visual information and proprioception with binary, low-resolution tactile contact measurements from across the interior surface of an articulated robotic hand can mitigate this issue. The visuo-tactile object-pose-estimation problem is formulated probabilistically in a factor graph. The pose of the object is optimized to align with the three kinds of measurements using a robust cost function to reduce the influence of visual or tactile outlier readings. The advantages of the proposed approach are first demonstrated in simulation: a custom 15-DoF robot hand with one binary tactile sensor per link grasps 17 YCB objects while observed by an RGB-D camera. This low-resolution inhand tactile sensing significantly improves object-pose estimates under high occlusion and also high visual noise. We also show these benefits through grasping tests with a preliminary real version of our tactile hand, obtaining reasonable visuo-tactile estimates of object pose at approximately 13.3 Hz on average.

Abstract:
Accurate 6D object pose estimation is crucial in industrial automation, particularly in robotic bin picking, where objects are often textureless, reflective, and arranged in cluttered environments. Multi-view pose estimation methods offer significant advantages over single-view methods by providing more comprehensive information, effectively handling occlusions and lack of features, and resolving depth ambiguities. However, current multi-view methods often rely on late-stage information fusion, limiting their ability to fully exploit complementary multi-view data. This paper presents a novel approach to enhance multiview 6D pose estimation by introducing a Feature Exchange Transformer (FET) for early-stage feature fusion. This approach leverages self-attention and epipolar cross-attention mechanisms to enable multi-layer feature aggregation across views. Additionally, we introduce a coarse-to-fine strategy for an efficient feature exchange at multiple network layers. Our method, implemented on top of EpiSurfEmb[1], enhances the utilization of multi-view information, leading to significant improvements in pose estimation accuracy and robustness, especially in challenging bin-picking scenarios. We evaluate our approach on the ROBI dataset, demonstrating that it outperforms both the baseline EpiSurfEmb and other state-of-the-art multi-view pose estimation methods.

Abstract:
We introduce a novel framework for assistive urban navigation for individuals with low vision. Utilizing a smart glasses platform developed by Biel Glasses, which provide a continuous stream of stereo images and GPS fixes, we generate an Event Map based on key semantic elements extracted by carefully prompted visual question-answering (VQA) models. For individuals with blurry or reduced fields of vision (low vision), traversing city streets poses a variety of challenges; they may struggle to perceive construction work, potholes, crowded sidewalks, and other ambiguous obstacles obstructing their paths. Some tasks, such as distinguishing traffic light signals, are nigh impossible without assistance from a companion or city infrastructure aimed towards accessibility. Although the majority of these problems may be solved with individually tailored traditional computer vision algorithms, developing and running a suite of these algorithms is challenging and resource demanding. Therefore, our proposed solution capitalizes on a single underlying implementation that need only be extended by adding queries. We validate our approach using a custom dataset of over 1,300 annotated images from various locations around Barcelona, reporting performance across different urban navigation tasks. We demonstrate the performance of the end to end system on a run of data collected by the Biel Glasses platform.

Abstract:
This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model, functioning as an adapter for GPT-40. This adapter model internalizes diverse forms of user reminders-such as personalized preferences, corrective guidance, and contextual assistance-into structured instruction-formatted cues that prompt GPT-40 in generating customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects task-relevant historical successes as prompts for GPT-40, further enhancing task planning accuracy. To validate the effectiveness of AlignBot, experiments are conducted in real-world household environments, which are constructed within the laboratory to replicate typical household settings. A multimodal dataset with over 1,500 entries derived from volunteer reminders is used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders, achieving 86.8 % success rate compared to the vanilla GPT-40 baseline at 21.6%, reflecting a 65% improvement and over four times greater effectiveness. Supplementary materials are available at: https://yding25.com/AlignBot/

Abstract:
This paper proposes an optimization-based task and motion planning framework, named “Logic Network Flow”, to integrate signal temporal logic (STL) specifications into efficient mixed-binary linear programmings. In this framework, temporal predicates are encoded as polyhedron constraints on each edge of the network flow, instead of as constraints between the nodes as in the traditional Logic Tree formulation. Synthesized with Dynamic Network Flows, Logic Network Flows render a tighter convex relaxation compared to Logic Trees derived from these STL specifications. Our formulation is evaluated on several multi-robot motion planning case studies. Empirical results demonstrate that our formulation outperforms Logic Tree formulation in terms of computation time for several planning problems. As the problem size scales up, our method still discovers better lower and upper bounds by exploring fewer number of nodes during the branch-and-bound process, although this comes at the cost of increased computational load for each node when exploring branches.

Abstract:
Multi-agent path finding (MAPF) in dynamic and complex environments is a highly challenging task. Recent research has focused on the scalability of agent numbers or the complexity of the environment. Usually, they disregard the agents' physical constraints or use a differential-driven model. However, this approach fails to adequately capture the kinematic and dynamic constraints of real-world vehicles, particularly those equipped with Ackermann steering. This paper presents a novel algorithm named MARF that combines multi-agent reinforcement learning (MARL) with a Frenet lattice planner. The MARL foundation endows the algorithm with enhanced generalization capabilities while preserving computational efficiency. By incorporating Frenet lattice trajectories into the action space of the MARL framework, agents are capable of generating smooth and feasible trajectories that respect the kinematic and dynamic constraints. In addition, we adopt a centralized training and decentralized execution (CTDE) framework, where a network of shared value functions enables efficient cooperation among agents during decision-making. Simulation results and real-world experiments in different scenarios demonstrate that our method achieves superior performance in terms of success rate, average speed, extra distance of trajectory, and computing time.

Abstract:
Robotic manipulation in cluttered environments presents significant challenges, particularly when the clutter includes thin, deformable objects like cables, which complicate perception and decision-making processes. In the context of datacenters, the automation of networking tasks often involves the manipulation of optical transceivers within densely packed cable configurations. Such environments are characterized by an abundance of delicate, overlapping, and intersecting cables, leading to frequent occlusions. This paper introduces an innovative system designed for the manipulation of optical transceivers in environments cluttered by cables. Our integrated approach combines advanced 3D scene understanding with a heuristic-based pushing policy to effectively manipulate optical transceivers amidst clutter. The system's perception component utilizes image segmentation and 3D reconstruction to accurately model the transceivers and surrounding cables. Meanwhile, the planning aspect employs a search algorithm with task-specific heuristics, to navigate the gripper, displace obstructing cables, and safely achieve a precise pre-grasp position in front of the target transceiver. We have conducted extensive evaluations of our methodology in both simulated and real-world settings, demonstrating its high success rates, robustness, and proficiency in addressing the unique challenges posed by cable-occluded environments within datacenters.

Abstract:
Reinforcement Learning (RL) has shown promise in robotic navigation tasks, yet applying it to real-world environments remains challenging due to dynamic complexities and the need for dynamically feasible actions. We propose a hierarchical control framework based on Spiking Deep Reinforcement Learning (SDRL) for robust robot navigation in real environments. Our approach utilizes a two-layer architecture: a high-level decision layer powered by a Spiking GRU network for handling partially observable environments, and a low-level executive layer employing Continuous Attractor Neural Networks (CANNs) to ensure precise and continuous actions. This hierarchical structure allows real-time decisionmaking that respects the physical constraints of the robot. Experimental results show that our method adapts effectively to new environments without fine-tuning and surpasses existing methods in performance. We also explore the implementation on the Darwin3 chip, paving the way for biologically inspired motion control in future robotic applications.

Abstract:
In-space assembly is a key capability to enable construction of large-scale structures required for sustained human presence in space. Robotic assembly is critical to reduce required crew time and risk, while modularity ensures that solutions are versatile and adaptive to complex mission concepts. NASA's Automated Reconfigurable Mission Adaptive Digital Assembly Systems (ARMADAS) project demonstrated that robots with relatively low cost, size, and degrees-of-freedom (DoFs) can be used for large-scale modular lattice structure assembly. This is possible by using the structural modules for robotic systems metrology and error mitigation. Robots with reduced complexity may lead to advantages in initial and maintenance cost, offering an alternative to large, complex, and expensive robots. In this paper, we describe the Structure Omni-directional Foldable Explorer (SOF-E), a robot with significantly lower mass and DoF compared to the previous ARMADAS robot architecture. Although SOF-E is a five DoF robot with only two or three control states per actuator, it is capable of transporting and placing structural modules by collaborating with other instances of itself. We discuss the mechanical design and architecture of SOF-E, including analysis of energy usage during each operation. Experiments demonstrate that during locomotion and module transport tasks, SOF-E requires significantly lower energy than the previous cargo transport robot architecture, the Scaling Omni-directional Lattice Locomoting Explorer (SOLL-E). The cost of transport metric is used to compare the energy efficiency of the operation.

Abstract:
Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications due to their high computational burden and complex deployment. To address this issue, this paper proposes a light-weight framework called Decoupled OSOD (DOSOD), which is a practical and highly efficient solution to support real-time OSOD tasks in robotic systems. Specifically, DOSOD builds upon the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor is developed to transform text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of classagnostic proposals. Cross-modality features are directly aligned in the joint space, avoiding the complex feature interactions and thereby improving computational efficiency. DOSOD operates like a traditional closed-set detector during the testing phase, effectively bridging the gap between closed-set and openset detection. Compared to the baseline YOLO-World, the proposed DOSOD significantly enhances real-time performance while maintaining comparable accuracy. The slight DOSODS model achieves a Fixed AP of 26.7 %, compared to 26.2 % for YOLO-World-v1-S and 22.7 % for YOLO-World-v2-S, using similar backbones on the LVIS minival dataset. Meanwhile, the FPS of DOSOD-S is 57.1 % higher than YOLO-World-v1S and 29.6 % higher than YOLO-World-v2-S. Meanwhile, we demonstrate that the DOSOD model facilitates the deployment of edge devices. The codes and models are publicly available at https://github.com/D-Robotics-AI-Lab/DOSOD.

Abstract:
As robots become increasingly capable, users will want to describe high-level missions and have robots infer the relevant details. Because pre-built maps are difficult to obtain in many realistic settings, accomplishing such missions will require the robot to map and plan online. While many semantic planning methods operate online, they are typically designed for well specified missions such as object search or exploration. Recently, Large Language Models (LLMs) have demonstrated powerful contextual reasoning abilities over a range of robotic tasks described in natural language. However, existing LLM-enabled planners typically do not consider online planning or complex missions; rather, relevant subtasks and semantics are provided by a pre-built map or a user. We address these limitations via SPINE, an online planner for missions with incomplete mission specifications provided in natural language. The planner uses an LLM to reason about subtasks implied by the mission specification and then realizes these subtasks in a receding horizon framework. Tasks are automatically validated for safety and refined online with new map observations. We evaluate SPINE in simulation and real-world settings with missions that require multiple steps of semantic reasoning and exploration in cluttered outdoor environments of over 20,000m2. Compared to baselines that use existing LLM-enabled planning approaches, our method is over twice as efficient in terms of time and distance, requires less user interactions, and does not require a full map. Additional resources are provided at https://zacravichandran.github.io/SPINE.

Abstract:
Robot-assisted task-oriented training demonstrates immense potential in rehabilitation area. Parallel robots, with advantages such as low inertia and high stiffness, facilitate precise haptic feedback, yet their application in rehabilitation is limited by workspace constraints. To this end, we propose a design scheme for a haptic robot based on a reconfigurable asymmetric parallel mechanism. We first introduce a two-stage multi-objective optimization method to obtain the optimal parameter configurations. Then, to achieve precise assembling of the reconfigurable mechanism in each configuration, corresponding positioning mechanisms are designed. System performance tests validate the robot's capabilities under different configurations: workspace meets design requirements, stiffness output reaches 30 N/mm, force output is 40 N, RMS of maximum back-driven force along x, y, and z axes is 7.5 N, and RMS of maximum back-driven torque around x and y axes is 567.4 N. mm. Target tracking and virtual channel trajectory tracking experiments demonstrate the system's haptic rendering ability for gross motor tasks (GMTs) and fine motor tasks (FMTs), respectively. The developed 6-DOF haptic robot holds promise for versatile task-oriented rehabilitation training.

Affiliations: Department of Automation, Key Laboratory of System Control and Information Processing of Ministry of Education, Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Engineering Research Center of Intelligent Control and Management, Shanghai Jiao Tong University, Shanghai, China; Dimanshen Technology Co., Ltd. specializes in D SLAM and robotic vision fusion technology, offering all-terrain intelligent robotic solutions for smart security and smart campus applications.; Engineering Research Center of Intelligent Control for Underground Space, Ministry of Education, School of Information and Control Engineering, Advanced Robotics Research Center, China University of Mining and Technology, Xuzhou, China

Abstract:
Dynamic scene reconstruction for autonomous driving enables vehicles to perceive and interpret complex scene changes more precisely. Dynamic Neural Radiance Fields (NeRFs) have recently shown promising capability in scene modeling. However, many existing methods rely heavily on accurate poses inputs and multi-sensor data, leading to increased system complexity. To address this, we propose FreeDriveRF, which reconstructs dynamic driving scenes using only sequential RGB images without requiring poses inputs. We innovatively decouple dynamic and static parts at the early sampling level using semantic supervision, mitigating image blurring and artifacts. To overcome the challenges posed by object motion and occlusion in monocular camera, we introduce a warped ray-guided dynamic object rendering consistency loss, utilizing optical flow to better constrain the dynamic modeling process. Additionally, we incorporate estimated dynamic flow to constrain the pose optimization process, improving the stability and accuracy of unbounded scene reconstruction. Extensive experiments conducted on the KITTI and Waymo datasets demonstrate the superior performance of our method in dynamic scene modeling for autonomous driving. Our implementation will be available at https://github.com/IRMVLab/FreeDriveRF.

Abstract:
Continuum robots are well-suited for constrained environments due to their superior flexibility and structural compliance. However, relying solely on passive compliance may lead to damage to both the robot and the surrounding environment. This work proposes a finite-time Cartesian impedance control scheme for tendon-driven continuum manipulators (TDCMs), where a second-order low-pass filter is used to adjust the reference trajectory according to the external robot tip force. The controller is designed using the command filtered backstepping method, and the finite-time stability is established by the designed Lyapunov function. In TDCM systems, the tendons operate antagonistically, and actuators often fail to quickly reach the desired tendon tension, leading to partial failures. To address this, we propose an actuator fault compensation algorithm to enhance system performance and reliability. We conducted trajectory tracking experiments on a multi-segment TDCM prototype, the results demonstrate that the designed Cartesian impedance controller achieves effective compliance control effect and high position control accuracy.

Abstract:
Robotic grasping is facing a variety of real-world uncertainties caused by non-static object states, unknown object properties, and cluttered object arrangements. The difficulty of grasping increases with the presence of more uncertainties, where commonly used learning-based approaches struggle to perform consistently across varying conditions. In this study, we integrate the idea of similarity matching to tackle the challenge of grasping novel objects that are simultaneously in motion and densely cluttered using a single RGBD camera, where multiple uncertainties coexist. We achieve this by shifting visual detection from global to local states and operating grasp planning from static to dynamic scenes. Notably, we introduce optimization methods to enhance planning efficiency for this time-sensitive task. Our proposed system can adapt to various object types, arrangements and movement speeds without the need for extensive training, as demonstrated by real-world experiments.

Abstract:
Robotic manipulation tasks commonly rely on computer vision or tactile sensing to extract the physical characteristics of an object. However, this additional sensing capability adds complexity and financial cost to a robotic system. Our work investigates the inexpensive alternative of feature extraction via proprioceptive sensing. Our goal is to determine whether proprioceptive data combined with in-hand-manipulation provides sufficient information to enable geometric reconstruction of object profiles. We use a newly designed 3-DOF robotic gripper with variable-friction finger surfaces to perform model-free in-hand-manipulation on a set of test objects comprised of two dimensional convex prisms. We have devised a manipulation sequence based on the rotation and sliding of test objects to allow side-counting with the successful measurement of shapes and sizes with average angle and size errors of 1.64% and 6.76% respectively. In addition, we have outlined potential research directions aimed at resolving inherent limitations of proprioceptive approaches and making our algorithm generalisable to any arbitrary shape.

Abstract:
We present a novel 3D reconstruction-based SLAM (Simultaneous Localization and Mapping) approach for robots that leverage multimodal sensory input data, including a camera and a 2D lidar. By integrating these inputs with the gaussian splatting technique, our method significantly enhances performance over traditional SLAM approaches. Traditional SLAM techniques often struggle with the limitations of monocular vision and fail to accurately map and locate objects in dynamic and cluttered environments. Purely relying on camera to localize the robot and map creation is challenging in the presence of dynamic obstacles in the scene. To address this, we proposed a multimodal sensor fusion-based 3D reconstruction. Our approach employs lidar-based localization to achieve precise positioning of both the camera and the robot, while utilizing the gaussian splatting technique for robust environmental mapping and 3D reconstruction. This approach is robust to dynamic obstacles in the scene. We have conducted extensive experiments in various real-world and simulated environments, demonstrating that our method not only outperforms traditional monocular SLAM approaches but also achieves higher accuracy in terms of localization and constructed map. Our results demonstrate substantial improvements in 3D reconstruction for mobile robots, achieving reduced computational load, higher FPS and enhanced scaling accuracy.

Abstract:
Robotic fish hold significant promise as efficient underwater systems, yet their inability to accurately perceive ambient flow hinders their deployment in real-world scenarios. Inspired by the natural lateral line system(LLS), a flow-responsive organ in fish that plays a crucial role in behaviors such as rheotaxis, this paper introduces the first Artificial Lateral Line System (ALLS)-based ambient flow classifier for robotic fish that allows robotic fish to perceive flow fields while swimming freely. To be specific, using just 5 pressure sensors and 3.5 minutes of swimming data, we trained a Long Short-Term Memory (LSTM) network, achieving a classification accuracy of 81.25% across 8 flow speed categories, ranging from 0.08 m/s to 0.18 m/s. A key innovation of this work is the formulation of ambient flow perception as a classification task, which not only enables the robotic fish to extract meaningful information but also enhances the robustness and generalizability of the perception framework. Extensive experiments further identify critical factors such as affecting the effectiveness of the ambient flow classifier, offering valuable insights for future development.

Abstract:
We present a control framework specifically for physical human-humanoid collaboration involving the transportation and manipulation of heavy objects. Using this framework, the humanoid can exhibit desired levels of compliance with the object to be co-transported. This desired compliance is achieved through an admittance model. A Model Predictive Control (MPC) problem, based on a novel Interaction Linear Inverted Pendulum (I-LIP) model, generates footstep patterns that facilitate this desired compliant behavior while keeping the robot stable. Subsequently, we have an object-informed low-level quadratic program (QP) that sends control input to realize the high-level plans on the robot. The stiffness parameters of the I-LIP are modulated in real time for better compliance tracking performance of the robot. We verify all the results through simulation on the humanoid platform, the Digit, showing the prowess of the framework in collaboratively transporting heavy objects with a human.

Abstract:
Efficiently searching for multiple objects in a partially known environment, where only the names and locations of landmarks are available, presents significant challenges. Existing search algorithms in the literature fail to fully utilize prior knowledge to improve search efficiency, and exhibit significantly diminished efficiency when extended to multiobject search. To address these limitations, we propose an inference-based multi-object reactive search framework. This framework utilizes the COMET inference model to reason about co-occurrence values between the target objects and known landmarks, thereby enhancing search efficiency. These co-occurrence values are integrated into a reactive temporal logic motion planning strategy, which allows the robot search for multiple objects with temporal logic constraints specified by LTL and adapt dynamically if the inferred reasoning differs from the actual object arrangement encountered during the search. Extensive simulations were conducted to evaluate the feasibility and efficiency of the proposed motion planning algorithm. Results demonstrate that the integration of commonsense reasoning with reactive temporal logic planning significantly improves multi-object search efficiency. Project website: https://sites.google.com/view/imors.

Abstract:
For successful peg-in-hole assembly, predefined sub-tasks should be executed sequentially according to the current contact state. Therefore, recognizing contact state transitions is essential in order to determine whether to continue the current task or proceed to the next. In that context, learning-based solutions have shown outstanding results. However, these methods heavily rely on balanced datasets, which are challenging to obtain due to the short duration of certain contact states and rare failure cases. To address this issue, this paper proposes a framework for estimating contact state transitions using anomaly detection through input data reconstruction. The proposed framework operates in a semi-supervised manner, eliminating the need for balanced datasets during training. For input data reconstruction, a convolutional neural network is combined with a variational autoencoder to process various sensor measurements as a multivariate time series. Unlike traditional binary anomaly detection, the proposed anomaly detector scores reconstruction errors and leverages domain knowledge to identify various contact state transitions in the peg-in-hole assembly. The effectiveness of the proposed framework is validated through experiments using a torque-controlled dual manipulator system.

Abstract:
Circular electrical connectors (CECs) have a wide range of applications in scenarios that require reliable connections. However, sockets are often located in narrow scenes with random spatial orientations, complex lighting conditions, and obstructions from cables, making it difficult to accurately locate them through cameras. Besides, due to the complex geometric structure of CECs and the presence of electrode protection slots, the existing research on the assembly of cylindrical or polygonal pegs and holes may not be applicable to the assembly of such components. To this end, this article proposes a novel robotic assembly strategy for CECs with small relative initial deviations, whose core is to design a search trajectory and heuristic force strategy to perceive force/pose (F/P) discontinuity characteristics under different geometric constraints. This assembly strategy is independent of the CEC's size and is not affected by the socket's spatial orientation. The experiments with two different sizes of CECs on a robot equipped with a 6-dimensional force/torque (\mathbfF / \mathbfT) sensor are conducted, and the effectiveness and robustness of the proposed assembly strategy for CECs are demonstrated.

Abstract:
Safe and reliable environmental perception is crucial for the highly automated or even autonomous operation of agriculture machines. However, developing such a system is challenging due to imperfect perception sensors. This article proposes a fault management system (FMS) for detecting, diagnosing, and mitigating risks that compromise the safety and reliability of perception systems. This article aims to develop an improved image quality safety model (IQSM) for the FMS to detect and diagnose the causes of performance insufficiencies in object detection. The IQSM exhibits remarkable performance, achieving an accuracy of about 98%, demonstrating its ability to effectively identify performance insufficiencies under pre-defined hazardous scenarios.

Abstract:
Place recognition and relative localization are crucial for realizing the potential of collaboration in ground and aerial robot teams. Many existing works focus only on ground robots and are not well-suited for heterogeneous robot systems in large-scale environments. In this paper, we propose a novel pipeline based on BEV density image, combined with an enhanced data structure, for place recognition in air-ground robotic collaboration systems. An efficient height alignment algorithm is proposed for relative localization. Extensive experiments on various types of public datasets validate the efficacy of our method compared to other SOTA works. We also show that our method is capable to detect inter- and intra-robot loop closures in a ground and aerial multi-session SLAM system.

Abstract:
This paper presents a distributed pursuit frame-work for environments with obstacles considering state measurement uncertainty. Our framework consists of two primary components: the computation of safe pursuit regions based on Voronoi cell (VC) and the solution of an adaptive robust path controller based on Control Barrier Function (CBF). Initially, the chance constrained obstacle-aware Voronoi cell (CCOVC) for each pursuer is constructed by calculating separation hyperplane and buffer terms. Subsequently, we formulate chance CBF and chance Control Lyapunov Function (CLF) constraints, using convex approximation to determine their upper bounds. We then find the adaptive robust path controller by solving a Quadratically Constrained Quadratic Program (QCQP). The advantage of this framework lies in its capability to adaptively compute the path controller and ensure robust collision avoidance among pursuers and with obstacles. Simulation and experimental results demonstrate the effectiveness and robustness of the proposed framework.

Abstract:
While it is relatively easier to train humanoid robots to mimic specific locomotion skills, it is more challenging to learn from various motions and adhere to continuously changing commands. These robots must accurately track motion instructions, seamlessly transition between a variety of movements, and master intermediate motions not present in their reference data. In this work, we propose a novel approach that integrates human-like motion transfer with precise velocity tracking by a series of improvements to classical imitation learning. To enhance generalization, we employ the Wasserstein divergence criterion (WGAN-div). Furthermore, a Hybrid Internal Model provides structured estimates of hidden states and velocity to enhance mobile stability and environment adaptability, while a curiosity bonus fosters exploration. Our comprehensive method promises highly human-like locomotion that adapts to varying velocity requirements, direct generalization to unseen motions and multitasking, as well as zero-shot transfer to the simulator and the real world across different terrains. These advancements are validated through simulations across various robot models and extensive real-world experiments.

Abstract:
This paper proposes a safety-critical locomotion control framework employed for legged robots exploring through infeasible path in obstacle-rich environments. Our research focus is on achieving safe and robust locomotion where robots confront unavoidable obstacles en route to their designated destination. Through the utilization of outcomes from physical interactions with unknown objects, we establish a hierarchy among the safety-critical conditions avoiding the obstacles. This hierarchy enables the generation of a safe reference trajectory that adeptly mitigates conflicts among safety conditions and reduce the risk while controlling the robot toward its destination without additional motion planning methods. In addition, robust bipedal locomotion is achieved by utilizing the Hybrid Linear Inverted Pendulum model, coupled with a disturbance observer addressing a disturbance from the physical interaction.

Abstract:
This paper presents a novel under-actuated twofinger gripper that passively adapts to various environments and maintains its grip posture using a passive-locking mechanism. The proposed mechanism features fingers with three phalanges, each incorporating four-bar and eight-bar linkages arranged in parallel. These linkages perform crucial functions, including maintaining the grip angle and ensuring passive characteristics during pinch grips. Previous grippers with passive mechanisms and three-phalanx fingers faced issues with gripping instability, particularly when changes in the passive joint angle were caused by object inertia or external lateral forces. To address this problem, we propose a new passive-locking mechanism utilizing an eight-bar linkage. This innovative design is engineered to adapt to environmental conditions, establish a secure grip, and maintain the grip angle of the passive joint after the grip is achieved. To demonstrate the advantages of the proposed mechanism, this paper conducts a fingertip force vector analysis and a mobility analysis according to the pinch sequence. It also details the derivation process and principles of the mechanism. The gripper's operational range and gripping force are examined through kinematic analysis and verified by simulation. Furthermore, the study shows that the proposed mechanism effectively responds to environmental constraints, even in environments with obstacles surrounding the object. Comparative experiments with and without a contact bar indicate that the proposed gripper can stably secure an object in scenarios involving swing motions and external forces of approximately 5 N.

Abstract:
Nonlinear compliant actuators are being increasingly used in human-robot interaction scenarios due to their inherent flexibility. However, a limitation is that nonlinear hysteresis exists, which will degrade the force/torque tracking performance if the hysteresis is not modeled accurately. Moreover, the existing methods are difficult to deal with the multi-loop asymmetry hysteresis. In this work, we present a novel modeling method, in which the hysteresis curves are decoupled into nonlinear reference lines and symmetrical hysteresis loops. A hybrid hysteresis model based on power function and Maxwellslip model is then developed to fit the nonlinear reference lines and symmetrical hysteresis loops respectively. Experiments were conducted on a nonlinear compliant actuator and the results show that the root-mean-square-errors (RMSE) of the hysteresis model decreases by 24.4% when compared with the Maxwellslip based hysteresis model.

Abstract:
This paper addresses key aspects of domain randomization in generating synthetic data for manufacturing object detection applications. To this end, we present a comprehensive data generation pipeline that reflects different factors: object characteristics, background, illumination, camera settings, and post-processing. We also introduce the Synthetic Industrial Parts Object Detection dataset (SIP15-OD) consisting of 15 objects from three industrial use cases under varying environments as a test bed for the study, while also employing an industrial dataset publicly available for robotic applications. In our experiments, we present more abundant results and insights into the feasibility as well as challenges of sim-toreal object detection. In particular, we identified material properties, rendering methods, post-processing, and distractors as important factors. Our method, leveraging these, achieves top performance on the public dataset with Yolov8 models trained exclusively on synthetic data; mAP@50 scores of 96.4% for the robotics dataset, and 94.1%, 99.5%, and 95.3% across three of the SIP15-OD use cases, respectively. The results showcase the effectiveness of the proposed domain randomization, potentially covering the distribution close to real data for the applications.

Abstract:
In this paper, we present a novel method to control a rigidly connected location on the vehicle, such as a point on the implement in case of agricultural tasks. Agricultural robots are transforming modern farming by enabling precise and efficient operations, replacing humans in arduous tasks while reducing the use of chemicals. Traditionally, path-following algorithms are designed to guide the vehicle's center along a predefined trajectory. However, since the actual agronomic task is performed by the implement, it is essential to control a specific point on the implement itself rather than the vehicle's center. As such, we present in this paper two approaches for achieving the control of an offset point on the robot. The first approach adapts existing control laws, initially intended for the rear axle's midpoint, to manage the desired lateral deviation. The second approach employs backstepping control techniques to create a control law that directly targets the implement. We conduct realworld experiments, highlighting the limitations of traditional approaches for offset point control, and demonstrating the strengths and weaknesses of the proposed methods.

Abstract:
Data is crucial for robotic manipulation, as it underpins the development of robotic systems for complex tasks. While high-quality, diverse datasets enhance the performance and adaptability of robotic manipulation policies, collecting extensive expert-level data is resource-intensive. Consequently, many current datasets suffer from quality inconsistencies due to operator variability, highlighting the need for methods to utilize mixed-quality data effectively. To mitigate these issues, we propose “Select Segments to Imitate” (S2I), a framework that selects and optimizes mixed-quality demonstration data at the segment level, while ensuring plug-and-play compatibility with existing robotic manipulation policies. The framework has three components: demonstration segmentation dividing origin data into meaningful segments, segment selection using contrastive learning to find high-quality segments, and trajectory optimization to refine suboptimal segments for better policy learning. We evaluate S2I through comprehensive experiments in simulation and real-world environments across six tasks, demonstrating that with only 3 expert demonstrations for reference, S2I can improve the performance of various downstream policies when trained with mixed-quality demonstrations. Project website: https://tonyfang.net/s2i/.

Abstract:
Achieving accurate segmentation in low-light scenes is challenging due to 1) severe domain shift encountered when models trained on daylight data are applied to such scenes and 2) lack of large-scale fine-grained labels in low-light conditions. A good idea is to use the generalization capabilities of segmentation foundation models like Segment Anything Model (SAM) to address the scarcity of annotated data. However, applying SAM to low-light scenes faces a severe domain shift issue due to the lack of inductive bias in effectively transforming low-light features into natural-light ones. To address this issue, we propose to adapt SAM for low-light scenes. To reduce the reliance on labels of low-light data, we develop a self-training method that makes SAM generate source-free predictions. To reduce the domain gap between low-light target data and SAM's natural-light trained data, we design a transformation head that enhances low-light features prior to the application of SAM. We further propose a domain shift compensation loss that trains our model to select a domain-adaptation-optimal illumination-enhanced feature map. Experimental results demonstrate that our method well outperforms the state of the art on the Dark Zurich and Nighttime Driving datasets. Code is available at https://github.com/HongminMu/SALS.

Abstract:
We introduce a novel gradient-based approach for solving sequential tasks by dynamically adjusting the underlying myopic potential field in response to feedback and the world's regularities. This adjustment implicitly considers subgoals encoded in these regularities, enabling the solution of long sequential tasks, as demonstrated by solving the traditional planning domain of Blocks World–without any planning. Unlike conventional planning methods, our feedbackdriven approach adapts to uncertain and dynamic environments, as demonstrated by one hundred real-world trials involving drawer manipulation. These experiments highlight the robustness of our method compared to planning and show how interactive perception and error recovery naturally emerge from gradient descent without explicitly implementing them. This offers a computationally efficient alternative to planning for a variety of sequential tasks, while aligning with observations on biological problem-solving strategies.

Abstract:
To ensure the efficiency of robot autonomy under diverse real-world conditions, a high-quality heterogeneous dataset is essential to benchmark the operating algorithms' performance and robustness. Current benchmarks predominantly focus on urban terrains, specifically for on-road autonomous driving, leaving multi-degraded, densely vegetated, dynamic and feature-sparse environments, such as underground tunnels, natural fields, and modern indoor spaces underrepresented. To fill this gap, we introduce EnvoDat, a large-scale, multi-modal dataset collected in diverse environments and conditions, including high illumination, fog, rain, and zero visibility at different times of the day. Overall, EnvoDat contains 26 sequences from 13 scenes, 10 sensing modalities, over 1.9TB of data, and over 89 K fine-grained polygon-based annotations for more than 82 object and terrain classes. We post-processed EnvoDat in different formats that support benchmarking SLAM and supervised learning algorithms, and fine-tuning multimodal vision models. With EnvoDat, we contribute to environment-resilient robotic autonomy in areas where the conditions are extremely challenging. The datasets and other relevant resources can be accessed through https://linusnep.github.io/EnvoDat/.

Abstract:
This paper tackles the challenging problem of semi-supervised monocular 3D object detection with a general framework. In specific, having observed that the bottleneck of this task lies in lacking reliable and informative samples from unlabeled data for detector learning, we introduce a novel simple yet effective ‘Augment and Criticize’ pipeline that mines abundant informative samples for robust detection. To be more specific, in the ‘Augment’ stage, we present the Augmentation-based Prediction aGgregation (APG), which applies automatically learned transformations to unlabeled images and aggregates detections from various augmented views as pseudo labels. Since not all the pseudo labels from APG are beneficially informative, the subsequent ‘Criticize’ phase is introduced. Particularly, we present the Critical Retraining Strategy (CRS) that, unlike simply filtering pseudo labels using a fixed threshold, employs a learnable network to evaluate the contribution of unlabeled images at different training timestamps. This way, the noisy samples prohibitive to model evolution can be effectively suppressed. In order to validate ‘Augment-Criticize’, we apply it to MonoDLE [1] and MonoFlex [2], and the two new detectors, dubbed 3DSeMoDLE and 3DSeMoFLEX, achieve state-of-the-art results with consistent improvements, evidencing its effectiveness and generality.

Abstract:
In this work, the Divergent Component of Motion (DCM) method is expanded to include angular coordinates for the first time. This work introduces the idea of spatial DCM, which adds an angular objective to the existing linear DCM theory. To incorporate the angular component into the framework, a discussion is provided on extending beyond the linear motion of the Linear Inverted Pendulum model (LIPM) towards the Single Rigid Body model (SRBM) for DCM. This work presents the angular DCM theory for a 1D rotation, simplifying the SRBM rotational dynamics to a flywheel to satisfy necessary linearity constraints. The 1D angular DCM is mathematically identical to the linear DCM and defined as an angle which is ahead of the current body rotation based on the angular velocity. This theory is combined into a 3D linear and 1D angular DCM framework, with discussion on the feasibility of simultaneously achieving both sets of objectives. A simulation in MATLAB and hardware results on the TORO humanoid are presented to validate the framework's performance.

Abstract:
Humanoid robots that mimic human movement have garnered significant attention in recent years. This study focuses on mimicking the efficient pitching motion of humans by incorporating two main approaches into a humanoid robot: (1) the use of elastic elements to assist joint torque, and (2) the optimization of motor torque to minimize energy consumption. This robot is intended to emulate human physical characteristics, such as mass, link length, and center of gravity, with a particular focus on utilizing the elastic energy generated during shoulder internal and external rotation. A leaf spring is attached in parallel with the motor at the shoulder pitch joint to release the elastic energy stored during shoulder external rotation, thereby assisting internal rotation in a manner similar to human biomechanics. Additionally, motor torque optimization is addressed by formulating the torque minimization problem as a combinatorial optimization challenge and solving it using Fujitsu's quantum-inspired Digital Annealer. Experiments conducted through simulations and with an actual pitching robot assessed the effectiveness of these technologies in mimicking human-like pitching motion. The results suggest that combining elastic elements with motion optimization techniques enable robots to achieve more efficient human-like movements.

Abstract:
Among the many challenges of parallel kinematic manipulators, achieving high-speed and accurate control remains crucial. Estimating their dynamic properties is essential for designing precise and efficient control schemes. Conventional methods for dynamic model identification have been effective, though deep learning approaches have historically faced limitations due to data inefficiencies. However, recent advancements in physics-informed neural networks (PINNs) offer a way to improve both control and the extraction of interpretable physical properties from these robots. In this work, we propose and validate a PINN-based dynamic model for a Delta parallel robot, specifically the ABB IRB 360-6/1600. Our approach incorporates known physical properties, such as mass matrix sparsity, to improve accuracy and computational efficiency in dynamic model identification. To the best of our knowledge, this is the first study applying PINNs to model parallel robots. The method is validated experimentally, and its performance is compared to a validated identification technique for physically consistent identification, demonstrating the effectiveness of this approach for real-world applications in parallel robots.

Abstract:
People with visual impairments rely on Electronic Travel Aids (ETAs), such as sensor-equipped guide canes, for safe and effective navigation. Misalignment or improper handling of these devices can reduce their effectiveness, increasing the risk of collisions and injuries. This paper presents an AIbased embedded system designed to predict and correct the orientation of a guide cane in real time. By integrating an Inertial Measurement Unit (IMU) with a neural network, the system continuously monitors the cane's lateral angle and orientation while providing feedback to help the user self-correct. The feedback is proportional to the degree of error, guiding users to maintain proper cane positioning during mobility. The device logs data that can be visualized remotely, offering mobility trainers valuable insights into the user's navigation patterns. Evaluation by visually impaired users demonstrated that the system effectively aided in real-time orientation correction. This effort contributes towards safe use of ETAs for navigation by visually challenged persons.

Affiliations: School of Integrated Technology, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea; Information School, University of Washington, Seattle, Washington, USA; Department of Control and Robot Engineering & School of Aerospace Engineering, Gyeongsang National University, Jinju-si, Republic of Korea; Electrical and Computer Engineering, University of Washington, Seattle, Washington, USA; Computer Science Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA

Abstract:
In this paper, we present an adaptive walker system designed to address limitations in current intelligent walker technologies. While recent advancements have been made in this field, existing systems often struggle to seamlessly interpret user intent for speed control and lack adaptability across diverse scenarios and terrain. Our proposed solution incorporates high-resolution tactile sensors, deep learning algorithms, IMU sensors, and linear motors to dynamically adjust to the user's intentions and terrain changes. The system is capable of predicting the user's desired speed with an error margin of only 20.99%, relying solely on tactile input from hand and arm contact points. Additionally, it maintains the walker's horizontal stability with an error of less than 1 degree by adjusting leg lengths in response to variations in ground angle. This adaptive walker enhances user safety and comfort, particularly for individuals with reduced strength or cognitive abilities, and offers reliable assistance on uneven terrain such as uphill and downhill paths.

Abstract:
Amputations and limb loss can have detrimental effects on personal well-being. Although prosthetic devices can offer significant benefits helping amputees regain some of the lost dexterity, they often lack the required affordability and durability. Current affordable prosthetic designs have trended towards underactuation, which contributes to stable grasping but is often characterized by low durability. In this paper, a new chain-driven, adaptive, underactuated finger design has been proposed for the development of affordable and highly durable prosthetic hands. The transmission mechanism used is composed of a steel roller chain and several routing sprockets. The finger phalanges are constructed of 3D printed PLA, and finger flexion is produced by pulling the internally routed roller chain. In total, six 3D printed PLA sprockets are used for chain routing, with a design emphasis on high force transmission. The performance of the proposed chaindriven finger was experimentally validated and compared with an analogous tendon-driven version. The metrics employed for this comparison were longevity, pinch grasp efficiency, force response, and maximum force capability. The chain-driven finger was shown to have a higher maximum transmissible force, better long term durability, and no issues related to elongation (such as tendon elongation). The cost to manufacture the chain-driven robotic finger is only 91 USD, making it an excellent solution for affordable prostheses.

Abstract:
We introduce RMP-YOLO, a unified framework designed to provide robust motion predictions even with incomplete input data. Our key insight stems from the observation that complete and reliable historical trajectory data plays a pivotal role in ensuring accurate motion prediction. Therefore, we propose a new paradigm that prioritizes the reconstruction of intact historical trajectories before feeding them into the prediction modules. Our approach introduces a novel scene tokenization module to enhance the extraction and fusion of spatial and temporal features. Following this, our proposed recovery module reconstructs agents' incomplete historical trajectories by leveraging local map topology and interactions with nearby agents. The reconstructed, clean historical data is then integrated into the downstream prediction modules. Our framework is able to effectively handle missing data of varying lengths and remains robust against observation noise while maintaining high prediction accuracy. Furthermore, our recovery module is compatible with existing prediction models, ensuring seamless integration. Extensive experiments validate the effectiveness of our approach, and deployment in real-world autonomous vehicles confirms its practical utility. In the 2024 Waymo Motion Prediction Competition, our method, RMP-YOLO, achieves state-of-the-art performance, securing third place. Our code is open-source at https://github.com/ggosjw/RMP-YOLO.

Abstract:
We study human perceptions of a robot that performs robot-to-human (R2H) handovers controlled to grasp, transport, and transfer 34 objects by mimicking human givers in human-human (H2H) handover data. Recognizing the importance of human-like robotic behavior for successful collaboration, R2H studies use models of human behavior or observations of H2H data to plan robot giver motion. However, R2H studies have been limited in object counts. In this work, we use the Human-Object-Human (HOH) dataset, consisting of H2H interactions performed by 20 giver-receiver pairs with 136 objects, to conduct an R2H study with 34 objects. We teleoperate a Kinova Gen3 manipulator to grip an object as grasped by an HOH human giver, and program it to automatically transport and orient the object to a participant by mimicking the HOH giver's trajectory and transfer pose. We survey participants on safety, naturalness, and preferred choice over linear trajectory and random orientation baselines. We find that transfer pose influences perceptions of naturalness, with HOH poses showing higher naturalness ratings. Participants prefer handovers with HOH end poses when asked to pick their preferred interaction.

Abstract:
Object manipulation is a high-frequency task required in assistive robotic systems in order to aid the elderly or those with disabilities that impact motor control. In the instance where arms cannot be used to command a robot, gaze-tracking via smart glasses is a suitable candidate. In this work, we develop a modeling method and model-based filtering and control strategy for direct gaze-guided teleoperation of robotic manipulators. We demonstrate the feasibility of this control strategy in an object manipulation case study with six participants. The results indicate that a model-based gaze filtering and control strategy produces smooth commands for the robot that are easy for the participants to use. These methods can reduce the perceived workload of the user by 37.51% and lower the gripper positioning error by 39.09% compared to using unfiltered gaze data.

Abstract:
Individuals with Severe Speech and Motor Impairment (SSMI) struggle to interact with their surroundings due to physical and communicative limitations. To address these challenges, this paper presents a gaze-controlled robotic system that helps SSMI users perform stamp printing tasks. The system includes gaze-controlled interfaces and a robotic arm with a gripper, designed specifically for SSMI users to enhance accessibility and interaction. User studies with gazecontrolled interfaces such as video see-through (VST), video pass-through (VPT), and optical see-through (OST) displays demonstrated the system's effectiveness. Results showed that VST had the average stamping time of 28.45 s(SD = 15.44s) and the average stamp count 7.36(SD = 3.83), outperforming VPT and OST.

Abstract:
Individuals with tetraplegia have their independence and quality of life severely affected. Assistive robotic arms can enhance their autonomy, but effective control interfaces are essential for optimizing their usability and performance. This study aims to evaluate the performance and user experience of three control interfaces for an assistive robotic arm: Graphical User Interfaces (GUI), Embedded Interface, and Directional Gaze. Thirty-three able-bodied participants were recruited to control an assistive robotic arm through the three different interfaces in a between-subjects experiment. Performance was measured using the Yale-CMU-Berkeley (YCB) Block Pick and Place Protocol. Usability (SUS) and task, workload (NASATLX) were measured through subjective questionnaires. Additionally, we report saccades per minute and fixation duration. The results revealed statistically significant differences showing that Embedded and GUI interfaces, when compared to the Directional Gaze interface, can lead to lower workloads and higher performance in pick-up tasks.

Abstract:
Achieving monocular camera localization within pre-built LiDAR maps can bypass the simultaneous mapping process of visual SLAM systems, potentially reducing the computational overhead of autonomous localization. To this end, one of the key challenges is cross-modal place recognition, which involves retrieving 3D scenes (point clouds) from a LiDAR map according to online RGB images. In this paper, we introduce an efficient framework to learn descriptors for both RGB images and point clouds. It takes visual state space model (VMamba) as the backbone and employs a pixel-view-scene joint training strategy for cross-modal contrastive learning. To address the field-of-view differences, independent descriptors are generated from multiple evenly distributed viewpoints for point clouds. A visible 3D points overlap strategy is then designed to quantify the similarity between point cloud views and RGB images for multi-view supervision. Additionally, when generating descriptors from pixel-level features using NetVLAD, we compensate for the loss of geometric information, and introduce an efficient scheme for multi-view generation. Experimental results on the KITTI and KITTI-360 datasets demonstrate the effectiveness and generalization of our method. The code is available at https://github.com/y2w-oc/I2P-CMPR.

Abstract:
Place recognition is essential for achieving closedloop or global positioning in autonomous vehicles and mobile robots. Despite recent advancements in place recognition using 2D cameras or 3D LiDAR, it remains to be seen how to use 4D radar for place recognition - an increasingly popular sensor for its robustness against adverse weather and lighting conditions. Compared to LiDAR point clouds, radar data are drastically sparser, noisier and in much lower resolution, which hampers their ability to effectively represent scenes, posing significant challenges for 4D radar-based place recognition. This work addresses these challenges by leveraging multimodal information from sequential 4D radar scans and effectively extracting and aggregating spatio-temporal features. Our approach follows a principled pipeline that comprises (1) dynamic points removal and ego-velocity estimation from velocity property, (2) bird's eye view (BEV) feature encoding on the refined point cloud, (3) feature alignment using BEV feature map motion trajectory calculated by ego-velocity, (4) multiscale spatio-temporal features of the aligned BEV feature maps are extracted and aggregated. Real-world experimental results validate the feasibility of the proposed method and demonstrate its robustness in handling dynamic environments. Source codes are available.

Abstract:
Actuator faults in dynamic systems pose significant challenges, particularly for robotic systems operating in hostile environments such as Autonomous Underwater Vehicles (AUVs), risking loss of stability and performance degradation. Fault Tolerant Control (FTC) strategies, including Control Reallocation (CR), have been developed to mitigate such risks. However, these strategies extensively depend on explicit fault diagnosis, which may present challenges regarding computational demands and efficiency, particularly when dealing with unknown faults. This paper presents a novel method that performs CR with Deep Reinforcement Learning (DRL) for actuator fault recovery without explicit fault diagnosis. The approach is implemented on a BlueROV2 underwater vehicle and demonstrates improved performance for fault recovery compared to a standard Proportional-Integral-Derivative (PID) controller and a variable gain PID controller, both in simulation and in real-world conditions. The DRL-based CR method demonstrates generalisability by successfully handling faults not encountered during training, highlighting its adaptability to unforeseen circumstances.

Abstract:
Autonomous robotic wiping is an important task in various industries, ranging from industrial manufacturing to sanitization in healthcare. Deep reinforcement learning (Deep RL) has emerged as a promising algorithm, however, it often suffers from a high demand for repetitive reward engineering. Instead of relying on manual tuning, we first analyze the convergence of quality-critical robotic wiping, which requires both high-quality wiping and fast task completion, to show the poor convergence of the problem and propose a new bounded reward formulation to make the problem feasible. Then, we further improve the learning process by proposing a novel visual-language model (VLM) based curriculum, which actively monitors the progress and suggests hyperparameter tuning. We demonstrate that the combined method can find a desirable wiping policy on surfaces with various curvatures, frictions, and waypoints, which cannot be learned with the baseline formulation. The demo of this project can be found at: https://sites.google.com/view/highqualitywiping

Abstract:
Cable-driven exosuits assist users in ambulatory activities by transmitting assistive torques from motors to the actuated joints. State-of-the-art exosuits typically use Bowden cable transmissions, albeit their limited efficiencies (40–60 %) and non-linear response in curved paths. This paper evaluates the efficiency and responsiveness of a new cable-pulley transmission compared to a Bowden transmission, using both steel and Dyneema cables. The analysis includes three experiments: a test bench simulating a curved transmission path, followed by a static and dynamic experiment where six unimpaired participants donned an exosuit featuring both transmissions across the hips and knees. Our findings demonstrate that the pulley transmission consistently outperformed the Bowden's efficiency by absolute margins of 18.77 \pm 7.29% using a steel cable and by 40.60 \pm 6.76% using a Dyneema cable across all experiments. Additionally, the steel cable was on average 19.19 \pm 5.29% more efficient than the Dyneema cable in the pulley transmission and 41.02 \pm 6.34% in the Bowden tube. These results led to the development of the Stillsuit, a novel lower-limb cable-driven exosuit that uses a pulley transmission and steel cable. The Stillsuit sets a new benchmark for exosuits with 87.56 \pm 3.92 % transmission efficiency, generating similar biological torques to those found in literature (16.4% and 19.0% of the biological knee and hip torques, respectively) while using smaller motors, resulting in a lighter actuation unit (1.92 kg).

Abstract:
Safe navigation of cluttered environments is a critical challenge in robotics. It is typically approached by separating the planning and tracking problems, with planning executed on a reduced order model to generate reference trajectories, and control techniques used to track these trajectories on the full order dynamics. Inevitable tracking error necessitates robustification of the nominal plan to ensure safety; in many cases, this is accomplished via worst-case bounding, which ignores the fact that some trajectories of the planning model may be easier to track than others. In this work, we present a novel method leveraging massively parallel simulation to learn a dynamic tube representation, which characterizes tracking performance as a function of actions taken by the planning model. Planning model trajectories are then optimized such that the dynamic tube lies in the free space, allowing a balance between performance and safety to be traded off in real time. The resulting Dynamic Tube MPC is applied to the 3D hopping robot ARCHER, enabling agile and performant navigation of cluttered environments, and safe collision-free traversal of narrow corridors.

Abstract:
Embodied intelligence requires precise reconstruction and rendering to simulate large-scale real-world data. Although 3D Gaussian Splatting (3DGS) has recently demonstrated high-quality results with real-time performance, it still faces challenges in indoor scenes with large, textureless regions, resulting in incomplete and noisy reconstructions due to poor point cloud initialization and underconstrained optimization. Inspired by the continuity of signed distance field (SDF), which naturally has advantages in modeling surfaces, we propose a unified optimization framework that integrates neural signed distance fields (SDFs) with 3DGS for accurate geometry reconstruction and real-time rendering. This framework incorporates a neural SDF field to guide the densification and pruning of Gaussians, enabling Gaussians to model scenes accurately even with poor initialized point clouds. Simultaneously, the geometry represented by Gaussians improves the efficiency of the SDF field by piloting its point sampling. Additionally, we introduce two regularization terms based on normal and edge priors to resolve geometric ambiguities in textureless areas and enhance detail accuracy. Extensive experiments in ScanNet and ScanNet++ show that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. Project page: https://xhd0612.github.io/GaussianRoom.github.io/

Abstract:
Regular cleaning and maintenance of high-altitude pipes and curved surfaces on high-rise buildings are high-risk tasks for human workers due to the difficulty of working on curved planes. To address such challenge, automated robots are widely used for cleaning buildings with flat walls, but they cannot climb on curved surfaces, limiting their practical applications. This paper proposes a novel biped curved-surface climbing robot (BCCR) with five-degree-of-freedom (5-DOF) motion. The BCCR features adaptive vacuum suction modules that can adhere to both curved and flat surfaces, allowing seamless movement of the BCCR across various surfaces. Each terminal suction module is composed of three small suction cups, which are capable of rotating in all directions to achieve adaptive adhesion on various surfaces. The 5-DOF structure enables the robot to cross obstacles and makes it highly versatile for various cleaning tasks on a wide range of surfaces, including large curved pipes. The mechanism design and analytical modeling of the BCCR are carried out, demonstrating its robust curved-surface climbing capabilities. Moreover, a prototype is fabricated for experimental investigation. The results indicate that the proposed 5-DOF BCCR can achieve stable climbing on curved surfaces.

Abstract:
Tracking and manipulating irregularly-shaped, previously unseen objects in dynamic environments is important for robotic applications in manufacturing, assembly, and logistics. Recently introduced Gaussian Splats [1] efficiently model object geometry, but lack persistent state estimation for taskoriented manipulation. We present Persistent Object Gaussian Splat (POGS), a system that embeds semantics, self-supervised visual features, and object grouping features into a compact representation that can be continuously updated to estimate the pose of scanned objects. POGS updates object states without requiring expensive rescanning or prior CAD models of objects. After an initial multi-view scene capture and training phase, POGS uses a single stereo camera to integrate depth estimates along with self-supervised vision encoder features for object pose estimation. POGS supports grasping, reorientation, and natural language-driven manipulation by refining object pose estimates, facilitating sequential object reset operations with human-induced object perturbations and tool servoing, where robots recover tool pose despite tool perturbations of up to 30°. POGS achieves up to 12 consecutive successful object resets and recovers from 80% of in-grasp tool perturbations.

Abstract:
This paper proposes a trajectory tracking control law for a mobile robot with two front differential wheels and a tail wheel. The dynamics is given by mimicking Ackerman steering model for the dynamics of position and orientation, associated with the actuator dynamics of the tail wheel's angle modeled by a first-order response with respect to the robot's angular velocity. First we develop a nominal trajectory tracking control law to track a given desired trajectory by applying differential-flatness property of the unicycle model and backstepping approach to handle the actuator dynamics. The effectiveness of the trajectory tracking is demonstrated by conducting hardware robot experiment after performing system identification, which illustrates the superior performance over a benchmark method. The design is also extended to an adaptive tracking control under parameter uncertainty in the tail wheel dynamics through introducing the adaptation law of the parameters, and the performance is demonstrated in numerical simulation.

Abstract:
The stiffness of passive lower-limb exoskeletons and orthoses governs their assistance. A common practice in the design of these systems is to assume the stiffness of the device is determined only by the intended elastic element (e.g., spring), while the structural components, human attachments, and soft tissues are considered rigid. In practice, the mechanical behavior of orthoses is significantly affected by the compliance of these elements, which drastically impacts the assistance provided. In this work, we present a linkage model with compliant elements that can accurately predict the applied stiffness of ankle-foot orthoses, and retroactively estimate the stiffness of unintended spring elements from published data. The compliant model accurately predicted the torque trajectories of two published passive orthoses with modeled peak torques within 4 % to 7 % of measured values. In contrast, the rigid model greatly overestimated the peak torques, predicting 203 % to 376 % of the measured values. The compliant model also indicated that an onboard joint encoder could only measure 52 % to 69 % of the peak ankle angle recorded with motion capture. The compliant model was also used to reassess the stiffness range of a variable-stiffness orthosis, indicating that its adjustable range is likely 69 % of rigid model predictions. Overall, this work highlights the need to consider how unmodeled compliance affects the mechanical behavior of orthoses and provides a foundation for further exploration.

Abstract:
To maximize the autonomy of individuals with upper limb amputations in daily activities, leveraging forearm muscle information to infer movement intent is a promising research direction. While current prosthetic hand technologies can utilize forearm muscle data to achieve basic movements such as grasping, accurately estimating finger joint angles remains a significant challenge. Therefore, we propose a Multi-Stage Cascade Convolutional Neural Network with Long Short-Term Memory Network, where an upsampling module is introduced before the downsampling module to enhance model generalization. Additionally, we designed a transfer learning (TL) framework based on parameter freezing, where the pre-trained downsampling module is fixed, and only the upsampling module is updated with a small amount of out-ofdistribution data to achieve TL. Furthermore, we compared the performance of unimodal and multimodal models, collecting surface electromyography (sEMG) signals, brightness mode ultrasound images (B-mode US images), and motion capture data simultaneously. The results show that on the validation set, the US image had the lowest error, while on the prediction set, the four-channel sEMG achieved the lowest error. The performance of the multimodal model in both datasets was intermediate between the unimodal models. On the prediction set, the average normalized root mean square error values for the four-channel sEMG, US images, and sensor fusion models across three subjects were 0.170,0.203, and 0.186, respectively. By utilizing advanced sensor fusion techniques and TL, our approach can reduce the need for extensive data collection and training for new users, making prosthetic control more accessible and adaptable to individual needs.

Abstract:
Developing whole-body tactile skins for robots remains a challenging task, as existing solutions often prioritize modular, one-size-fits-all designs, which, while versatile, fail to account for the robot's specific shape and the unique demands of its operational context. In this work, we introduce GenTact Toolbox, a computational pipeline for creating versatile wholebody tactile skins tailored to both robot shape and application domain. Our method includes procedural mesh generation for conforming to a robot's topology, task-driven simulation to refine sensor distribution, and multi-material 3D printing for shape-agnostic fabrication. We validate our approach by creating and deploying six capacitive sensing skins on a Franka Research 3 robot arm in a human-robot interaction scenario. This work represents a shift from “one-size-fits-all” tactile sensors toward context-driven, highly adaptable designs that can be customized for a wide range of robotic systems and applications. The project website is available at https://hiro-group.ronc.one/gentacttoolbox

Abstract:
Robot programming for complex assembly tasks is challenging and demands expert knowledge. With Augmented Reality (AR), immersive 3D visualization can be placed in the robot's intrinsic coordinate system to support robot programming. However, AR interfaces introduce usability challenges. To address these, we introduce a hybrid user interface (HUI) that combines a 2D desktop, a smartphone, and an AR head-mounted display (HMD) application, enabling operators to choose the most suitable device for each sub-task. The evaluation with an expert user study shows that an HUI can enhance efficiency and user experience by selecting the appropriate device for each sub-task. Generally, the HMD is preferred for tasks involving 3D content, the desktop for creating the program structure and parametrization, and the smartphone for mobile parametrization. However, the device selection depends on individual user characteristics and their familiarity with the devices.

Abstract:
In this paper, we propose a novel and generalizable zero-shot knowledge transfer framework that distills expert sports navigation strategies from web videos into robotic systems with adversarial constraints and out-of-distribution image trajectories. Our pipeline enables diffusion-based imitation learning by reconstructing the full 3D task space from multiple partial views, warping it into 2D image space, closing the planning loop within this 2D space, and transfer constrained motion of interest back to task space. Additionally, we demonstrate that the learned policy can serve as a local planner in conjunction with position control. We apply this framework in the wheelchair tennis navigation problem to guide the wheelchair into the ball-hitting region. Our pipeline achieves a navigation success rate of \mathbf9 7. 6 7 % in reaching real-world recorded tennis ball trajectories with a physical robot wheelchair, and achieve a success rate of 68.49% in a real-world, real-time experiment on a full-sized tennis court22Code is at https://github.gatech.edu/MCG-Lab/tennis_gameplay_learning.

Abstract:
To robustly handle objects, robots must perceive mechanical interactions through touch with sufficient richness. New tactile sensors leverage miniature cameras to provide dense measurements of these interactions, allowing for the extraction of material properties and frictional information. Among the plethora of solutions, retrographic sensing is popular for its ability to finely resolve the shape of the object being touched. These sensors use a reflective membrane, illuminated at a shallow angle by three RGB lights from which fine details of the surface can be recovered. However, these retrographic sensors are unable to detect the lateral displacement of the membrane and, therefore overlook frictional information, which is crucial for grasping and manipulation. Embedding and tracking opaque markers has been a makeshift solution, but these markers occlude the membrane and are difficult to manufacture. In this paper, we introduce ShadowTac, a tactile sensor that combines retrographic illumination with non-intrusive markers created by colored shadows. We patterned the retrographic surface with a dense array of submillimeter dimples, which are small enough not to obstruct the view yet cast shadows large enough to be visible to the camera. ShadowTac captures a dense image of both the normal displacement field with fine details and a precise lateral displacement field by tracking the markers. Additionally, our sensor is easy to manufacture, as the dimple pattern can simply be molded. We evaluated the measurement reliability of ShadowTac and its effectiveness in estimating the incipient slip of arbitrary objects. The dense measurement of both the normal and shear deformation that the sensor captures makes it ideal for tracking dynamic interactions between robotic fingertips and manipulated objects.

Abstract:
Wheeled self-reconfigurable robots (WSRRs), a new type of multi-robot system with flexible configurations and task adaptability, have an extensive application prospects in unstructured mission environments. In this paper, based on the nonholonomic constraints and Lagrange method, the combinatorial modal kinematics and dynamics of WSRRs with arbitrary reconfiguration scale are established. At the kinematic level, based on the nonholonomic constraints, a smooth obstacle avoidance strategy based on the safety geofences is designed to ensure safety. At the dynamic level, an adaptive fault-tolerant mechanism is introduced to ensure reasonable torque distribution and avoid tracking performance degradation. Meanwhile, an improved extended state observer (IESO) is elaborated, through which the high-frequency ocsillation from measurement noises and peaking phenomenon from initial observer errors can be suppressed, and the robust velocity tracking control under unknown lumped disturbances is realized. Finally, a real-world WSRRs experiment is constructed to verify the proposed method's fault tolerance, robustness, and safety comparatively.

Abstract:
Automating small-scale experiments in materials science presents challenges due to the heterogeneous nature of experimental setups. This study introduces the SCU-Hand (Soft Conical Universal Robot Hand), a novel end-effector designed to automate the task of scooping powdered samples from various container sizes using a robotic arm. The SCU-Hand employs a flexible, conical structure that adapts to different container geometries through deformation, maintaining consistent contact without complex force sensing or machine learning-based control methods. Its reconfigurable mechanism allows for size adjustment, enabling efficient scooping from diverse container types. By combining soft robotics principles with a sheet-morphing design, our end-effector achieves high flexibility while retaining the necessary stiffness for effective powder manipulation. We detail the design principles, fabrication process, and experimental validation of the SCU-Hand. Experimental validation showed that the scooping capacity is about 20% higher than that of a commercial tool, with a scooping performance of more than 95% for containers of sizes between 67 mm to 110 mm. This research contributes to laboratory automation by offering a cost-effective, easily implementable solution for automating tasks such as materials synthesis and characterization processes.

Abstract:
Soft robotic hands compensate for uncertainty in perception and actuation by leveraging passive deformation in their intrinsically compliant hardware, facilitating robust and dexterous interactions with their environment. The ability to adjust the level of compliance during operation has the potential to further improve the performance of these hands by enabling novel interaction strategies. However, achieving variable stiffness mechanically typically requires significant engineering complexity, making these systems difficult to manufacture, prone to error, and expensive. We present a novel, very simple mechanism for achieving variable stiffness. This mechanism employs tendon-driven antagonistic actuation, with Bowden cables connecting elastic elements to servomotors. It supports compact actuator designs, while the Bowden cables facilitate flexible component placement within a robotic system. Following our approach, variable stiffness actuators can be easily manufactured at low-cost from readily available materials. Despite its simplicity, we demonstrate that our mechanism provides consistent and precise control over stiffness levels and contact torques, showcasing its potential for a broad range of applications in soft robotic systems.

Abstract:
Helmet-mounted wearable positioning systems are crucial for enhancing safety and facilitating coordination in industrial, construction, and emergency rescue environments. These systems, including LiDAR-Inertial Odometry (LIO) and Visual-Inertial Odometry (VIO), often face challenges in localization due to adverse environmental conditions such as dust, smoke, and limited visual features. To address these limitations, we propose a novel head-mounted Inertial Measurement Unit (IMU) dataset with ground truth, aimed at advancing data-driven IMU pose estimation. Our dataset captures human head motion patterns using a helmet-mounted system, with data from ten participants performing various activities. We explore the application of neural networks, specifically Long Short-Term Memory (LSTM) and Transformer networks, to correct IMU biases and improve localization accuracy. Additionally, we evaluate the performance of these methods across different IMU data window dimensions, motion patterns, and sensor types. We release a publicly available dataset, demonstrate the feasibility of advanced neural network approaches for helmet-based localization, and provide evaluation metrics to establish a baseline for future studies in this field. Data and code can be found at https://lqiutong.github.io/HelmetPoser.github.io/.

Abstract:
3D detection of vehicles is an essential component for autonomous driving applications. Nevertheless, collecting the supervised training data for learning 3D vehicle detectors would be costly (e.g. utilization of expensive LiDAR sensors) and labor-intensive (for human annotation). In comparison to 3D detection, 2D object detection has achieved a welldeveloped status, boosting stable and robust performance with widespread application in numerous fields, thanks to the large scale (i.e. amount of samples) of existing training datasets of 2D object detection. Hence, in our work, we propose to realize 3D detection via leveraging the robustness of 2D detectors and developing a network that lifts 2D detections to 3D. With the flexibility of building upon various backbone models (e.g. the models which take image regions detected by 2D detector as inputs to predict their corresponding 3D bounding boxes, or the existing monocular 3D detection models which have the intermediate output of 2 D bounding boxes), we propose several geometry-driven objectives, including projection consistency loss, geometry depth loss, and opposite bin loss, to improve the training upon 2D-to-3D lifting. Our extensive experimental results demonstrate that our proposed geometrydriven objectives not only contribute to the superior results of 3D detection but also provide better generalizability across datasets.

Abstract:
Non-contact manipulation at the air-liquid interface holds significant potential for applications in microrobotics, non-invasive assembly, and biochemistry analysis. However, achieving simultaneous position and orientation (pose) control of floating objects remains a considerable challenge, particularly for adaptive control without prior modeling of the objects. Here, we introduce the Vision-based Adaptive Laser Gripper (VALG) system addressing these challenges. By leveraging the distributed thermocapillary flow induced by patterned laser scanning, a pose control strategy based on the equidistant contour scanning laser is proposed and validated. The proposed system relies solely on visual recognition to generate adaptive laser grippers, which achieve static equilibrium to simultaneously constrain the position and orientation of the floating objects. Experimental validation demonstrates the effectiveness of the VALG system in independent position and orientation control, coupled pose control, and path following. The VALG system facilitates smooth, precise, fast, and adaptive pose control of generalized floating objects, establishing it as a universal and versatile platform for non-contact manipulation at the air-liquid interface.

Abstract:
Managing indirect access in laparoscopy as a minimally invasive procedure poses challenges to physicians. In particular, an endoscope must be navigated to achieve adequate visualization of the surgical anatomy, while coping with unergonomic poses, tremor, and fatigue. Furthermore, the alignment of visual perception and physical movement, dictated by the endoscope's position relative to the monitor, can lead to hand-eye coordination challenges. We propose unified deployment of a robotic endoscope holder together with an augmented reality display to counteract the aforementioned challenges in laparoscopy. Our augmented reality system provides an interactive, stereoscopic, virtual monitor displaying an endoscopic stream. In addition, our method design enables direct control of the robotic endoscope holder. Our user study demonstrates the potential of the proposed method to significantly improve hand-eye coordination, while insights from our usability study for robotic control indicate promising trends, including high usability and low cognitive demand.

Abstract:
In this work, we propose a high-voltage, high-frequency control circuit for the untethered applications of dielectric elastomer actuators (DEAs). The circuit board leverages low-voltage resistive components connected in series to control voltages of up to 1.8 kV within a compact size, suitable for frequencies ranging from 0 to 1kHz. A single-channel control board weighs only 2.5 g. We tested the performance of the control circuit under different load conditions and power supplies. Based on this control circuit, along with a commercial miniature high-voltage power converter, we construct an untethered crawling robot driven by a cylindrical DEA. The 42-g untethered robots successfully obtained crawling locomotion on a bench and within a pipeline at a driving frequency of 15 Hz, while simultaneously transmitting real-time video data via an onboard camera and antenna. Our work provides a practical way to use low-voltage control electronics to achieve the untethered driving of DEAs, and therefore portable and wearable devices.

Abstract:
Visual navigation in dynamic environments poses a considerable challenge, particularly in scenarios with diverse pedestrian behaviors. Traditional simulators primarily focus on static scenes, while existing dynamic pedestrian simulators often suffer limitations such as monotonous pedestrian models, lack of interaction with the environment, and constrained scenarios. These deficiencies lead to notable discrepancies from real-world dynamic pedestrian environments. To bridge this gap, we introduce DP-Habitat, a dynamic pedestrian simulator developed on the Habitat platform. DP-Habitat efficiently simulates a wide range of complex and realistic human behaviors, with flexible interactions between pedestrian models and environments. It also supports rapid deployment of pedestrian models across various scenes, thereby more accurately replicating the complexities of real-world dynamic pedestrian settings. Additionally, we present Adaptive Object Navigation with Dynamic Mapping (AON-DM), a novel baseline method specifically designed for dynamic pedestrian settings. AON-DM integrates real-time pedestrian tracking and predictive modeling with a hybrid path planning strategy, markedly improving navigation efficiency and success rates. Our experimental results reveal that dynamic pedestrians significantly affect visual navigation performance within DP-Habitat, with AON-DM achieving superior effectiveness compared to existing methods under these challenging conditions. Furthermore, our approach maintains high performance in real-world scenarios, highlighting its practical applicability and robustness. The code and data are available at https://github.com/qinliangql/DP-Habitat.git.

Abstract:
We introduce SeaSplat, a method to enable real-time rendering of underwater scenes leveraging recent advances in 3D radiance fields. Underwater scenes are challenging visual environments, as rendering through a medium such as water introduces both range and color dependent effects on image capture. We constrain 3D Gaussian Splatting (3DGS), a recent advance in radiance fields enabling rapid training and real-time rendering of full 3D scenes, with a physically grounded under-water image formation model. Applying SeaSplat to the real-world scenes from SeaThru-NeRF dataset, a scene collected by an underwater vehicle in the US Virgin Islands, and simulation-degraded real-world scenes, not only do we see increased quantitative performance on rendering novel viewpoints from the scene with the medium present, but are also able to recover the underlying true color of the scene and restore renders to be without the presence of the intervening medium. We show that the underwater image formation helps learn scene structure, with better depth maps, as well as show that our improvements maintain the significant computational improvements afforded by leveraging a 3D Gaussian representation. Code, data, and visualizations are available at https://seasplat.github.io

Abstract:
While collaborative robots effectively combine robotic precision with human capabilities, traditional control methods such as button presses or hand guidance can be slow and physically demanding. This has led to an increasing interest in natural user interfaces that integrate hand gesturebased interactions for more intuitive and flexible robot control. Therefore, this paper systematically explores mid-air robot control by comparing position and rate control modes with different state-of-the-art and novel sensor placements. A user study was conducted to evaluate each combination in terms of accuracy, task duration, perceived workload, and physical exertion. Our results indicate that position control is more efficient than rate control. Traditional desk-mounted sensors can provide a good balance between accuracy and comfort. However, robot-mounted sensors are a viable alternative for short-term, accurate control with less spatial requirements. Legmounted sensors, while comfortable, pose challenges to handeye coordination. Based on these findings, we provide design implications for improving the usability and comfort of midair human-robot interaction. Future research should extend this evaluation to a wider range of tasks and environments.

Abstract:
Indoor air quality plays an essential role in the safety and well-being of occupants, especially in the context of airborne diseases. This paper introduces AeroSafe, a novel approach aimed at enhancing the efficacy of indoor air purification systems through a robotic cough emulator testbed and a digital-twins-based aerosol residence time analysis. Current portable air filters often overlook the concentrations of respiratory aerosols generated by coughs, posing a risk, particularly in high-exposure environments like healthcare facilities and public spaces. To address this gap, we present a robotic dual-agent physical emulator comprising a maneuverable mannequin simulating cough events and a portable air purifier autonomously responding to aerosols. The generated data from this emulator trains a digital twins model, combining a physics-based compartment model with a machine learning approach, using Long Short-Term Memory (LSTM) networks and graph convolution layers. Experimental results demonstrate the model's ability to predict aerosol concentration dynamics with a mean residence time prediction error within 35 seconds. The proposed system's real-time intervention strategies outperform static air filter placement, showcasing its potential in mitigating airborne pathogen risks.

Abstract:
Multi-Agent Path Finding (MAPF) is the problem of effectively finding efficient collision-free paths for a group of agents in a shared workspace. The MAPF community has largely focused on developing high-performance heuristic search methods. Recently, several works have applied various machine learning (ML) techniques to solve MAPF, usually involving sophisticated architectures, reinforcement learning techniques, and set-ups, but none using large amounts of high-quality supervised data. Our initial objective in this work was to show how simple large-scale imitation learning of high-quality heuristic search methods can lead to state-of-the-art ML MAPF performance. However, we find that, at least with our model architecture, simple large-scale (700k examples with hundreds of agents per example) imitation learning does not produce impressive results. Instead, we find that by using prior work that post-processes MAPF model predictions to resolve 1-step collisions (CS-PIBT), we can train a simple ML MAPF policy in minutes that dramatically outperforms existing ML MAPF policies. This has serious implications for all future ML MAPF policies (with local communication) which currently struggle to scale. In particular, this finding implies that future learnt policies should always (1) use smart 1-step collision shields (e.g, CS-PIBT) and (2) include the collision shield with greedy actions as a baseline (e.g. PIBT), as well as (3) motivates future models to focus on longer horizon / more complex planning as 1-step collisions can be efficiently resolved.

Abstract:
We consider the problem of indoor building-scale social navigation, where the robot must reach a point goal as quickly as possible without colliding with humans who are freely moving around. Factors such as varying crowd densities, unpredictable human behavior, and the constraints of indoor spaces add significant complexity to the navigation task, necessitating a more advanced approach. We propose a modular navigation framework that leverages the strengths of both classical methods and deep reinforcement learning (DRL). Our approach employs a global planner to generate waypoints, assigning soft costs around anticipated pedestrian locations, encouraging caution around potential future positions of humans. Simultaneously, the local planner, powered by DRL, follows these waypoints while avoiding collisions. The combination of these planners enables the agent to perform complex maneuvers and effectively navigate crowded and constrained environments while improving reliability. Many existing studies on social navigation are conducted in simplistic or open environments, limiting the ability of trained models to perform well in complex, real-world settings. To advance research in this area, we introduce a new 2D benchmark designed to facilitate development and testing of social navigation strategies in indoor environments.22Simulator and code: https://github.com/arnabGMU/hybrid_social_nav We benchmark our method against traditional and RL-based navigation strategies, demonstrating that our approach outperforms both.

Abstract:
Navigating robots in dynamic environments, such as human crowds, is a major challenge due to the trade-off between performance and robustness. Traditional reinforcement learning methods, such as Proximal Policy Optimization (PPO), have shown strong adaptation capabilities but require extensive training and lack explicit mechanisms for collision avoidance. On the other hand, rule-based approaches, such as the Dynamic Window Approach (DWA), offer computational efficiency but struggle with generalization to unseen crowd behaviors. The proposed SafePCA framework aims to address this trade-off by integrating Cellular Automata (CA) into PPO-based navigation. CA enhances robustness by predicting high-risk areas based on pedestrian movement patterns, reducing unnecessary collisions. However, this approach may lead to conservative behavior, potentially affecting navigation performance in reaching the goal efficiently. The core research question addressed in this work is whether SafePCA can balance these trade-offs to ensure safe yet efficient robot navigation in dynamic crowds. Experiments demonstrate that SafePCA outperforms traditional PPO by providing superior risk assessment and avoidance strategies, achieving optimal performance with fewer training episodes. SafePCA's real-time adaptability ensures robust navigation in dynamic environments. By leveraging PPO's adaptive learning and CA's risk analysis, SafePCA offers an efficient solution for autonomous robot navigation in crowded environments, advancing the field and broadening application possibilities.

Abstract:
LiDAR odometry is essential for many robotics applications, including 3D mapping, navigation, and simultaneous localization and mapping. LiDAR odometry systems are usually based on some form of point cloud registration to compute the ego-motion of a mobile robot. Yet, few of today's LiDAR odometry systems consider domain-specific knowledge or the kinematic model of the mobile platform during the point cloud alignment. In this paper, we present Kinematic-ICP, a LiDAR odometry system that focuses on wheeled mobile robots equipped with a 3D LiDAR and moving on a planar surface, which is a common assumption for warehouses, offices, hospitals, etc. Our approach introduces kinematic constraints within the optimization of a traditional point-to-point iterative closest point scheme. In this way, the resulting motion follows the kinematic constraints of the platform, effectively exploiting the robot's wheel odometry and the 3D LiDAR observations. We dynamically adjust the influence of LiDAR measurements and wheel odometry in our optimization scheme, allowing the system to handle degenerate scenarios such as feature-poor corridors. We evaluate our approach on robots operating in large-scale warehouse environments, but also outdoors. The experiments show that our approach achieves top performances and is more accurate than wheel odometry and common LiDAR odometry systems. Kinematic-ICP has been recently deployed in the Dexory fleet of robots operating in warehouses worldwide at their customers' sites, showing that our method can run in the real world alongside a complete navigation stack.

Abstract:
Human activity recognition (HAR) based on millimeter wave (mmWave) radar has recently attracted significant interest due to its diverse applications in intelligent robots and human-computer interaction (HCI), including the healthcare monitoring robot. 2-dimensional (2D) histogram features of radar point clouds have demonstrated high accuracy in HAR. But further expansion and refinement of this technique is needed. This paper presents a new precise non-invasive HAR framework based on radar point cloud 2D histograms. Our method enhances conventional 2D histograms by integrating fixed radar sensing boundaries into the histograms, which shows the relative spatial position changes of the target points detected by radar. Additionally, we have concatenated Doppler features (i.e., range-Doppler and angle-Doppler histograms) with the point cloud histograms, resulting in a more comprehensive feature representation than conventional point cloud histograms. We investigated the overfitting issue in stacked hybrid networks and established a multi-layer hybrid network with an optimal number of stacked layers for HAR. In the evaluation, our approach achieves state-of-the-art accuracy, with 99.72% on mmWaveRadarWalking dataset and 98.67% on CI4R-Human-Activity-Recognition dataset, respectively. The proposed method can be applied in the fields of robotics and HCI.

Abstract:
Tangram assembly, the art of human intelligence and manipulation dexterity, is a new challenge for robotics and reveals the limitations of state-of-the-arts. Here, we describe our initial exploration and highlight key problems in reasoning, planning, and manipulation for robotic tangram assembly. We present MRChaos (Master Rules from Chaos), a robust and general solution for learning assembly policies that can generalize to novel objects. In contrast to conventional methods based on prior geometric and kinematic models, MRChaos learns to assemble randomly generated objects through self-exploration in simulation without prior experience in assembling target objects. The reward signal is obtained from the visual observation change without manually designed models or annotations. MRChaos retains its robustness in assembling various novel tangram objects that have never been encountered during training, with only silhouette prompts. We show the potential of MRChaos in wider applications such as cutlery combinations. The presented work indicates that radical generalization in robotic assembly can be achieved by learning in much simpler domains. The code will be available https://robotll.github.io/MasterRulesFromChaos/.

Abstract:
Continuum robots offer high flexibility and multiple degrees of freedom, making them ideal for navigating narrow lumens. However, accurately modeling their behavior under large deformations and frequent environmental contacts remains challenging. Current methods for solving the deformation of these robots, such as the Model Order Reduction and Gauss-Seidel (GS) methods, suffer from significant drawbacks. They experience reduced computational speed as the number of contact points increases and struggle to balance speed with model accuracy. To overcome these limitations, we introduce a novel finite element method (FEM) named Acc-FEM. Acc-FEM employs a large deformation quasi-static finite element model and integrates an accelerated solver scheme to handle multi-contact simulations efficiently. Additionally, it utilizes parallel computing with Graphics Processing Units (GPU) for real-time updates of the finite element models and collision detection. Extensive numerical experiments demonstrate that Acc-Fem significantly improves computational efficiency in modeling continuum robots with multiple contacts while achieving satisfactory accuracy, addressing the deficiencies of existing methods.

Abstract:
With robots becoming increasingly prevalent in various domains, it has become crucial to equip them with tools to achieve greater fluency in interactions with humans. One of the promising areas for further exploration lies in human trust. A real-time, objective model of human trust could be used to maximize productivity, preserve safety, and mitigate failure. In this work, we attempt to use physiological measures, gaze, and facial expressions to model human trust in a robot partner. We are the first to design an inperson, human-robot supervisory interaction study to create a dedicated trust dataset. Using this dataset, we train machine learning algorithms to identify the objective measures that are most indicative of trust in a robot partner, advancing trust prediction in human-robot interactions. Our findings indicate that a combination of sensor modalities (blood volume pulse, electrodermal activity, skin temperature, and gaze) can enhance the accuracy of detecting human trust in a robot partner. Furthermore, the Extra Trees, Random Forest, and Decision Trees classifiers exhibit consistently better performance in measuring the person's trust in the robot partner. These results lay the groundwork for constructing a real-time trust model for human-robot interaction, which could foster more efficient interactions between humans and robots.

Abstract:
Retrieving target objects from unknown, confined spaces remains a challenging task that requires integrated, task-driven active sensing and rearrangement planning. Previous approaches have independently addressed active sensing and rearrangement planning, limiting their practicality in real-world scenarios. This paper presents a new, integrated heuristic-based active sensing and Monte-Carlo Tree Search (MCTS)-based retrieval planning approach. These components provide feedback to one another to actively sense critical, unobserved areas suitable for the retrieval planner to plan a sequence for relocating path-blocking obstacles and a collisionfree trajectory for retrieving the target object. We demonstrate the effectiveness of our approach using a robot arm equipped with an in-hand camera in both simulated and real-world confined, cluttered scenarios. Our framework is compared against various state-of-the-art methods. The results indicate that our proposed approach outperforms baseline methods by a significant margin in terms of the success rate, the object rearrangement planning time consumption and the number of planning trials before successfully retrieving the target.

Abstract:
We present the Magnetic Origami Reprogram-ming and Folding System (MORF), a magnetically repro-grammable system capable of precise shape control, repeated transformations, and adaptive functionality for robotic applications. Unlike current self-folding systems, which often lack re-programmability or lose rigidity after folding, MORF generates stiff structures over multiple folding cycles without degradation in performance. The ability to reconfigure and maintain structural stability is crucial for tasks such as reconfigurable tooling. The system utilizes a thermoplastic layer sandwiched within a thin magnetically responsive laminate sheet, enabling structures to self-fold in response to a combination of external magnetic field and heating. We demonstrate that the resulting folded structures can bear loads over 40 times their own weight and can undergo up to 50 cycles of repeated transformations without losing structural integrity. We showcase these strengths in a reconfigurable tool for unscrewing and screwing bolts and screws of various sizes, allowing the tool to adapt its shape to different bolt sizes while withstanding the mechanical stresses involved. This capability highlights the system's potential for task-varying, load-bearing applications in robotics, where both versatility and durability are essential.

Abstract:
We present the design and implementation of HASTA (Hopper with Adjustable Stiffness for Terrain Adaption), a vertical hopping robot with real-time tunable leg stiffness, aimed at optimizing energy efficiency across various ground profiles (a pair of ground stiffness and damping conditions). By adjusting leg stiffness, we aim to maximize apex hopping height, a key metric for energy-efficient vertical hopping. We hypothesize that softer legs perform better on soft, damped ground by minimizing penetration and energy loss, while stiffer legs excel on hard, less damped ground by reducing limb deformation and energy dissipation. Through experimental tests and simulations, we find the best leg stiffness within our selection for each combination of ground stiffness and damping, enabling the robot to achieve maximum steadystate hopping height with a constant energy input. These results support our hypothesis that tunable stiffness improves energyefficient locomotion in controlled experimental conditions. In addition, the simulation provides insights that could aid in future development of controllers for selecting leg stiffness.

Abstract:
The environments in which the collaboration of a robot would be the most helpful to a person are frequently uncontrolled and cluttered with many objects present. Legible robot arm motion is crucial in tasks like these in order to avoid possible collisions, improve the workflow and help ensure the safety of the person. Prior work in this area, however, focuses on solutions that are tested only in uncluttered environments and there are not many results taken from cluttered environments. In this research we present a measure for clutteredness based on an entropic measure of the environment, and a novel motion planner based on potential fields. Both our measure and the planner were tested in a cluttered environment meant to represent a more typical tool-sorting task for which the person would collaborate with a robot. The in-person validation study with Baxter robots shows a significant improvement in legibility of our proposed legible motion planner compared to the current state-of-the-art legible motion planner in cluttered environments. Further, the results show a significant difference in the performance of the planners in cluttered and uncluttered environments, and the need to further explore legible motion in cluttered environments. We argue that the inconsistency of our results in cluttered environments with those obtained from uncluttered environments points out several important issues with the current research performed in the area of legible motion planners.

Abstract:
In this paper, we present a novel synergistic framework for learning shape estimation and a shape-aware whole-body control policy for tendon driven continuum robots. Our approach leverages the interaction between two Augmented Neural Ordinary Differential Equations (ANODEs) — the Shape-NODE and Control-NODE — to achieve continuous shape estimation and shape-aware control. The Shape-NODE integrates prior knowledge from Cosserat rod theory, allowing it to adapt and account for model mismatches, while the Control-NODE uses this shape information to optimize a whole-body control policy, trained in a Model Predictive Control (MPC) fashion. This unified framework effectively overcomes limitations of existing data-driven methods, such as poor shape awareness and challenges in capturing complex nonlinear dynamics. Extensive evaluations in both simulation and real-world environments demonstrate the framework's robust performance in shape estimation, trajectory tracking, and obstacle avoidance. The proposed method consistently outperforms state-of-the-art end-to-end, Neural-ODE, and Recurrent Neural Network (RNN) models, particularly in terms of tracking accuracy and generalization capabilities. The code and pretrained models are available at https://github.com/SIRGLab/WholeBodyControl_CTR.

Abstract:
A neural network-based framework is developed and experimentally demonstrated for the problem of estimating the shape of a soft continuum arm (SCA) from noisy measurements of the pose at a finite number of locations along the length of the arm. The neural network takes as input these measurements and produces as output a finitedimensional approximation of the strain, which is further used to reconstruct the infinite-dimensional smooth posture. This problem is important for various soft robotic applications. It is challenging due to the flexible aspects that lead to the infinitedimensional reconstruction problem for the continuous posture and strains. Because of this, past solutions to this problem are computationally intensive. The proposed fast smooth reconstruction method is shown to be five orders of magnitude faster while having comparable accuracy. The framework is evaluated on two testbeds: a simulated octopus muscular arm and a physical BR2 pneumatic soft manipulator.

Abstract:
In this paper we detail the methods used for obstacle avoidance, path planning, and trajectory tracking that helped us win the adult-sized, autonomous humanoid soccer league in RoboCup 2024. Our team was undefeated for all seated matches and scored 45 goals over 6 games, winning the championship game 6 to 1. During the competition, a major challenge for collision avoidance was the measurement noise coming from bipedal locomotion and a limited field of view (FOV). Furthermore, obstacles would sporadically jump in and out of our planned trajectory. At times our estimator would place our robot inside a hard constraint. Any planner in this competition must also be be computationally efficient enough to re-plan and react in real time. This motivated our approach to trajectory generation and tracking. In many scenarios long-term and short-term planning is needed. To efficiently find a long-term general path that avoids all obstacles we developed DAVG (Dynamic Augmented Visibility Graphs). DAVG focuses on essential path planning by setting certain regions to be active based on obstacles and the desired goal pose. By augmenting the states in the graph, turning angles are considered, which is crucial for a large soccer playing robot as turning may be more costly. A trajectory is formed by linearly interpolating between discrete points generated by DAVG. A modified version of model predictive control (MPC) is used to then track this trajectory called cf-MPC (Collision-Free MPC). This ensures short-term planning. Without having to switch formulations cf-MPC takes into account the robot dynamics and collision free constraints. Without a hard switch the control input can smoothly transition in cases where the noise places our robot inside a constraint boundary. The nonlinear formulation runs at approximately 120 Hz, while the quadratic version achieves around 400 Hz.

Abstract:
The 3D perception of satellites, including both their shape and pose, is a key foundation for robotic on-orbit servicing. However, the demanding space environment-such as intense and dim illumination-presents significant challenges. Previous non-cooperative methods focus on specific geometric features like solar panel brackets or docking rings, overlooking the satellite's overall shape and increasing the risk of collisions during grasping. Additionally, satellites are often weakly textured, limiting the accuracy of 3D perception. To address these issues, we propose, for the first time, a 3D perceptionbased visual servo system of non-cooperative satellites. This system combines reconstruction and tracking to enhance shape perception and pose estimation accuracy in orbital conditions. Specifically, we employ an alternating iterative strategy to simultaneously reconstruct and track the satellite and introduce a novel constraint to fuse different cues under extreme conditions. Further, we develop a simulation environment platform, a dualarm microgravity grasping system, and an online monitoring module to enhance system capabilities for on-orbit servicing. Synthetic and real-world datasets from the simulation environment are also created for experimental validation. Results show that each module of our system achieves state-of-the-art performance.

Abstract:
This paper exploits the robotic capabilities of an orbital manipulator equipped with an actuation module at its end-effector to perform close-proximity robotic operations. The proposed control strategy enables repositioning the system's center-of-mass by reconfiguring the manipulator configuration and using the end-effector-mounted thrusting mechanism to achieve displacement. The key advantage of the proposed method is that the plume impingement due to thruster firing of the servicer satellite in close-proximity operations towards the client is mitigated. This is achieved by regulating the internal motion of the manipulator such that the thrust firing does not occur near the space asset. The effectiveness of the controller is verified through a multibody dynamic simulation of an orbital manipulator.

Abstract:
This paper presents a novel method for assessing the resilience of the iterative closest point (ICP) algorithm via learning-based, worst-case attacks on lidar point clouds. For safety-critical applications such as autonomous navigation, ensuring the resilience of algorithms before deployments is crucial. The ICP algorithm is the standard for lidar-based localization, but its accuracy can be greatly affected by corrupted measurements from various sources, including occlusions, adverse weather, or mechanical sensor issues. Unfortunately, the complex and iterative nature of ICP makes assessing its resilience to corruption challenging. While there have been efforts to create challenging datasets and develop simulations to evaluate the resilience of ICP, our method focuses on finding the maximum possible ICP error that can arise from corrupted measurements at a location. We demonstrate that our perturbation-based adversarial attacks can be used pre-deployment to identify locations on a map where ICP is particularly vulnerable to corruptions in the measurements. With such information, autonomous robots can take safer paths when deployed, to mitigate against their measurements being corrupted. The proposed attack outperforms baselines more than 88% of the time across a wide range of scenarios.

Abstract:
Robust and accurate 2D/3D registration, which aligns preoperative models with intraoperative images of the same anatomy, is crucial for successful interventional navigation. To mitigate the challenge of a limited field of view in single-image intraoperative scenarios, multi-view 2D/3D registration is required by leveraging multiple intraoperative images. In this paper, we propose a novel multi-view 2D/3D rigid registration approach comprising two stages. In the first stage, a combined loss function is designed, incorporating both the differences between predicted and ground-truth poses and the dissimilarities (e.g., normalized cross-correlation) between simulated and observed intraoperative images. More importantly, additional cross-view training loss terms are introduced for both pose and image losses to explicitly enforce cross-view constraints. In the second stage, test-time optimization is performed to refine the estimated poses from the coarse stage. Our method exploits the mutual constraints of multi-view projection poses to enhance the robustness of the registration process. The proposed framework achieves a mean target registration error (mTRE) of 0.79+2.17\ \mathbfmm on six specimens from the DeepFluoro dataset, demonstrating superior performance compared to state-of-the-art registration algorithms.

Abstract:
The fusion of LiDAR and camera sensors has demonstrated significant effectiveness in achieving accurate detection for short-range tasks in autonomous driving. However, this fusion approach could face challenges when dealing with long-range detection scenarios due to disparity between sparsity of LiDAR and high-resolution camera data. Moreover, sensor corruption introduces complexities that affect the ability to maintain robustness, despite the growing adoption of sensor fusion in this domain. We present SaViD, a novel framework comprised of a three-stage fusion alignment mechanism designed to address long-range detection challenges in the presence of natural corruption. The SaViD framework consists of three key elements: the Global Memory Attention Network (GMAN), which enhances the extraction of image features through offering a deeper understanding of global patterns; the Attentional Sparse Memory Network (ASMN), which enhances the inte-gration of LiDAR and image features; and the KNNnectivity Graph Fusion (KGF), which enables the entire fusion of spatial information. SaViD achieves superior performance on the long-range detection Argoverse-2 (AV2) dataset with a performance improvement of 9.87% in AP value and an improvement of 2.39% in mAPH for L2 difficulties on the Waymo Open dataset (WOD). Comprehensive experiments are carried out to showcase its robustness against 14 natural sensor corruptions. SaViD exhibits a robust performance improvement of 31.43% for AV2 and 16.13% for WOD in RCE value compared to other existing fusion-based methods while considering all the corruptions for both datasets. Our code is available at SaVil).

Abstract:
Laboratory automation is a key driver for higher efficiency and reproducibility of experiments and measurements in natural science laboratories. One process that is particularly susceptible to both manual errors in the physical handling of labware, faulty data analyses, and incomplete reporting is the quantitative Polymerase Chain Reaction (qPCR). It is a ubiquitous analysis method in biolaboratories to amplify and measure the amount of a specific DNA sequence in a sample. Our system, which we call the qPCRBot, addresses these issues through three key pillars: automating data analysis and handling processes, standardizing data management and system communication protocols, and utilizing a robotic manipulator for labware transport. To achieve this, we developed a SiLA 2-based client-server architecture for unified and standardized access to both the qPCR device and the robot. For the manipulator, we implemented a Cartesian motion generator to ensure proper labware transport. We transform all experiment data to a standardized, XML-based format and integrate a widely-used Laboratory Information Management System for its storage. These developments collectively enable streamlined qPCR measurements without human interaction, thus enhancing both efficiency and reproducibility.

Abstract:
Humanoid robotics faces significant challenges in achieving stable locomotion and recovering from falls in dynamic environments. Traditional methods, such as Model Predictive Control (MPC) and Key Frame Based (KFB) routines, either require extensive fine-tuning or lack real-time adaptability. This paper introduces FRASA, a Deep Reinforcement Learning (DRL) agent that integrates fall recovery and stand up strategies into a unified framework. Leveraging the Cross-Q algorithm, FRASA significantly reduces training time and offers a versatile recovery strategy that adapts to unpredictable disturbances. Comparative tests on Sigmaban humanoid robots demonstrate FRASA superior performance against the KFB method deployed in the RoboCup 2023 by the Rhoban Team, world champion of the KidSize League.

Abstract:
Robots operating in unstructured environments face significant challenges when interacting with everyday objects like doors. They particularly struggle to generalize across diverse door types and conditions. Existing vision-based and open-loop planning methods often lack the robustness to handle varying door designs, mechanisms, and push/pull configurations. In this work, we propose a haptic-aware closed-loop hierarchical control framework that enables robots to explore and open different unseen doors in the wild. Our approach leverages real-time haptic feedback, allowing the robot to adjust its strategy dynamically based on force feedback during manipulation. We test our system on 20 unseen doors across different buildings, featuring diverse appearances and mechanical types. Our framework achieves a 90% success rate, demonstrating its ability to generalize and robustly handle varied door-opening tasks. This scalable solution offers potential applications in broader open-world articulated object manipulation tasks.

Abstract:
This paper introduces CognitiveOS, the first operating system designed for cognitive robots capable of functioning across diverse robotic platforms. CognitiveOS is structured as a multi-agent system comprising modules built upon a transformer architecture, facilitating communication through an internal monologue format. These modules collectively empower the robot to tackle intricate real-world tasks. The paper delineates the operational principles of the system along with the descriptions of its nine distinct modules. The modular design endows the system with distinctive advantages over traditional end-to-end methodologies, notably in terms of adaptability and scalability. The system's modules are configurable, modifiable, or deactivatable depending on the task requirements, while new modules can be seamlessly integrated. This system serves as a foundational resource for researchers and developers in the Cognitive Robotics domain, alleviating the burden of constructing a cognitive robot system from scratch. Experimental findings demonstrate the system's advanced task comprehension and adaptability across varied tasks, robotic platforms, and module configurations, underscoring its potential for realworld applications. Moreover, in the category of Reasoning it outperformed CognitiveDog (by 15%) and RT2 (by 31%), achieving the highest to date rate of 77 %. We provide a code repository and dataset for the replication of CognitiveOS: https://github.com/Arcwy0/cognitiveos

Abstract:
In recent years, magnetically controlled microrobots have garnered significant attention. This paper presents the H-robot, a self-designed microrobot featuring an innovative structure. The H-robot features a honeycomb porous spherical design specifically engineered to enhance cargo capacity. A new dynamic model for this structure has been developed for low Reynolds number fluid environments, along with a robust backstepping sliding mode control (RBSMC) strategy. Experiments were conducted in a calibrated magnetic field generated by a magnetic field generator to achieve precise motion control. The results demonstrate that the H-robot accurately tracks standard trajectories, with root mean square errors (RMSE) of 9.09 × 10^-4 \mathbf~ m for the Number-8 path and 8.29 × 10^-4 \mathbf~ m for the S-shaped path. Additionally, the proposed resistance model enhances tracking accuracy by 73.61% compared to traditional models, effectively adjusting the dynamic behavior of the H-robot in low Reynolds number fluids and significantly improving its motion performance. Finally, path planning experiments in a maze demonstrate the H-robot's ability to navigate and avoid obstacles.

Abstract:
The nonlinear dynamics required to model walking with multi-joint lower limb exoskeleton assistance results in high computational burden. To address this, we derive a Koopman-based linearized model of the human-exoskeleton system using electromyography and ultrasound-derived metrics of volitional muscle activity during exoskeleton-assisted walking. Data are collected from one participant with spinal cord injury (SCI) and two participants with no disabilities. Various electromyography and ultrasound-derived features in addition to normalized motor currents are used to derive predictive models, and we identify which muscle activation metrics produce the most accurate model for each subject. For both subjects without disabilities, the most accurate model uses only ultrasound-derived echogenicity as a metric of muscle activity, while the most accurate model for the subject with SCI uses only EMG wave length. Furthermore, the inclusion of ground reaction force increases the prediction accuracy of all models for one participant with no disabilities while decreasing the accuracy of most models for the participant with SCI. For all subjects, the most accurate subject-speclfic linear model has a root-mean-square error (averaged across limb segment angles) of < 8°.

Abstract:
Autonomous robot manipulation is a complex and continuously evolving robotics field. This paper focuses on data augmentation methods in imitation learning. Imitation learning consists of three stages: data collection from experts, learning model, and execution. However, collecting expert data requires manual effort and is time-consuming. Additionally, as sensors have different data acquisition intervals, preprocessing such as downsampling to match the lowest frequency is necessary. Downsampling enables data augmentation and also contributes to the stabilization of robot operations. In light of this background, this paper proposes the Data Augmentation Method for Bilateral Control-Based Imitation Learning with Images, called “DABI”. DABI collects robot joint angles, velocities, and torques at 1000 Hz, and uses images from gripper and environmental cameras captured at 100 Hz as the basis for data augmentation. This enables a tenfold increase in data. In this paper, we collected just 5 expert demonstration datasets. We trained the bilateral control Bi-ACT model with the unaltered dataset and two augmentation methods for comparative experiments and conducted real-world experiments. The results confirmed a significant improvement in success rates, thereby proving the effectiveness of DABI. For additional material, please check: https://mertcookimg.github.io/dabi

Abstract:
This research aims to demonstrate how integrating human-robot teaming dynamics into mission planning tools impacts the abilities of robot operators as they coordinate multiple robot agents during a mission. This was investigated in a pilot study using two inter-robot collaboration modalities and interface tools, which required different human-robot interaction techniques to execute a mission with a team of four robots. In the first modality, the operator manually inserted waypoints for each robot, as they acted as individual agents. In the second modality, the operator used the Planning Execution to After-Action Review (PETAAR) toolset to plot a single waypoint for the team of robots, as the robots coordinated their movement as a group. One novel component of this study is the investigation of how human-robot teaming dynamics and the PETAAR toolset impacted robot operators' real-time situation awareness and perceived cognitive load as well as team performance. Although the teaming modalities differed greatly with respect to the level of operator input needed, the time required to complete the simulation, the participant's perceived cognitive load, and interface usability were very similar for both modalities. In contrast, the results revealed statistically significant differences between the two teaming modalities related to participants' abilities to maintain a wedge formation while remaining situationally aware. Results from this work will be used to guide development of PETAAR along with the design of future studies investigating more complex teaming scenarios and for creating a baseline for comparing future results.

Abstract:
We present Dur360BEV, a novel spherical camera autonomous driving dataset equipped with a high-resolution 128-channel 3D LiDAR and a RTK-refined GNSS/INS system, along with a benchmark architecture designed to generate Bird-Eye-View (BEV) maps using only a single spherical camera. This dataset and benchmark address the challenges of BEV generation in autonomous driving, particularly by reducing hardware complexity through the use of a single 360-degree camera instead of multiple perspective cameras. Within our benchmark architecture, we propose a novel spherical-image-to-BEV module that leverages spherical imagery and a refined sampling strategy to project features from 2D to 3D. Our approach also includes an innovative application of focal loss, specifically adapted to address the extreme class imbalance often encountered in BEV segmentation tasks, that demonstrates improved segmentation performance on the Dur360BEV dataset. The results show that our benchmark not only simplifies the sensor setup but also achieves competitive performance. Code + Dataset: https://github.com/Tom-E-DurhamJDur360BEV

Abstract:
With the ongoing miniaturization, recently, lightweight, commercial underwater vehicle-manipulator systems (UVMSs) have emerged that massively lower the entry barrier into underwater manipulation. Within this research field, dynamic and accurate end effector trajectory tracking is a crucial first step in developing autonomous capabilities. In this context, coupling effects between the manipulator and vehicle dynamics are expected to pose a considerable challenge. However, UVMS control strategies analyzed in detailed experimental studies are particularly rare. We present a holistic approach based on task-priority control that we describe and discuss from modeling towards extensive experimental studies, which are crucial for the notoriously hard-to-simulate underwater domain. We demonstrate this framework on the widely used platform of a BlueROV2 and an Alpha 5 manipulator. The end effector trajectory tracking is shown to be highly accurate, with <4 \textcm median position error. Moreover, our experimental findings on the consideration of dynamic coupling within UVMS control motivate further research. The code is available at https://github.com/HippoCampusRobotics/uvms. A video of the results is available at https://youtu.be/IDM1I5KqlVI.

Abstract:
An embodied agent assisting humans is often asked to complete new tasks, and there may not be sufficient time or labeled examples to train the agent to perform these new tasks. Large Language Models (LLMs) trained on considerable knowledge across many domains can be used to predict a sequence of abstract actions for completing such tasks, although the agent may not be able to execute this sequence due to task-, agent-, or domain-specific constraints. Our framework addresses these challenges by leveraging the generic predictions provided by LLM and the prior domain knowledge encoded in a Knowledge Graph (KG), enabling an agent to quickly adapt to new tasks. The robot also solicits and uses human input as needed to refine its existing knowledge. Based on experimental evaluation in the context of cooking and cleaning tasks in simulation domains, we demonstrate that the interplay between LLM, KG, and human input leads to substantial performance gains compared with just using the LLM. Project website1§Project supported in part by TCS Research India: https://sssshivvvv.github.io/adaptbot/

Abstract:
Robot-assisted Endoscopic Submucosal Dissection (ESD) improves the surgical procedure by providing a more comprehensive view through advanced robotic instruments and bimanual operation, thereby enhancing dissection efficiency and accuracy. Accurate prediction of dissection trajectories is crucial for better decision-making, reducing intraoperative errors, and improving surgical training. Nevertheless, predicting these trajectories is challenging due to variable tumor margins and dynamic visual conditions. To address this issue, we create the ESD Trajectory and Confidence Map-based Safety Margin (ETSM) dataset with 1849 short clips, focusing on submucosal dissection with a dual-arm robotic system. We also introduce a framework that combines optimal dissection trajectory prediction with a confidence map-based safety margin, providing a more secure and intelligent decision-making tool to minimize surgical risks for ESD procedures. Additionally, we propose the Regression-based Confidence Map Prediction Network (RCMNet), which utilizes a regression approach to predict confidence maps for dissection areas, thereby delineating various levels of safety margins. We evaluate our RCMNet using three distinct experimental setups: in-domain evaluation, robustness assessment, and out-of-domain evaluation. Experimental results show that our approach excels in the confidence map-based safety margin prediction task, achieving a mean absolute error (MAE) of only 3.18. To the best of our knowledge, this is the first study to apply a regression approach for visual guidance concerning delineating varying safety levels of dissection areas. Our approach bridges gaps in current research by improving prediction accuracy and enhancing the safety of the dissection process, showing great clinical significance in practice. The dataset and code are available at https://github.com/FrankMOWJ/RCMNet.

Abstract:
Recent advancements in soft actuators have enabled soft continuum swimming robots to achieve higher efficiency and more closely mimic the behaviors of real marine animals. However, optimizing the design and control of these soft continuum robots remains a significant challenge. In this paper, we present a practical framework for the co-optimization of the design and control of soft continuum robots, approached from a geometric locomotion analysis perspective. This framework is based on the principles of geometric mechanics, accounting for swimming at both low and high Reynolds numbers. By generalizing geometric principles to continuum bodies, we achieve efficient geometric variational co-optimization of designs and gaits across different power consumption metrics and swimming environments. The resulting optimal designs and gaits exhibit greater efficiencies at both low and high Reynolds numbers compared to three-link or serpenoid swimmers with the same degrees of freedom, approaching or even surpassing the efficiencies of infinitely flexible swimmers and those with higher degrees of freedom.

Abstract:
This paper presents a framework for multi-agent navigation in structured but dynamic environments, integrating three key components: a shared semantic map encoding metric and semantic environmental knowledge, a claim policy for coordinating access to areas within the environment, and a Model Predictive Controller for generating motion trajectories that respect environmental and coordination constraints. The main advantages of this approach include: (i) enforcing area occupancy constraints derived from specific task requirements; (ii) enhancing computational scalability by eliminating the need for collision avoidance constraints between robotic agents; and (iii) the ability to anticipate and avoid deadlocks between agents. The paper includes both simulations and physical experiments demonstrating the framework's effectiveness in various representative scenarios.

Abstract:
Multi-quadrotor systems face significant challenges in decentralized control, particularly with safety and coordination under sensing and communication limitations. State-of-the-art methods leverage Control Barrier Functions (CBFs) to provide safety guarantees but often neglect actuation constraints and limited detection range. To address these gaps, we propose a novel decentralized Nonlinear Model Predictive Control (NMPC) that integrates Exponential CBFs (ECBFs) to enhance safety and optimality in multi-quadrotor systems. We provide both conservative and practical minimum bounds of the range that preserve the safety guarantees of the ECBFs. We validate our approach through extensive simulations with up to 10 quadrotors and 20 obstacles, as well as real-world experiments with 3 quadrotors. Results demonstrate the effectiveness of the proposed framework in realistic settings, highlighting its potential for reliable quadrotor teams operations.

Abstract:
This paper presents an iterative approach for heterogeneous multi-agent route planning in environments with unknown resource distributions. We focus on a team of robots with diverse capabilities tasked with executing missions specified using Capability Temporal Logic (CaTL), a formal framework built on Signal Temporal Logic to handle spatial, temporal, capability, and resource constraints. The key challenge arises from the uncertainty in the initial distribution and quantity of resources in the environment. To address this, we introduce an iterative algorithm that dynamically balances exploration and task fulfillment. Robots are guided to explore the environment, identifying resource locations and quantities while progressively refining their understanding of the resource landscape. At the same time, they aim to maximally satisfy the mission objectives based on the current information, adapting their strategies as new data is uncovered. This approach provides a robust solution for planning in dynamic, resource-constrained environments, enabling efficient coordination of heterogeneous teams even under conditions of uncertainty. Our method's effectiveness and performance are demonstrated through simulated case studies.

Abstract:
The moving target traveling salesman problem with obstacles (MT-TSP-O) seeks an obstacle-free trajectory for an agent that intercepts a given set of moving targets, each within specified time windows, and returns to the agent's starting position. Each target moves with a constant velocity within its time windows, and the agent has a speed limit no smaller than any target's speed. We present FMC-TSP, the first complete and bounded-suboptimal algorithm for the MT-TSP-O, and results for an agent whose configuration space is \mathbbR^3. Our algorithm interleaves a high-level search and a lowlevel search, where the high-level search solves a generalized traveling salesman problem with time windows (GTSP-TW) to find a sequence of targets and corresponding time windows for the agent to visit. Given such a sequence, the low-level search then finds an associated agent trajectory. To solve the low-level planning problem, we develop a new algorithm called FMC, which finds a shortest path on a graph of convex sets (GCS) via implicit graph search and pruning techniques specialized for problems with moving targets. We test FMC-TSP on 280 problem instances with up to 40 targets and demonstrate its smaller median runtime than a baseline based on prior work.

Abstract:
Robot interaction control is often limited to low dynamics or low flexibility, depending on whether an active or passive approach is chosen. In this work, we introduce a hybrid control scheme that combines the advantages of active and passive interaction control. To accomplish this, we propose the design of a novel Active Remote Center of Compliance (ARCC), which is based on a passive and active element which can be used to directly control the interaction forces. We introduce surrogate models for a dynamic comparison against purely robot-based interaction schemes. In a comparative validation, ARCC drastically improves the interaction dynamics, leading to an increase in the motion bandwidth of up to 31 times. We introduce further our control approach as well as the integration in the robot controller. Finally, we analyze ARCC on different industrial benchmarks like peg-in-hole, top-hat rail assembly and contour following problems and compare it against the state of the art, to highlight the dynamic and flexibility. The proposed system is especially suited if the application requires a low cycle time combined with a sensitive manipulation.

Abstract:
Multi-modal fusion can improve perceptual robustness and accuracy by fully utilizing multi-source sensor data. Current RGB-T fusion methods still falter with adverse illumination and weather. Recent advances in generative methods have shown the ability to enhance and restore visible images in adverse conditions. However, the fusion of RGB-T based on generative methods has not been studied in depth, due to limited attention given to the degradation of multi-modal features under challenging circumstances. Motivated by this observation, we propose CDMFusion, a three-branch conditional diffusion model that achieves fusion with dynamically enhancing multi-modal features and suppressing high-frequency interference. Specifically, we achieve feature-preserving fusion through three branches and establish a dynamic gating prediction module to adjust the enhancement of multi-modal features adaptively. In addition, considering the high time cost of existing diffusion models for generating fused images, we propose a skip patrol mechanism to achieve accelerated high-quality generation with no need for additional training. Experiments demonstrate our method achieves excellent performance in multiple datasets. The code and datasets are available at https://github.com/yangluojie/CDMFusion.

Abstract:
Miniature magnetic robots have attracted considerable attention as promising tools in biomedical applications due to their wireless actuation and precise controllability in a minimally invasive manner. Traditionally, magnetic microrobots have been controlled by globally applied magnetic torques and forces generated by external magnetic actuation systems (MASs), which typically require closed-loop control with real-time vision tracking—a challenging requirement in in-vivo environments. To address this issue, this paper suggests a novel open-loop control scheme for magnetic robots, using two-dimensional (2D) divergence control of a magnetic force generated by stationary electromagnets. Constraint equations for the currents applied to the electromagnets were established to achieve 2D divergence control of a magnetic force. Numerical simulation and experimental validations demonstrate that this approach can generate sufficient magnetic forces that either converge at or diverge from a target point, enabling effective open-loop position control of a miniature magnetic robot. Due to the absence of vision feedback and mechanical motions of magnets, the proposed control strategy could be more clinically applicable for medical applications of magnetic robots.

Abstract:
Building upon the foundations laid by our previous work, this paper introduces Arena 4.0, a significant advancement of Arena 3.0 [1], Arena-Bench [2], Arena 1.0 [3], and Arena 2.0 [4]. Arena 4.0 provides three main novel contributions: 1) a generative-model-based world and scenario generation approach using large language models (LLMs) and diffusion models, to dynamically generate complex, humancentric environments from text prompts or 2D floorplans that can be used for development and benchmarking of social navigation strategies. 2) A comprehensive 3D model database which can be extended with 3D assets and semantically linked and annotated using a variety of metrics for dynamic spawning and arrangements inside 3D worlds. 3) The complete migration towards ROS 2, which ensures operation with state-of-the-art hardware and functionalities for improved navigation, usability, and simplified transfer towards real robots. We evaluated the platforms performance through a comprehensive user study and its world generation capabilities for benchmarking demonstrating significant improvements in usability and efficiency compared to previous versions. Arena 4.0 is openly available at https://github.com/Arena-Rosnav.

Abstract:
Deep reinforcement learning (RL) often relies on simulators as abstract oracles to model interactions within complex environments. While differentiable simulators have recently emerged for multi-body robotic systems, they remain underutilized, despite their potential to provide richer information. This underutilization, coupled with the high computational cost of exploration-exploitation in high-dimensional state spaces, limits the practical application of RL in the real-world. We propose a method that integrates learning with differentiable simulators to enhance the efficiency of exploration-exploitation. Our approach learns value functions, state trajectories, and control policies from locally optimal runs of a model-based trajectory optimizer. The learned value function acts as a proxy to shorten the preview horizon, while approximated state and control policies guide the trajectory optimization. We benchmark our algorithm on three classical control problems and a torque-controlled 7 degree-of-freedom robot manipulator arm, demonstrating faster convergence and a more efficient symbiotic relationship between learning and simulation for end-to-end training of complex, poly-articulated systems.

Abstract:
In this paper, a soft rotary wing robot capable of flight and control is presented. The Collapsible Airfoil Single Actuator ROtor-craft (CASARO) is a single actuator monocopter that derives its geometric properties from the Samara seed. CASARO achieves better flight efficiency, lift, and handling ergonomics by reducing its overall volume by 91.7% when collapsed and stowed. Unlike conventional rotorcraft, CASARO uses a non-rigid fabric wing to produce lift in flight. It utilizes the robot's rotational velocity to maintain tension within its fabric and airframe, providing adequate lift during its hover state. The conception, design, construction, and control of the soft monowing are demonstrated, including its capability to reduce its footprint with its soft fabric construction. To analyze the flight dynamics of CASARO, the craft is flown indoors autonomously, tracking its wing surface, craft body attitude, and position with various step inputs to observe different wing dynamics. CASARO is also capable of being deployed outdoors for real-life human-operated flight.

Abstract:
Gas concentration estimation is crucial for understanding and mitigating climate change. While most research and monitoring efforts focus on major greenhouse gases such as CO2, significantly less attention has been given to trace gases like NO2, which play a critical role in atmospheric chemistry and air quality. This paper aims to enhance trace gas concentration estimation by integrating physics-based models into data-driven neural network frameworks. Furthermore, to improve large-scale estimation accuracy, we incorporate in-situ measurements to refine neural network models trained on satellite observations. The resulting model can provide reliable large-scale gas concentration estimates, particularly for locations lacking precise in-situ measurements. This approach offers a novel pathway to enhance the accuracy and applicability of gas monitoring for climate and environmental research. While NO2 serves as the target trace gas in this study, the proposed framework is potentially applicable to the prediction of other atmospheric gas concentrations.

Abstract:
Humanoid robots offer significant advantages for search and rescue tasks, thanks to their capability to traverse rough terrains and perform transportation tasks. In this study, we present a task and motion planning framework for search and rescue operations using a heterogeneous robot team composed of humanoids and aerial robots. We propose a terrain-aware Model Predictive Controller (MPC) that incorporates terrain elevation gradients learned using Gaussian processes (GP). This terrain-aware MPC generates safe navigation paths for the bipedal robots to traverse rough terrain while minimizing terrain slopes, and it directs the quadrotors to perform aerial search and mapping tasks. The rescue subjects' locations are estimated by a target belief GP, which is updated online during the map exploration. A high-level planner for task allocation is designed by encoding the navigation tasks using syntactically cosafe Linear Temporal Logic (scLTL), and a consensus-based algorithm is designed for task assignment of individual robots. We evaluate the efficacy of our planning framework in simulation in an uncertain environment with various terrains and random rescue subject placements.

Abstract:
This paper presents Shadow Program Inversion with Differentiable Planning (SPI-DP), a novel first-order optimizer capable of optimizing robot programs with respect to both high-level task objectives and motion-level constraints. To that end, we introduce Differentiable Gaussian Process Motion Planning for N-DoF Manipulators (dGPMP2-ND), a differentiable collision-free motion planner for serial N-DoF kinematics, and integrate it into an iterative, gradient-based optimization approach for generic, parameterized robot program representations. SPI-DP allows first-order optimization of planned trajectories and program parameters with respect to objectives such as cycle time or smoothness subject to e.g. collision constraints, while enabling humans to understand, modify or even certify the optimized programs. We provide a comprehensive evaluation on two practical household and industrial applications.

Abstract:
Aotearoa's apple industry struggles to maintain the skilled workforce required for fruitlet thinning each year. Skilled labourers play a pivotal role in managing crop loads by precisely thinning fruitlets to a desired number to achieve the desired spacing for high-quality apple growth. This complex task requires accurate mapping of the fruitlets along each branch. This paper presents a novel vision system capable of mapping the orientation and clustering information of apple fruitlets. Fruitlet pose estimation has been validated against data collected from a real-world commercial apple orchard. The results show an improved counting accuracy of 83.97% on prior implementations, an orientation estimate accuracy of 88.1%, and a clustering accuracy of 94.3%. Future work will utilise this information to determine which fruitlets to remove and then robotically thin them from the canopy.

Abstract:
This paper addresses the problem of category-level pose estimation for articulated objects in robotic manipulation tasks. Recent works have shown promising results in estimating part pose and size at the category level. However, these approaches primarily follow a complex multi-stage pipeline that first segments part instances in the point cloud and then estimates the Normalized Part Coordinate Space (NPCS) representation for 6D poses. These approaches suffer from high computational costs and low performance in real-time robotic tasks. To address these limitations, we propose YOEO, a single-stage method that simultaneously outputs instance segmentation and NPCS representations in an end-to-end manner. We use a unified network to generate point-wise semantic labels and centroid offsets, allowing points from the same part instance to vote for the same centroid. We further utilize a clustering algorithm to distinguish points based on their estimated centroid distances. Finally, we first separate the NPCS region of each instance. Then, we align the separated regions with the real point cloud to recover the final pose and size. Experimental results on the GAPart dataset demonstrate the pose estimation capabilities of our proposed single-shot method. We also deploy our synthetically-trained model in a real-world setting, providing real-time visual feedback at 200Hz, enabling a physical Kinova robot to interact with unseen articulated objects. This showcases the utility and effectiveness of our proposed method 2.

Abstract:
This paper presents a novel monitoring framework that infers the level of collision risk for autonomous vehicles (AV s) based on their object detection performance. The framework takes two sets of predictions from different algorithms and associates their inconsistencies with the collision risk via fuzzy inference. The first set of predictions is obtained by retrieving safety-critical 2.5D objects from a depth map, and the second set comes from the ordinary AV's 3D object detector. We experimentally validate that, based on Intersection-over-Union (IoU) and a depth discrepancy measure, the inconsis-tencies between the two sets of predictions strongly correlate to the error of the 3D object detector against ground truths. This correlation allows us to construct a fuzzy inference system and map the inconsistency measures to an AV collision risk indicator. In particular, we optimize the fuzzy inference system towards an existing offline metric that matches AV collision rates well. Lastly, we validate our monitor's capability to produce relevant risk estimates with the large-scale nuScenes dataset and demonstrate that it can safeguard an AV in closed-loop simulations.

Abstract:
We present Residual Descent Differential Dynamic Game (RD3G), a Newton-based solver for constrained multiagent game-control problems. The proposed solver seeks a local Nash equilibrium for games where agents are coupled through their rewards and state constraints. By maintaining a dynamic set of active constraints, combined with a barrier function on satisfied constraints and a backtracking line search, the proposed method is able to satisfy state constraints while keeping the dimension of the Newton descent direction problem to a minimum. We compare the proposed method against state-of-the-art techniques and showcase the computational benefits of the RD3G algorithm on several example problems. The RD3G is up to \mathbf4 X faster and has \mathbf2 X higher convergence rate than existing approaches in higher dimensional games.

Abstract:
Robot manipulation researchers reference human performance as a goal for their work, however, human data is seldom present in robotics benchmarks. We introduce a real-world benchmark targeting manipulation skills for performing electrical circuit inspection with a multimeter using an Internet-connected electronic task board. We present timing study results and an exemplary robot solution across six different tasks from the Robothon Grand Challenge at the automatica conference in 2023. Contributions from 16 robot teams were collected using task boards we manufactured and distributed as part of the 30-day international competition as an initial performance database. Our work systematically highlights the skill gap between the winning robot solution and the best human performance from a group of 30 subjects. Our goal is to chronicle progress over time in robot manipulation skills and provide a standardized, physical benchmark across the global community. Videos of the team submissions, the exemplary robot solution, as well as the project reproduction code are provided in the included repository.

Abstract:
A modern smart factory runs a manufacturing procedure using a collection of programmable machines. Typically, materials are ferried between these machines using a team of mobile robots. To embed a manufacturing procedure in a smart factory, a factory operator must a) assign its processes to the smart factory's machines and b) determine how agents should carry materials between machines. A good embedding maximizes the smart factory's throughput; the rate at which it outputs products. Existing smart factory management systems solve the aforementioned problems sequentially, limiting the throughput that they can achieve. In this paper we introduce ACES, the Anytime Cyclic Embedding Solver, the first solver which jointly optimizes the assignment of processes to machines and the assignment of paths to agents. We evaluate ACES and show that it can scale to real industrial scenarios.

Abstract:
Non-cooperative interactions commonly occur in multi-agent scenarios such as car racing, where an ego vehicle can choose to overtake the rival, or stay behind it until a safe overtaking “corridor” opens. While an expert human can do well at making such time-sensitive decisions, autonomous agents are incapable of rapidly reasoning about complex, potentially conflicting options, leading to suboptimal behaviors such as deadlocks. Recently, the nonlinear opinion dynamics (NOD) model has proven to exhibit fast opinion formation and avoidance of decision deadlocks. However, NOD modeling parameters are oftentimes assumed fixed, limiting their applicability in complex and dynamic environments. It remains an open challenge to determine such parameters automatically and adaptively, accounting for the ever-changing environment. In this work, we propose for the first time a learning-based and game-theoretic approach to synthesize a Neural NOD model from expert demonstrations, given as a dataset containing (possibly incomplete) state and action trajectories of interacting agents. We demonstrate Neural NOD's ability to make fast and deadlock-free decisions in a simulated autonomous racing example. We find that Neural NOD consistently outperforms the state-of-the-art data-driven inverse game baseline in terms of safety and overtaking performance.

Abstract:
Soft robotics are characterized by their high deformability, mechanical robustness, and inherent resistance to damage. These unique properties present exciting new opportunities to enhance both emerging and existing fields such as healthcare, manufacturing, and exploration. However, to function effectively in unstructured environments, these technologies must withstand the same real-world conditions to which human skin and other soft biological materials are typically subjected. Here, we present a novel soft material architecture designed for active detection of material damage and autonomous repair in soft robotic actuators. By integrating liquid metal (LM) microdroplets within a silicone elastomer, the system can detect and localize damage through the formation of conductive pathways that arise from extreme pressure (> 1 MPa) or puncture events. These newly formed conductive networks function as in situ Joule heating elements, facilitating the reprocessing and healing of the material. The architecture allows for the reconfiguration of the newly formed electrical network using controlled electrical and thermal mechanisms to restore functionality. The entire process from damage detection to repair and reconfiguration occurs without any manual intervention or external mechanisms to facilitate healing. This innovative approach not only enhances the resilience and performance of soft materials but also supports a wide range of applications in soft robotics and wearable technologies, where adaptive and autonomous systems are crucial for operation in dynamic and unpredictable environments.

Abstract:
Articulated object manipulation poses a unique challenge compared to rigid object manipulation as the object itself represents a dynamic environment. In this work, we present a novel RL-based pipeline equipped with variable impedance control and motion adaptation leveraging observation history for generalizable articulated object manipulation, focusing on smooth and dexterous motion during zero-shot sim-to-real transfer (Fig. 1). To mitigate the sim-to-real gap, our pipeline diminishes reliance on vision by not leveraging the vision data feature (RGBD/pointcloud) directly as policy input but rather extracting useful low-dimensional data first via off-the-shelf modules. Additionally, we experience less sim-to-real gap by inferring object motion and its intrinsic properties via observation history as well as utilizing impedance control both in the simulation and in the real world. Furthermore, we develop a well-designed training setting with great randomization and a specialized reward system (task-aware and motion-aware) that enables multi-staged, end-to-end manipulation without heuristic motion planning. To the best of our knowledge, our policy is the first to report 84% success rate in the real world via extensive experiments with various unseen objects. Webpage: https://watch-less-feel-more.github.io/

Abstract:
Manipulating arbitrary objects in unstructured environments is a significant challenge in robotics, primarily due to difficulties in determining an object's center of mass. This paper introduces U-GRAPH: Uncertainty-Guided Rotational Active Perception with Haptics, a novel framework to enhance the center of mass estimation using active perception. Traditional methods often rely on single interaction and are limited by the inherent inaccuracies of Force-Torque (F/T) sensors. Our approach circumvents these limitations by integrating a Bayesian Neural Network (BNN) to quantify uncertainty and guide the robotic system through multiple, information-rich interactions via grid search and a neural network that scores each action. We demonstrate the remarkable generalizability and transferability of our method with training on a small dataset with limited variation yet still perform well on unseen complex real-world objects.

Abstract:
This paper presents the Spinning Blimp, a novel lighter-than-air (LTA) aerial vehicle designed for low-energy stable flight. Using an oblate spheroid helium balloon for buoyancy, the vehicle achieves minimal energy consumption while maintaining prolonged airborne states. The unique and low-cost design employs a passively arranged wing coupled with a propeller to induce a spinning behavior, providing inherent pendulum-like stabilization. We propose a control strategy that takes advantage of the continuous revolving nature of the spinning blimp to control translational motion. The cost-effectiveness of the vehicle makes it highly suitable for a variety of applications, such as patrolling, localization, air and turbulence monitoring, and domestic surveillance. Experimental evaluations affirm the design's efficacy and underscore its potential as a versatile and economically viable solution for aerial applications.

Abstract:
Autonomous ships utilize automation systems to achieve unmanned navigation, driving innovation in maritime transportation. However, sea conditions, influenced by dynamic factors such as wave height, wind speed, and ocean currents, present a challenge in accurately assessing these conditions. Traditional classification models often assume accurate labels, but noisy labels are prevalent in real-world applications. Existing methods, such as noise sample filtering or loss function adjustment, have limited applicability and poor generalization when dealing with complex sea condition data. To address this issue, this study proposes an end-to-end neural network model. The model's feature extraction module uses deep representation learning to capture latent patterns in the data, and a loss function is designed to mitigate the impact of outliers. The integration of these components allows the model to perform accurate classification even in the presence of noisy labels. Extensive experiments on public and sea condition datasets validate the effectiveness of this approach, demonstrating that the model exhibits strong generalization capabilities and holds great promise for practical applications.

Abstract:
In this paper, we study the vehicle routing problem (VRP) for a fleet of cooperative autonomous agricultural robots (agribots) equipped with detachable implements, with the goal of efficiently and sustainably completing agricultural tasks in precision crop farming. State of the art in the area of agribot fleet routing with detachable implements is lacking. Consequently, we propose the Capacitated Agriculture Fleet Vehicle Routing Problem with Implements and Limited Autonomy (CAFVRPILA), designed to optimize the agribot fleet's routes across a set of given agricultural tasks while considering implement capacities, agribot-implement compatibilities, and agribots' limited battery autonomies. A heuristic two-phase decomposition approach is proposed for this problem. Simulation experiments show that minimizing travel distances and costs with CAFVRPILA enhances sustainable farming while maximizing productivity and resource use. The results also demonstrate that synchronizing multiple operations improves efficiency, particularly in larger fleets.

Abstract:
Transparent household objects present a challenge for domestic service robots, since neither regular cameras nor RGB-D cameras can provide accurate points for shape reconstruction. The new type of pretouch dual-modality distance and material sensor (PDM2) can provide reliable and accurate depth readings, but it is a point sensor and scanning the object exclusively with the sensor is too inefficient. Hence, we present a sensor fusion approach by combining a regular camera with the PDM2 sensor. The approach is based on a data fusion algorithm for shape reconstruction and an active perception algorithm for scan planning for the PDM2 sensor. The data fusion algorithm is a distributed Gaussian process (GP)-based shape reconstruction method that allows for incremental local update to reduce computational time. The active perception algorithm is an optimization-based approach by increasing the information gain (IG) and prioritizing the boundary points under a preset travel distance constraint. We have implemented and tested the algorithms with six different transparent household items. The results show satisfactory shape reconstruction results in all test cases with an average increase in intersection over union (IoU) from 0.73 to 0.96.

Abstract:
This paper uses the ellipsoidal parameters associated with volume moments of inertia of a bounded solid object to construct a motion sweep joining two poses of the solid object, in contrast to earlier works on motion interpolation in SE(3) without taking into account the shape of the moving object. The paper borrows the concept of shape-dependent object norms introduced by Kazerounian and Rastegar [1] and refined by Chirikjian and Zhou [2] to compute as a metric the average of the squared distances (or ASD) among all homologous points of the bounded body between two given poses and seeks to obtain an optimal interpolating motion that minimizes a combination of two ASD distances from each intermediate pose to the two given poses. It is found that the ASD minimizing motion sweep is a novel straight-line motion such that while the centroid of the object follows a straight line, the orientation of the object is constrained so that the ASD metric is minimized. Furthermore, the rotational component can be determined by polar decomposition of the linearly interpolated rotation matrices, scaled by the object's inertial parameters. As an illustration of one of its applications, this motion sweep is repeatedly applied using the de Casteljau algorithm to generate Bézier-like freeform motions, whose paths are in general dependent on the shape of the inertia ellipsoid.

Abstract:
Formation control is essential for swarm robotics, enabling coordinated behavior in complex environments. In this paper, we introduce a novel formation control system for an indoor blimp swarm using a specialized leader-follower approach enhanced with a dynamic leader-switching mechanism. This strategy allows any blimp to take on the leader role, distributing maneuvering demands across the swarm and enhancing overall formation stability. Only the leader blimp is manually controlled by a human operator, while follower blimps use onboard monocular cameras and a laser altimeter for relative position and altitude estimation. A leader-switching scheme is proposed to assist the human operator to maintain stability of the swarm, especially when a sharp turn is performed. Experimental results confirm that the leader-switching mechanism effectively maintains stable formations and adapts to dynamic indoor environments while assisting human operator.

Abstract:
Providing oral capsule robots with additional degrees of freedom (DOF), such as robotic arms, is crucial for enhancing their functionality within the body. However, a key challenge arises when using rotating magnetic fields to drive the motor within the robot, as the resulting torque causes the entire capsule to rotate. In this work, we propose a novel approach to actuate a 2 DOF parallel link robot arm integrated into a capsule robot, using external magnetic fields. Our method employs two identical magnetic motors we proposed in a previous study, each driven by an oscillating magnetic field, which alternates direction along a specific axis. By independently controlling the rotation of each motor through the same magnetic field, ensemble control is achieved. The symmetrically arranged motors exhibit different angular velocities, enabling dexterous movement of the robot arm. We further theoretically show that this approach significantly reduces the torque exerted on the robot compared to traditional approaches using rotating magnetic fields. Finally, we demonstrate the performance of the robot by moving its arms and the attached end-effector along a pre-defined trajectory.

Abstract:
This study explores the utility of various internet data sources to select among a set of template robot behaviors to perform skills. Learning contact-rich skills involving tool use from internet data sources has typically been challenging due to the lack of physical information such as contact existence, location, areas, and force in this data. Prior works have generally used internet data and foundation models trained on this data to generate low-level robot behavior. We hypothesize that these data and models may be better suited to selecting among a set of basic robot behaviors to perform these contact-rich skills. We explore three methods of template selection: querying large language models, comparing video of robot execution to retrieved human video using features from a pretrained video encoder common in prior work, and performing the same comparison using features from an optic flow encoder trained on internet data. Our results show that LLMs are surprisingly capable template selectors despite their lack of visual information, optical flow encoding significantly outperforms video encoders trained with an order of magnitude more data, and important synergies exist between various forms of internet data for template selection. By exploiting these synergies, we create a template selector using multiple forms of internet data that achieves a 79% success rate on a set of 16 different cooking skills involving tool-use.

Abstract:
Haptic devices are widely used as control interfaces for robotic teleoperation, offering intuitive rendering of interactions between remote robot and environment. In particular, cutaneous feedback devices provide intrinsic stability and reduced form factor compared to kinesthetic feedback interfaces. However, the implementation of cutaneous feedback devices in industrial settings must be rigorously validated to prevent potential equipment accidents, which could lead to substantial economic losses due to unskilled robot manipulation. This paper presents a novel ungrounded haptic control interface (POstick-VF), designed specifically for high-risk steel production tasks. POstick-VF offers visuo-tactile feedback within an extensive workspace, enabling intuitive robot manipulation through its kinematic similarity with real tools ensuring safety. The performance of the developed POstick is rigorously validated and compared with conventional joystick controller through experiments conducted with an on-site hydraulic robot.

Abstract:
In-orbit automated servicing is a promising path towards lowering the cost of satellite operations and reducing the amount of orbital debris. For this purpose, we present a pipeline for automated satellite docking port detection and state estimation using monocular vision data from standard RGB sensing or an event camera. Rather than taking snapshots of the environment, an event camera has independent pixels that asynchronously respond to light changes, offering advantages such as high dynamic range, low power consumption and latency. This work focuses on satellite-agnostic operations (only a geometric knowledge of the actual port is required) using the recently released Lockheed Martin Mission Augmentation Port (LM-MAP) as the target. By leveraging shallow data-driven techniques to preprocess the incoming data to highlight the LM-MAP's reflective navigational aids and then using basic geometric models for state estimation, we present a lightweight and data-efficient pipeline that can be used independently with either RGB or event cameras. We demonstrate the soundness of the pipeline and perform a quantitative comparison of the two modalities based on data collected with a photometrically accurate test bench that includes a robotic arm to simulate the target satellite's uncontrolled motion. The data has been made publicly available: https://uts-ri.githubio/rgb_event_docking_port/.

Abstract:
To achieve a desired grasping posture (including object position and orientation), multi-finger motions need to be conducted according to the the current touch state. Specifically, when subtle changes happen during correcting the object state, not only proprioception but also tactile information from the entire hand can be beneficial. However, switching motions with high-DOFs of multiple fingers and abundant tactile information is still challenging. In this study, we propose a loss function with constraints of touch states and an attention mechanism for focusing on important modalities depending on the touch states. The policy model is AE-LSTM which consists of Autoencoder (AE) which compresses abundant tactile information and Long Short-Term Memory (LSTM) which switches the motion depending on the touch states. Motion for cap-opening was chosen as a target task which consists of sub tasks of sliding an object and opening its cap. As a result, the proposed method achieved the best success rates with a variety of objects for real time cap-opening manipulation. Furthermore, we could confirm that the proposed model acquired the features of each subtask and attention on specific modalities.

Abstract:
This paper investigates the vulnerability of bilat-eral teleoperation systems to perfectly undetectable False Data Injection Attacks (FDIAs). Teleoperation, one of the major applications in robotics, involves a leader manipulator operated by a human and a follower manipulator at a remote site, connected via a communication channel. While this setup en-ables operation in challenging environments, it also introduces cybersecurity risks, particularly in the communication link. The paper focuses on a specific class of cyberattacks: perfectly un-detectable FDIAs, where attackers alter signals without leaving detectable traces at all. Compared to previous research on linear and first-order nonlinear systems, this paper examines bilateral teleoperation systems with second-order nonlinear manipulator dynamics. The paper derives mathematical conditions based on Lie Group theory that enable such attacks, demonstrating how an attacker can modify the follower manipulator's motion while the operator perceives normal operation through the leader device. This vulnerability challenges conventional detection methods based on observable changes and highlights the need for advanced security measures in teleoperation systems. To validate the theoretical results, the paper presents experimental demonstrations using a teleoperation system connecting robots in the US and Japan.

Abstract:
It is challenging to find optimum kinematic designs for non-standard robotic manipulators, e.g., medical, nuclear, and space manipulators, which are demanded to adapt to arbitrary complex tasks in constraints. Such design optimization can be modelled as a multi-dimensional non-convex optimization problem with nonlinear constrained conditions. However, it is non-trivial to ensure the essential reachability condition, i.e., the existence of continuous trajectories between demand positions for serial articulated manipulators, given complex spatial constraints, like obstacles and boundaries. Traditional solutions integrate standard motion planning or inverse kinematics algorithms within a kinematic-design optimization process, resulting in significant demand for time and computing resources. To accelerate design optimization at improved efficiency, we design a novel robust design framework built on a new kinematic design synthesis, which allows for simultaneously optimizing dimension and topology of a serial manipulator's kinematics for arbitrary tasks in constrained environments, using a generalised parametric kinematic model. Significantly, in contrast to standard solutions, we develop a novel computationally effective reachability verification method, which rapidly aborts infeasible motions by exploiting efficient collision checks, based on the Rapidly-exploring Random Tree (RRT) algorithm. The effectiveness of the proposed design framework is verified and evaluated by comparing to baseline benchmarks. Results demonstrate the novel design framework can accelerate kinematic design optimization by an order of magnitude compared to the current state-of-the-art, and optimise link dimension and joint type simultaneously of serial robots for cluttered environments.

Abstract:
In this paper, we propose an earthworm-inspired miniature pipeline robot capable of self-sensing odometry. The robot features a dielectric elastomer actuator as its elongation body and two specially designed passive anchors to achieve unidirectional motion without slipping. The odometry was achieved through the self-sensing scheme of DEAs and the summation of all step sizes over a period. The careful implementation of the self-sensing method resulted in a small sensing resolution of 0.05 mm at a high actuation frequency of 20 Hz for a cylindrical DEA. Finally, the robot obtained a self-sensing odometry in a pipe, showing good consistency with the ground truth. This work paves a new way for a miniature in-pipe robot to sense its own state without additional sensors to save space and power.

Abstract:
This paper investigates the mathematical modeling and basic motion properties of planar seven-link robots that forms passive closed and active open chains. The passive closed model is formed by connecting seven rigid frames via seven viscoelastic joints, and the active open model is formed by connecting them via actuated joints. The former is a convex heptagonal model and can exhibit passive-dynamic rolling on a gentle downhill, whereas the latter virtually forms a forward-leaning octagonal shape by controlling the six relative joint angles. In the first half of this paper, we describe the model assumptions and develop the mathematical equations of motion and collision of the passive closed model, and numerically analyze the motion characteristics by changing the slope angle while checking the conditions necessary for stable motion generation. In the second half, we outline the active open model, develop the PD control system, and numerically analyze the motion characteristics by changing the target angle parameter that controls the degree of forward lean of the virtual octagon.

Abstract:
The phenomenon of the “traveling wave,” commonly observed in various organisms, involves a wave that propagates along the body, serving as a locomotion mechanism. Particularly, in aquatic environments, organisms such as fish and cetaceans utilize traveling waves to propel themselves through water, minimizing fluid drag and maximizing movement efficiency. Inspired by nature, robotics has extensively explored replicating such locomotion strategies. This work presents a fish robot with an innovative magnetic transmission system. The mechanism transforms the unidirectional rotation of a single motor into an oscillatory, phase-shifted movement across the modules of the kinematic chain, generating a traveling wave along the body. The robot's design and functionality are detailed, highlighting advancements in bio-inspired robotics for underwater applications, such as efficient and non-invasive monitoring and exploration of marine ecosystems. The fish robot achieved a swimming speed of approximately 2 body lengths per second (BL/s) with a tail-beat frequency of 3.24 Hz and a minimum Cost of Transport (CoT) of 5.33 ~\mathrmJ /(\textkg \cdot \mathrmm). Biomimetic robotics can play a key role in sustainable aquafarming, biodiversity conservation, and animal-robot interaction research, offering the potential to minimize ecosystem disruption and advance marine science.

Abstract:
In this paper, as an approach to stabilize humanoid walking where the height of CoM varies, a Novel Model Predictive Control framework based on three dimensional Divergent Component of Motion (3D-DCM) is proposed. To ensure the feasible utilization of contact forces for maintaining humanoid balance, constraints on the control inputs, Virtual Repellent Point (VRP) and footstep adjustment, and their correlation are analytically formulated into a quadratic form, resulting a Quadratically Constrained Quadratic Programming. Additionally, to enable the humanoid robot to withstand disturbances over a broader range of strides or safely navigates various terrains without encountering knee stretch, the distance between the CoM and the foot is constrained in the 3D-CoM trajectory planner. The effectiveness of the proposed method is validated through simulations and real-robot experiments in scenarios involving external disturbances and step down.

Abstract:
Aerial robots, especially multirotor type, have been utilized in various scenarios such as inspection, surveillance, and logistics. The most critical issue for multirotor type is the limited flight time due to the large power consumption to hover against gravity. Inspired by nature, various research areas focus on the perching and grasping ability by deploying a gripper on the multirotor to grasp arboreal environments to save energy; however, most of the mechanical design for gripper restricts the approach path, significantly limiting the performance of perching and grasping. In addition, it is also challenging to design a light gripper that also offers sufficiently large grip force to hang itself. Therefore, in this work, we develop a single-actuator hand for aerial robot that enables adaptive grasping of various objects, and thus can perch from various approach directions. First, we present the design of the lightweight three-fingered hand with a pair of special two-dimensional differential plates that enables adaptive grasping with a single actuator. In addition, we develop a unique control method for the over-actuated aerial robot equipped with this hand to perform both adaptive pendulum-like perching and detachment. Finally, we demonstrate the feasibility of the prototype hand via load bearing and object grasping experiments, along with in-flight perching experiments.

Abstract:
In endoluminal surgeries, inserting a flexible endo-scope is one of the fundamental procedures. During this process, vision remains the primary feedback, while the perception of tactile magnitude and location is insufficient. This limitation can hinder the clinician's efficiency when navigating the endoscope through various segments of the natural lumens. To address this issue, we propose a fiber Bragg grating (FBG)-based tissue-compliant sensor cap with multi-mode sensing capabilities, including contact location identification at the terminal surface and the three-dimensional contact force perception at the tip. The soft sensor cap can be affixed to the standard endoscope tip, like a distal attachment cap, for easy installation. Utilizing the relative contact location information, operators can adjust the steerable segment of the endoscope when transitioning from one segment of a natural orifice to a narrower segment, which may be obstructed by constricted lumens. A finite element analysis simulation and the corresponding calibration process based on learning-based approaches have been carried out. The FBG-based sensor can perceive the tip contact force and identify the axial contact location with high precision, where the force perception error is less than 3%, and the contact location identification accuracy is 98.8%. The experimental results demonstrate the potential of the proposed sensing mechanism to be applied in surgeries requiring endoscope insertions.

Abstract:
Mobile manipulators — robots with a moving base and an arm for grasping objects — are becoming more common in human-populated environments, such as hospitals, warehouses, and even homes. Yet most mobile manipulators lack clear ways to communicate intent to human interlocutors in a continuous, socially acceptable, and easy-to-interpret way. One possible solution for improving mobile manipulator communication is the addition of expressive eyes. This paper presents the design and evaluation of a custom expressive LED eye module for mobile manipulators, which can display both gaze and emotional expressions. Our evaluation study (N=32) involved a mock teamwork task alongside a Hello Robot Stretch RE2 mobile manipulator with the custom LED eye module. The results showed that both gaze and emotional expressions supported better participant performance in the task and more feelings of social closeness. Emotional eye expressions also yielded higher ratings of robot social warmth and competence. This work can inform mobile manipulator design for smoother integration into human-populated spaces.

Abstract:
The prevailing and most effective approach to teleoperate a robotic arm involves a direct position-to-position mapping, imposing robotic end-effector movements that mirrors those of the user, Fig. 1-top. However, due to this one-to-one mapping, the robot's motions are limited by the user's capability, particularly in translation. Drawing inspiration from head pointers utilized in the 1980s, originally designed to enable drawing with limited head motions for tetraplegic individuals, we proposed a “virtual wand” mapping which could be used by participants with reduced mobility. This mapping employs a virtual rigid linkage between the hand and the robot's endeffector, Fig. 1-bottom. With this approach, rotations produce amplified translations through a lever arm, creating a “rotation-to-position” coupling and expanding the translation workspace at the expense of a reduced rotation space. In this study, we compare the virtual wand approach to the one-to-one position mapping through the realization of 6-DoF reaching tasks. Results indicate that the two different mappings perform comparably well, are equally well-received by users, and exhibit similar motor control behaviors. Nevertheless, the virtual wand mapping is anticipated to outperform in tasks characterized by large translations and minimal effector rotations, whereas direct mapping is expected to demonstrate advantages in large rotations with minimal translations. These results pave the way for new interactions and interfaces, particularly in disability assistance utilizing residual body movements (instead of hands) as control input. Leveraging body parts with substantial rotations could enable the accomplishment of tasks previously deemed infeasible with standard direct coupling interfaces.

Abstract:
This paper introduces a novel method for controlling multirotor aerial robots connected by passive flexible elements. Despite the growing popularity of multirotor aerial robots, their real-world applications remain limited due to difficulties adapting to complex environments. Soft robotics, due to its inherent flexibility, offers a potential solution, although research on integrating flexible elements into aerial robots is still in the early stages. In this study, we propose control methods for a system where multiple aerial robots are interconnected with passive flexible elements. These robotic systems enhance adaptability, enabling tasks like object manipulation. We model the flexible parts using the piecewise constant strain (PCS) model, which allows for model-based closed-loop control and stabilizes various configurations of the system. Through simulations and experiments, we validated that the proposed method achieves both stable flight and flexible deformation. Notably, we succeeded in maintaining stable flight, which traditional methods could not achieve, and demonstrated both positional controllability and the ability of the flexible parts to bend dynamically during flight.

Abstract:
Therapeutic robotic systems have emerged as reliable tools for physical rehabilitation, providing variableintensity movement assistance to patients with motor impairments. Robot-assisted rehabilitation facilitates the restoration mobility and dexterity, promotes functional neuroplasticity and potentially enables workforce reentry through training-induced cognitive and motor learning. To boost participant engagement and visuomotor coordination, we propose ArmGuider Pro, an advanced upper-limb training system that integrates hand-eye collaboration and gazetriggered assistance within rehabilitation-tailored serious games. The system implements intuitive eye-tracking and visualtriggering strategies to align therapeutic interventions with participants' intentional focus, incorporating immersive gaming elements and adaptive control algorithms. Experimental validation demonstrates significant activation in motor and cognitive cerebral cortex regions, enhanced visual attention concentration in desired target areas (25.92 % improvement), and improved trajectory adherence across sequential sessions (27.27 % improvement). By harnessing visual attention valence, our proposed system could encourage neuroplasticity, supporting its viability for clinical application and widespread adaption in rehabilitation regimens.

Abstract:
Accurate Six Degrees of Freedom (6-DoF) pose estimation of Ship-To-Shore (STS) quay crane spreaders is crucial for ensuring safe and efficient container handling in port automation. However, existing pose estimation techniques face significant challenges, as camera-based systems either rely on markers, which are prone to damage, or struggle with depth estimation inaccuracies. Additionally, 3D sensor-based approaches, particularly point cloud registration (PCR), face challenges such as initial pose errors, high-latency inference, and difficulties in object identification based purely on geometric features. To address these limitations, we propose LCSPose, a LiDAR-camera fusion-based 6-DoF pose estimation method that is marker-free, accurate, efficient, and scalable. Our approach integrates three key modules: (1) a semantic-geometric segmentation module for spreader segmentation and outlier removal, (2) a spatial consistency template sampling module based on Spatial Consistency Score (SC-Score) for reliable template selection across varying distances, and (3) a multi-view coarse-to-fine pose refinement module which incorporates multi-view PCA alignment for robust initial posture prior estimation and iterative pose refinement strategy for long-range registration. Our method demonstrates a 60% improvement in registration recall over state-of-the-art (SOTA) PCR methods, achieving up to 6 cm in translation error and 0.19 degrees in rotation error, while maintaining real-time processing at 20Hz.

Abstract:
In this work, a control scheme for human-robot collaborative object transportation is proposed, considering a quadruped robot equipped with the MIGHTY suction cup that serves both as a gripper for holding the object and a force/torque sensor. The proposed control scheme is based on the notion of admittance control, and incorporates a variable damping term aiming towards increasing the controllability of the human and, at the same time, decreasing her/his effort. Furthermore, to ensure that the object is not detached from the suction cup during the collaboration, an additional control signal is proposed, which is based on a barrier artificial potential. The proposed control scheme is proven to be passive and its performance is demonstrated through experimental evaluations conducted using the Unitree Go1 robot equipped with the MIGHTY suction cup.

Abstract:
Mobility impairments, particularly those caused by stroke-induced hemiparesis, significantly impact independence and quality of life. Current smart walker controllers operate by using input forces from the user to control linear motion and input torques to dictate rotational movement; however, because they predominantly rely on user-applied torque exerted on the device handle as an indicator of user intent to turn, they fail to adequately accommodate users with unilateral upper limb impairments. This leads to increased physical strain and cognitive load. This paper introduces a novel smart walker equipped with a fuzzy control algorithm that leverages shoulder abduction angles to intuitively interpret user intentions using just one functional hand. By integrating a force sensor and stereo camera, the system enhances walker responsiveness and usability. Experimental evaluations with five participants showed that the fuzzy controller outperformed the traditional admittance controller, reducing wrist torque while using the right hand to operate the walker by 12.65 % for left turns, 80.36 % for straight paths, and 81.16 % for right turns. Additionally, average user comfort ratings on a Likert scale increased from 1 to 4. Results confirmed a strong correlation between shoulder abduction angles and directional intent, with users reporting decreased effort and enhanced ease of use. This study contributes to assistive robotics by providing an adaptable control mechanism for smart walkers, suggesting a pathway towards enhancing mobility and independence for individuals with mobility impairments. Project page: https://tbs-ualberta.github.io/fuzzy-sw/

Abstract:
Deep neural network (DNN) based perception models are indispensable in the development of autonomous vehicles (AVs). However, their reliance on large-scale, high-quality data is broadly recognized as a burdensome necessity due to the substantial cost of data acquisition and labeling. Further, the issue is not a one-time concern as AVs might need a new dataset if they are to be deployed to another region (real-target domain) that the in-hand dataset within the real-source domain cannot incorporate. To mitigate this burden, we propose leveraging synthetic environments as an auxiliary domain where the characteristics of real domains are reproduced. This approach could enable indirect experience about the real-target domain in a time- and cost-effective manner. As a practical demonstration of our methodology, nuScenes and South Korea are employed to represent real-source and real-target domains, respectively. That means we construct digital twins for several regions of South Korea, and the data-acquisition framework of nuScenes is reproduced. Blending the aforementioned components within a simulator allows us to obtain a synthetic-fusion domain in which we forge our novel driving dataset, MORDA: Mixture Of Real-domain characteristics for synthetic-data-assisted Domain Adaptation. To verify the value of synthetic features that MORDA provides in learning about driving environments of South Korea, 2D/3D detectors are trained solely on a combination of nuScenes and MORDA. Afterward, their performance is evaluated on the unforeseen real-world dataset (AI-Hub 11This research (paper) used datasets from High-precision data collection vehicle daytime city road data. All data information can be accessed through AI-Hub (http://www.aihub.or.kr).) collected in South Korea. Our experiments present that MORDA can significantly improve mean Average Precision (mAP) on AI-Hub dataset while that on nuScenes is retained or slightly enhanced. Details on MORDA can be accessed at https://morda-e8d07e.gitlab.io.

Abstract:
As over 11,000 people turn 65 each day in the U.S., our country, like many others, is facing growing challenges in caring for elderly persons, further exacerbated by a major shortfall of care workers. To address this, we introduce an elder-care robot (E-BAR) capable of lifting a human body, assisting with postural changes/ambulation, and catching a user during a fall, all without the use of any wearable device or harness. Our robot is the first to integrate these 3 tasks, and is capable of lifting the full weight of a human outside of the robot's base of support (across gaps and obstacles). In developing E-BAR, we interviewed nurses and care professionals and conducted userexperience tests with elderly persons. Based on their functional requirements, the design parameters were optimized using a computational model and trade-off analysis. We developed a novel 18-bar linkage to lift a person from a floor to a standing position along a natural trajectory, while providing maximal mechanical advantage at key points. An omnidirectional, nonholonomic drive base, in which the wheels could be oriented to passively maximize floor grip, enabled the robot to resist lateral forces without active compensation. With a minimum width of 38 cm, the robot's small footprint allowed it to navigate the typical home environment. Four airbags were used to catch and stabilize a user during a fall in \leq \mathbf2 5 0 ~ m s. We demonstrate E-BAR's utility in multiple typical home scenarios, including getting into/out of a bathtub, bending to reach for objects, sit-to-stand transitions, and ambulation.

Abstract:
Aiming at developing a safe, intuitive, and collaborative steerable drilling robotic system for pedicle screw fixation procedures, in this paper, we leverage our recently developed steerable drilling robotic framework, and developed a collaborative drilling mode to control this system. In this control mode, first a user positions a concentric tube steerable drilling robot (CT-SDR) in the workspace and aligns it based on a preplanned trajectory. Next, the CT-SDR is directly controlled by the user through an admittance mode to perform a drilling procedure and creating a J-shape tunnel. To evaluate the user comfort and intuitiveness of the drilling procedure using this system and the proposed control interface, we performed a user study with 11 subjects, who had no prior experience in using this system. The results of this study were analyzed using various qualitative and quantitative metrics.

Abstract:
For various teleoperation tasks, position-based control is not practical. An impedance-based control is superior e.g. for handling fragile objects, like harvesting fruits or grasping a paper cup. However, only a few researchers have focused on impedance control for teleoperation. In tele-impedance, the stiffness of a human is measured and transferred to a controller of a robot. Until now, human stiffness was mostly measured either for specific joints or in 2D Cartesian space. We introduce a new way of measuring Cartesian stiffness in 3D using electromyography. Users were asked to perform a peg-in-hole task in three different orientations (0°, 45°, 90°). Meanwhile, electromyography measurements at shoulder and elbow muscle groups are performed. In a proof-of-concept study, we showed that the measured stiffness matrix in Cartesian space differed significantly across the three differently oriented peg-in-hole scenarios. This demonstrates that human stiffness could be predicted in 3D Cartesian space based on the type of task at hand.